kurtmckee / feedparser

Parse feeds in Python
https://feedparser.readthedocs.io

Not Well Formed RSS Feed #101

Open MrTyton opened 7 years ago

MrTyton commented 7 years ago

Hi,

I've been getting a bozo exception like the following:

    In [5]: feedparser.parse("owlturd.com/rss")
    Out[5]:
    {'bozo': 1,
     'bozo_exception': xml.sax._exceptions.SAXParseException('not well-formed (invalid token)'),
     'encoding': u'utf-8',
     'entries': [],
     'feed': {},
     'namespaces': {},
     'version': u''}

A manual inspection of the feed says it's encoded in UTF-8, and from what I've read, the exception means the feed is badly encoded or otherwise malformed. As it stands, though, I can't tell the parser to ignore the UTF-8 errors and parse what it can. I'd have to download the feed, clean it up manually, and pass the result to the parser function, which seems inefficient. Is there another way around this, or some way to pass more arguments to feedparser.parse()? I can't find any documentation for that.
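For reference, the manual route described above (download, clean up, then parse) could look roughly like the sketch below. `clean_feed_bytes` is a hypothetical helper, not part of feedparser, and it assumes the damage is limited to invalid UTF-8 bytes or control characters that XML 1.0 forbids. Other causes of "not well-formed" errors would need a different fix.

```python
import re

# Control characters that XML 1.0 does not allow in a document
# (tab, newline and carriage return are allowed and left alone).
_ILLEGAL_XML_CHARS = re.compile("[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_feed_bytes(raw: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical helper: decode leniently and drop illegal control characters."""
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    text = raw.decode(encoding, errors="replace")
    return _ILLEGAL_XML_CHARS.sub("", text)
```

The cleaned string can then be handed to `feedparser.parse()`, which accepts feed content as a string as well as a URL.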

evdoks commented 6 years ago

Use http://owlturd.com/rss instead of owlturd.com/rss
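Some background on why the scheme matters, as far as I understand feedparser's behaviour: `feedparser.parse()` accepts a URL, a local file name, or a string of feed data. Without `http://`, the argument is not recognised as a URL, so the literal text ends up being parsed as if it were the feed, and that text is not well-formed XML. A minimal sketch of a guard follows. `ensure_scheme` is a hypothetical helper, and it assumes the input is always meant to be a web address, never a local path.

```python
from urllib.parse import urlsplit


def ensure_scheme(url: str, default: str = "http") -> str:
    """Hypothetical helper: prepend a scheme when the URL has no http(s) scheme."""
    if urlsplit(url).scheme in ("http", "https"):
        return url
    # Assumption: anything without an http(s) scheme is a bare host/path.
    return f"{default}://{url}"
```

Calling `feedparser.parse(ensure_scheme("owlturd.com/rss"))` would then request `http://owlturd.com/rss`.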

deepakmishra commented 6 years ago

I'm facing the same problem. Can you help me parse this feed? It works fine with feedparser==4.1 but not with the latest version. https://www.prabhasakshi.com/feed.aspx?cat_id=14

buhtz commented 6 years ago

The problem lies in the feed itself. Please contact the feed author or generator. This issue is not related to feedparser and can be closed.

deepakmishra commented 6 years ago

Yes, you can close it. I found a workaround.

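    import feedparser
    import requests

    p = feedparser.parse(url)
    # Fall back only when nothing was parsed and the error is a well-formedness one.
    if not p['entries'] and "not well-formed" in str(p.get('bozo_exception', '')):
        # Decode the raw bytes as UTF-16, then replace every "utf-16" in the text
        # (including any encoding named in the XML declaration) before re-parsing.
        rss1 = requests.get(url).content.decode("utf-16")
        rss1 = rss1.replace("utf-16", "unicode")
        p = feedparser.parse(rss1)
    entries = p['entries']

buhtz commented 6 years ago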

Technically, your approach is a solution. But some food for thought: we shouldn't write newsfeed clients that accept corrupt data that doesn't conform to a standard. Our clients should motivate the user to contact the feed owner.

buhtz commented 5 years ago

Please close.