lemon24 / reader

A Python feed reader library.
https://reader.readthedocs.io
BSD 3-Clause "New" or "Revised" License

reader treats all bozo feeds as errors #270

Open lemon24 opened 2 years ago

lemon24 commented 2 years ago

reader treats all bozo feeds as errors, even if the loose parser managed to parse them:

<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>title</title>
  <updated>2021-12-18T11:00:00</updated>
  <id>http://example.com/</id>
  <entry>
    <id>http://example.com/entry</id>
    <updated>2021-07-29T00:00:00</updated>
    <content type="html">
        &#39; &amp; &gt; &ldquo; &lt; &quot; &rdquo; &rsquo;
    </content>
  </entry>
</feed>

{
    'bozo': 1,
    'bozo_exception': SAXParseException('undefined entity'),
    'encoding': 'utf-8',
    'entries': [
        {
            'content': [
                {
                    'base': '',
                    'language': None,
                    'type': 'text/html',
                    'value': '\' & > “ < " ” ’',
                }
            ],
            'id': 'http://example.com/entry',
            'summary': '\' & > “ < " ” ’',
            ...
        }
    ],
    'feed': {
        'id': 'http://example.com/',
        'title': 'title',
        ...
    },
    'headers': {},
    'namespaces': {'': 'http://www.w3.org/2005/Atom'},
    'version': 'atom10',
}

We still need a heuristic to tell a salvageable feed like this apart from complete garbage (a non-empty version, and the presence of entries?):

>>> feedparser.parse("garbage")
{'bozo': 1, 'entries': [], 'feed': {}, 'headers': {}, 'encoding': 'utf-8', 'version': '', 'bozo_exception': SAXParseException('syntax error'), 'namespaces': {}}
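
One way the "version, and the presence of entries" idea could be sketched is below. This is only an illustration of the heuristic floated above, not what reader actually does; `is_salvageable` is a hypothetical name, and the two sample dicts are trimmed-down copies of the results shown in this comment rather than real feedparser output objects.

```python
from typing import Any, Mapping


def is_salvageable(result: Mapping[str, Any]) -> bool:
    """Hypothetical check: did the loose parser recover a usable feed?"""
    if not result.get("bozo"):
        # Not flagged as bozo at all, so there is nothing to second-guess.
        return True
    # Bozo, but a detected format plus at least one entry suggests real data.
    return bool(result.get("version")) and bool(result.get("entries"))


# Trimmed-down versions of the two results shown above.
atom_result = {
    "bozo": 1,
    "version": "atom10",
    "entries": [{"id": "http://example.com/entry"}],
    "feed": {"id": "http://example.com/", "title": "title"},
}
garbage_result = {"bozo": 1, "version": "", "entries": [], "feed": {}}

print(is_salvageable(atom_result))     # True
print(is_salvageable(garbage_result))  # False
```

Note that under this sketch a bozo feed with a detected version but zero entries would still be treated as an error, which may or may not be the desired behavior.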

lemon24 commented 2 years ago

Some conclusions from playing with the Atom feed below: