reader treats all bozo feeds as errors

reader treats all bozo feeds as errors, even if the loose parser managed to parse them:

<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>title</title>
  <updated>2021-12-18T11:00:00</updated>
  <id>http://example.com/</id>
  <entry>
    <id>http://example.com/entry</id>
    <updated>2021-07-29T00:00:00</updated>
    <content type="html">
        &#39; &amp; &gt; &ldquo; &lt; &quot; &rdquo; &rsquo;
    </content>
  </entry>
</feed>

Parsed with feedparser, it yields:

{
    'bozo': 1,
    'bozo_exception': SAXParseException('undefined entity'),
    'encoding': 'utf-8',
    'entries': [
        {
            'content': [
                {
                    'base': '',
                    'language': None,
                    'type': 'text/html',
                    'value': '\' & > “ < " ” ’',
                }
            ],
            'id': 'http://example.com/entry',
            'summary': '\' & > “ < " ” ’',
            ...
        }
    ],
    'feed': {
        'id': 'http://example.com/',
        'title': 'title',
        ...
    },
    'headers': {},
    'namespaces': {'': 'http://www.w3.org/2005/Atom'},
    'version': 'atom10',
}

We still need a heuristic to tell that apart from complete garbage (version, and the presence of entries?):

>>> feedparser.parse("garbage")
{'bozo': 1, 'entries': [], 'feed': {}, 'headers': {}, 'encoding': 'utf-8', 'version': '', 'bozo_exception': SAXParseException('syntax error'), 'namespaces': {}}
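
A minimal sketch of the heuristic hinted at above, assuming that a recognized version plus at least one entry is enough to call a bozo result usable. The function name `looks_salvageable` is hypothetical and not part of reader or feedparser; it only inspects the result mapping, so plain dicts stand in for real feedparser output here.

```python
def looks_salvageable(result):
    """Guess whether a feedparser-style result still carries usable data.

    Assumption (not confirmed by the issue): a non-empty 'version' and a
    non-empty 'entries' list mean the loose parser recovered a real feed.
    """
    if not result.get('bozo'):
        # The strict parser succeeded; there is nothing to second-guess.
        return True
    return bool(result.get('version')) and bool(result.get('entries'))


# Trimmed-down versions of the two results shown above.
recovered = {
    'bozo': 1,
    'version': 'atom10',
    'entries': [{'id': 'http://example.com/entry'}],
}
garbage = {'bozo': 1, 'version': '', 'entries': []}

print(looks_salvageable(recovered))
print(looks_salvageable(garbage))
```

Requiring entries may be too strict: a well-formed feed that happens to have no entries but still trips the bozo bit would be rejected by this check.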

Some conclusions from playing with the Atom feed below:

- xml.sax.SAXParseException "undefined entity" is survivable.
- "mismatched tag" is not; we get all the good entries, and then the broken entry, in a bad state (e.g. all content in ); entries after it are missing, but not always.
- It may be worth finding what other kinds of errors can be encountered... (all of them: https://github.com/libexpat/libexpat/blob/81b89678e200820271b72cacdd45fb5868855765/expat/lib/xmlparse.c#L2349).

Also, when the loose parser is used, the feed should be considered stale; that is, we should always prefer entries from the non-broken feed.

I'm thinking of something like this:

| existing | parsed | desired behavior  | current behavior              |
|----------|--------|-------------------|-------------------------------|
| none     | any    | use new (any)     | yes                           |
| any      | strict | use new (strict)  | yes (hash takes care of it)   |
| strict   | loose  | keep old (strict) | no (different hash => update) |
| loose    | loose  | use new (loose)   | yes (hash takes care of it)   |

This would favor feeds that are temporarily broken, and eventually get fixed.
For feeds that become permanently broken, it results in old strict entries not receiving updates.

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>one</id>
    <title>1</title>
    <summary>i</summary>
  </entry>
  <entry>
    <id>two</id>
    <title>Atom-Powered Robots Run Amok
    <summary>Summary.&veryundefinedentity;
    <content>Content.</content>
  </entry>
  <entry>
    <id>three</id>
    <title>3</title>
    <summary>iii</summary>
  </entry>
</feed>
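
The "desired behavior" column of the table above could be sketched roughly as follows. The function name `pick_entry` and the `'strict'` / `'loose'` / `None` labels are made up for illustration; this is not reader's actual update logic, and it leaves out the hash comparison mentioned in the "current behavior" column.

```python
def pick_entry(existing, parsed):
    """Decide which copy of an entry to keep.

    existing: None if nothing is stored yet, else 'strict' or 'loose',
              depending on which parser produced the stored entry.
    parsed:   'strict' or 'loose', for the freshly retrieved entry.
    Returns 'new' or 'old'.
    """
    if existing is None:
        return 'new'  # nothing stored: take whatever we got
    if parsed == 'strict':
        return 'new'  # a clean parse always wins
    if existing == 'strict':
        return 'old'  # do not overwrite a clean entry with a loose one
    return 'new'      # loose over loose: take the more recent attempt


for existing, parsed in [(None, 'loose'), ('loose', 'strict'),
                         ('strict', 'loose'), ('loose', 'loose')]:
    print(existing, parsed, '->', pick_entry(existing, parsed))
```

Only the strict-then-loose row differs from what the issue reports as current behavior, so in practice the change might amount to one extra check before the existing hash comparison.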

lemon24 / reader

reader treats all bozo feeds as errors #270