kurtmckee / feedparser

Parse feeds in Python
https://feedparser.readthedocs.io

AssertionError: unknown status keyword 'dsgvo_service_control' in marked section #468

Open snarfed opened 3 months ago

snarfed commented 3 months ago

Hi! First off, huge thanks for maintaining feedparser. It's legendary! We're all lucky to have it.

I hit a new (to me) AssertionError today when parsing the RSS at https://snrk.de/feed/ . Here's the relevant RSS snippet:

<content:encoded><![CDATA[
  ...
  <p><strong>If you don&#8217;t like that, don&#8217;t use snrk.de!</strong><![dsgvo_service_control]></p>
  ...
]]></content:encoded>

...and here's the assert:

>>> feedparser.parse(rss)
Traceback (most recent call last):
  File ".../site-packages/feedparser/api.py", line 263, in parse
    saxparser.parse(source)
  File ".../python3.11/xml/sax/expatreader.py", line 111, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File ".../python3.11/xml/sax/xmlreader.py", line 125, in parse
    self.feed(buffer)
  File ".../python3.11/xml/sax/expatreader.py", line 217, in feed
    self._parser.Parse(data, isFinal)
  File "/private/tmp/pythonA3.11-20240402-4978-3ygh5v/Python-3.11.9/Modules/pyexpat.c", line 477, in EndElement
  File ".../python3.11/xml/sax/expatreader.py", line 395, in end_element_ns
    self._cont_handler.endElementNS(pair, None)
  File ".../site-packages/feedparser/parsers/strict.py", line 124, in endElementNS
    self.unknown_endtag(localname)
  File ".../site-packages/feedparser/mixin.py", line 321, in unknown_endtag
    method()
  File ".../site-packages/feedparser/namespaces/_base.py", line 488, in _end_content
    value = self.pop_content('content')
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../site-packages/feedparser/mixin.py", line 629, in pop_content
    value = self.pop(tag)
            ^^^^^^^^^^^^^
  File ".../site-packages/feedparser/mixin.py", line 548, in pop
    output = _sanitize_html(output, self.encoding, self.contentparams.get('type', 'text/html'))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../site-packages/feedparser/sanitizer.py", line 883, in _sanitize_html
    p.feed(html_source)
  File ".../site-packages/feedparser/html.py", line 156, in feed
    super(_BaseHTMLProcessor, self).feed(data)
  File ".../site-packages/sgmllib.py", line 98, in feed
    self.goahead(0)
  File ".../site-packages/sgmllib.py", line 168, in goahead
    k = self.parse_declaration(i)
        ^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../site-packages/feedparser/html.py", line 351, in parse_declaration
    return sgmllib.SGMLParser.parse_declaration(self, i)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../python3.11/_markupbase.py", line 91, in parse_declaration
    return self.parse_marked_section(i)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../python3.11/_markupbase.py", line 154, in parse_marked_section
    raise AssertionError(
AssertionError: unknown status keyword 'dsgvo_service_control' in marked section

Is this expected? Should I catch AssertionError everywhere I use feedparser? Any other thoughts?

feedparser 6.0.11, Python 3.11.9. Maybe related to #378...but not exactly the same. Thanks in advance!

kurtmckee commented 3 months ago

Thanks for the kind words! This is definitely unexpected, and I'll take a look at this. For now, it may be necessary to catch AssertionError. :disappointed:
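As a stopgap along those lines, here is a minimal sketch of catching the error around a parse call. `parse_tolerantly` and `_fake_parse` are hypothetical names made up for illustration; `_fake_parse` only stands in for `feedparser.parse` so the snippet runs without feedparser installed.

```python
from typing import Any, Callable, Optional, Tuple


def parse_tolerantly(
    parse: Callable[[Any], Any], source: Any
) -> Tuple[Optional[Any], Optional[AssertionError]]:
    """Call parse(source) and return (result, None) or (None, error)."""
    try:
        return parse(source), None
    except AssertionError as exc:
        # The traceback above ends in an explicit `raise AssertionError(...)`
        # inside the stdlib's _markupbase, so it surfaces as a normal exception.
        return None, exc


# Hypothetical stand-in for feedparser.parse (an assumption, not the real API
# behaviour): it fails on any "<![" to mimic the reported error.
def _fake_parse(source: str) -> dict:
    if "<![" in source:
        raise AssertionError("unknown status keyword in marked section")
    return {"entries": [], "source_length": len(source)}


ok_result, ok_error = parse_tolerantly(_fake_parse, "<rss></rss>")
bad_result, bad_error = parse_tolerantly(
    _fake_parse, "<p><![dsgvo_service_control]></p>"
)
```

In real use the call would presumably look like `parse_tolerantly(feedparser.parse, rss)`. Returning the exception instead of swallowing it lets the caller log which feed failed.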

PaulKalbitzer commented 2 months ago

We were able to trigger a similar assertion.

"unknown status keyword 'n' in marked section"

We were able to narrow down the cause of the problem to the following segment in our input.

<description >XC#&lt;![n%</description>

We think the trigger is the character sequence <![, either literal or in an escaped form such as &lt;![ or &#60;![, which decodes to <![.

The problem seems to lie in the parsing of marked sections: the traceback shows that 'parse_marked_section' is called even though the input is not a marked section.
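To illustrate the point about escaped forms, here is a rough sketch that decodes entities and lists any `<![` keyword that would probably not be accepted. The helper name, the regular expression, and the keyword list are assumptions for illustration. The list reflects what CPython's `_markupbase` appears to accept, so check it against your Python version.

```python
import html
import re
from typing import List

# Assumed set of keywords accepted after "<![" (not taken from this thread).
KNOWN_KEYWORDS = {
    "temp", "cdata", "ignore", "include", "rcdata",  # closed by "]]>"
    "if", "else", "endif",                           # MS Office style, closed by "]>"
}

# Approximation of how a declaration name is scanned after "<![".
_MARKED_SECTION = re.compile(r"<!\[([a-zA-Z][-_.a-zA-Z0-9]*)")


def unknown_marked_section_keywords(text: str) -> List[str]:
    """Return keywords following '<![' that are not in KNOWN_KEYWORDS."""
    decoded = html.unescape(text)  # turns &lt;![ and &#60;![ into <![
    return [
        match.group(1)
        for match in _MARKED_SECTION.finditer(decoded)
        if match.group(1).lower() not in KNOWN_KEYWORDS
    ]


from_description = unknown_marked_section_keywords("XC#&lt;![n%")
from_content = unknown_marked_section_keywords("<p>text<![dsgvo_service_control]></p>")
from_cdata = unknown_marked_section_keywords("<![CDATA[ok]]>")
```

A check like this could flag suspicious input before parsing, but it is only a heuristic and does not replace a fix in the parser itself.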