kurtmckee / feedparser

Parse feeds in Python
https://feedparser.readthedocs.io

Failed to parse description field with escaped CDATA. #440

Open cdhigh opened 7 months ago

cdhigh commented 7 months ago

Bug description: As of the current version (2024-04-12), feedparser fails to extract the content of a description field that contains escaped CDATA. I have simplified the issue to a minimal reproducible test case (source RSS link).

<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <title>xueqiu</title>
    <link>http://xueqiu.com/hots/topic</link>
    <description>xiuqiu</description>
    <item>
      <title>title</title>
      <link>http://xueqiu.com/1630191122/288006046</link>
      <description>&lt;![CDATA[some text]]&gt;</description>
      <pubDate>Sat, 27 Apr 2024 08:26:02 GMT</pubDate>
      <guid>http://xueqiu.com/1630191122/288006046</guid>
      <dc:creator>name</dc:creator>
      <dc:date>2024-04-27T08:26:02Z</dc:date>
    </item>
  </channel>
</rss>

Expected: feed.entries[0].description == 'some text', but the actual result is an empty string. If &lt;![CDATA[some text]]&gt; is changed to <![CDATA[some text]]>, it works fine.
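
To illustrate the difference between the two forms, here is a small sketch that uses only the standard library, not feedparser itself. With the escaped form, an XML parser hands over the literal string <![CDATA[some text]]> as the element text, while with a real CDATA section it hands over just some text. The helper strip_literal_cdata is a hypothetical name of my own, not a feedparser API. It is only a possible workaround, assuming the leftover wrapper is the sole problem.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical helper (not part of feedparser): if a string consists solely of
# a literal CDATA wrapper, return the wrapped content; otherwise return it as-is.
_CDATA_WRAPPER = re.compile(r"^\s*<!\[CDATA\[(.*?)\]\]>\s*$", re.DOTALL)


def strip_literal_cdata(text):
    match = _CDATA_WRAPPER.match(text or "")
    return match.group(1) if match else text


# The two variants of the <description> element from the test case above.
ESCAPED = "<item><description>&lt;![CDATA[some text]]&gt;</description></item>"
REAL = "<item><description><![CDATA[some text]]></description></item>"

escaped_text = ET.fromstring(ESCAPED).findtext("description")
real_text = ET.fromstring(REAL).findtext("description")

print(repr(escaped_text))                       # '<![CDATA[some text]]>'
print(repr(real_text))                          # 'some text'
print(repr(strip_literal_cdata(escaped_text)))  # 'some text'
```

The same helper could be applied to the raw feed text before parsing, or to whatever value a parser returns, but whether that is appropriate depends on how the feed is produced.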

lucasjinreal commented 2 months ago

Same issue?

smmorneau commented 2 weeks ago

I'm having the same issue with this feed.