kurtmckee / feedparser

Parse feeds in Python
https://feedparser.readthedocs.io

Parser improperly following links within a CDATA #148

Open Dizzy611 opened 6 years ago

Dizzy611 commented 6 years ago

Hi! I'm using feedparser to great effect in a project that parses various RSS feeds from a site I help administer and posts them to the site's Discord server. When parsing the feed linked here: https://www.digitalmzx.net/forums/index.php?app=core&module=global&section=rss&type=tracker&id=9 feedparser seems to be following the links given instead of just parsing them. That is to say, all the links end up like this:

>>> import feedparser
>>> lolz = feedparser.parse("https://www.digitalmzx.net/forums/index.php?app=core&module=global&section=rss&type=tracker&id=9")
>>> lolz.entries[0]
>>> lolz.entries[0].link
'https://www.digitalmzx.net/forums/index.php?s=ce2eb486552d7034c0a723583b40dd5b&app=tracker&showissue=726'

whereas the actual link given in the RSS feed lacks the session token (the "?s=numberandletters" bit):

<link><![CDATA[https://www.digitalmzx.net/forums/index.php?app=tracker&showissue=726]]></link>

This appears to be because the RSS feed contains the link inside a CDATA tag. From my admittedly limited understanding of CDATA, this seems to be somewhat backwards from the intent of CDATA, which is that things within a CDATA tag should not be parsed.

The reason I believe this to be due to the CDATA tag is that it does not happen when parsing a feed where the links are given bare, even though that feed comes from the same forum software, which also adds session tokens via forwarding. For example, when parsing this feed: https://www.digitalmzx.net/forums/index.php?app=core&module=global&section=rss&type=forums&id=1 which has links that look like this: <link>https://www.digitalmzx.net/forums/index.php?showtopic=15839</link> the issue does not occur.

For the moment, I am relying on a regex in my code that turns ?s=letterandnumbers&app= into ?app= to work around this issue, as there seems to be no way to get feedparser to give me the links as they are given in the RSS feed.
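For anyone needing the same workaround, here is a minimal sketch of that kind of regex. It is not the regex from my project: the pattern, the `SESSION_PARAM` name, and the `strip_session_token` helper are illustrative, and it assumes the token is a query parameter named `s` made of lowercase hex digits, as in the example link above.

```python
import re

# Assumption: the injected session token is a query parameter named "s"
# holding lowercase hex digits and followed by "&", as in the example link.
# The lookbehind makes sure "s=" starts a parameter (directly after ? or &).
SESSION_PARAM = re.compile(r"(?<=[?&])s=[0-9a-f]+&")


def strip_session_token(url: str) -> str:
    """Remove one injected 's=<hex>&' parameter; leave everything else as-is."""
    return SESSION_PARAM.sub("", url, count=1)


link = (
    "https://www.digitalmzx.net/forums/index.php"
    "?s=ce2eb486552d7034c0a723583b40dd5b&app=tracker&showissue=726"
)
print(strip_session_token(link))
```

A URL without the token passes through unchanged, so the helper can be applied to every entry link regardless of which feed it came from.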

Thanks for taking the time to look at my issue, Dylan M

Billybangleballs commented 5 years ago

I am having a similar problem.

2018-12-18 10:58:01+0000 [-]   File "/usr/local/lib/python2.7/dist-packages/feedparser.py", line 356, in __getitem__
2018-12-18 10:58:01+0000 [-]     return dict.__getitem__(self, key)

The feed causing the issue is http://www.xinhuanet.com/english/rss/scirss.xml

<item>
<title>
<![CDATA[
<a href='http://news.xinhuanet.com/english/2017-05/30/c_136324734.htm' target='_blank'>Rice first domesticated in China at about 10,000 years ago: study</a>
]]>
</title>
<alink>
http://news.xinhuanet.com/english/2017-05/30/c_136324734.htm
</alink>
<description>
<![CDATA[
<img src="../titlepic/112105/1121056906_1496105003752_title0h.jpg" width="100" height="100" alt="Rice first domesticated in China at about 10,000 years ago: study" />Rice, one of the world's most important staple foods sustaining more than half of the global population, was first domesticated in China about 10,000 years ago, a new study suggested Monday.
]]>
</description>
<category>xinhuanet</category>
<author/>
<pubDate>2017-05-30</pubDate>
</item>

kurtmckee commented 5 years ago

Wow, that is strange behavior!

CDATA sections are simply a way to avoid escaping content that contains reserved XML characters like "<". They don't imply a change in parser behavior.
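To illustrate the point, here is a small sketch using Python's standard-library `xml.etree.ElementTree` (not feedparser itself, so it says nothing about feedparser's internals). It shows that a link wrapped in a CDATA section and the same link written with an escaped `&amp;` yield identical text once parsed. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The same link, written two ways: once inside a CDATA section (raw "&"),
# once as ordinary character data (the "&" escaped as "&amp;").
cdata_doc = (
    "<link><![CDATA[https://www.digitalmzx.net/forums/index.php"
    "?app=tracker&showissue=726]]></link>"
)
escaped_doc = (
    "<link>https://www.digitalmzx.net/forums/index.php"
    "?app=tracker&amp;showissue=726</link>"
)

cdata_text = ET.fromstring(cdata_doc).text
escaped_text = ET.fromstring(escaped_doc).text

# Both forms produce the same character data after parsing.
print(cdata_text == escaped_text)
```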

That said, feedparser shouldn't be making additional requests for embedded HTML URLs at any time, but the addition of a session ID definitely suggests that that's happening! I'll investigate as soon as possible!

Billybangleballs commented 5 years ago

In addition, the RSS feed I mentioned as breaking feedparser is itself broken and does not validate. The OP's feed, https://www.digitalmzx.net/forums/index.php?app=core&module=global&section=rss&type=tracker&id=9, doesn't validate either.

That being said, a bad feed shouldn't break feedparser; it should keep calm and carry on.