google-code-export / feedparser

Automatically exported from code.google.com/p/feedparser

Ampersands in URLs are handled incorrectly on Windows (XP?) #357

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Parse a feed whose URLs contain ampersands.

Expected output:
http://somehost.net/torrents.php?action=download&authkey=removed&torrent_pass=removed&id=137212

The link in the item's dictionary looks like this instead:
http://somehost.net/torrents.php?action=download&ampauthkey=removed&amptorrent_pass=removed&ampid=137212

What version of the product are you using? On what operating system?
This problem occurs only on Windows (tested only on Windows XP).

feedparser v5.1
Please provide any additional information below.

Original issue reported on code.google.com by bendi...@gmail.com on 17 May 2012 at 10:43

GoogleCodeExporter commented 9 years ago
Actually, the problem is caused by xml.sax not being available.
I encountered this bug in the Deluge torrent client, which includes a limited version of Python 2.6, so _XML_AVAILABLE is set to 0 when it fails to import xml.sax.

The problem is caused by this code:

# query variables in urls in link elements are improperly
# converted from `?a=1&b=2` to `?a=1&b;=2` as if they're 
# unhandled character references. fix this special case.
output = re.sub("&([A-Za-z0-9_]+);", "&\g<1>", output)

When the URL in the feed looks like this:
http://somehost.net/torrents.php?action=download&amp;id=9631475

This code converts it into this:
http://somehost.net/torrents.php?action=download&ampid=9631475
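
The effect of that substitution can be checked in isolation. This is a minimal sketch, assuming the URL reaches the substitution with its ampersand still escaped as `&amp;`. The helper name `strip_semicolons` is made up for illustration and is not part of feedparser. Only the pattern and replacement come from the code quoted above.

```python
import re

# Pattern and replacement as quoted above, written as raw strings.
ENTITY_FIX = re.compile(r"&([A-Za-z0-9_]+);")


def strip_semicolons(output):
    """Hypothetical wrapper: drop the ';' from anything shaped like '&name;'."""
    return ENTITY_FIX.sub(r"&\g<1>", output)


# The case the substitution was written for: '&id;' left behind by the parser.
intended = strip_semicolons("http://somehost.net/torrents.php?action=download&id;=9631475")

# The case from this report: a correctly escaped '&amp;' also matches the pattern.
broken = strip_semicolons("http://somehost.net/torrents.php?action=download&amp;id=9631475")

print(intended)
print(broken)
```

The second call shows why the link ends up containing `&ampid`: the pattern cannot tell a real entity such as `&amp;` apart from a query parameter that was mistaken for one.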

As I'm not entirely sure which cases this code is supposed to fix, I added a bulletproof fix by placing this line before the re.sub call shown above:
output = re.sub("&amp;", "&", output)
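
A rough sketch of the two substitutions applied in that order follows. It assumes the added line is meant to turn the `&amp;` entity into a bare ampersand before the existing substitution runs. The function name `fix_link` is hypothetical and is not taken from feedparser.

```python
import re


def fix_link(output):
    """Hypothetical helper combining the workaround with the existing substitution."""
    # Workaround from this comment: decode '&amp;' first so it is never
    # seen by the next pattern.
    output = re.sub("&amp;", "&", output)
    # Existing special case: turn '&name;' back into '&name'.
    output = re.sub(r"&([A-Za-z0-9_]+);", r"&\g<1>", output)
    return output


escaped = ("http://somehost.net/torrents.php?action=download"
           "&amp;authkey=removed&amp;torrent_pass=removed&amp;id=137212")
fixed = fix_link(escaped)
print(fixed)
```

With the extra line in place, the escaped URL comes out with plain `&` separators, matching the expected output in the original report. Whether this ordering is safe for every other input the substitution was written for has not been checked here.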

Reproduce the issue by raising ImportError at the top of this block in feedparser.py:
try:
    raise ImportError() 
    import xml.sax
    from xml.sax.saxutils import escape as _xmlescape
except ImportError:
    _XML_AVAILABLE = 0
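
If editing feedparser.py is inconvenient, the missing module might also be simulated from the calling script. The sketch below relies on standard Python import behaviour: a `None` entry in `sys.modules` makes a later import of that name raise ImportError. The helper `xml_sax_importable` is invented for this example, and the block must run before feedparser is first imported for it to have any effect there.

```python
import sys


def xml_sax_importable():
    """Hypothetical check: does `import xml.sax` succeed right now?"""
    try:
        import xml.sax  # noqa: F401
    except ImportError:
        return False
    return True


# Remember whatever was registered, then block the module.
_missing = object()
_saved = sys.modules.get("xml.sax", _missing)
sys.modules["xml.sax"] = None
blocked = xml_sax_importable()

# Undo the block so the rest of the program is unaffected.
if _saved is _missing:
    del sys.modules["xml.sax"]
else:
    sys.modules["xml.sax"] = _saved
restored = xml_sax_importable()

print(blocked, restored)
```

To reproduce the bug this way, the import of feedparser would go between the blocking and restoring steps, so that its own `import xml.sax` fails and it falls back to `_XML_AVAILABLE = 0`.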

Then try parsing the following RSS file:
<rss version="2.0">
  <channel>
    <title>Site title</title>
    <link>Site url</link>
    <description>Description of RSS Feed</description>
    <language>en-us</language>
    <ttl>120</ttl>
    <item>
      <title>Some title</title>
      <link>http://hostname.com/Fetch?hash=2f21d4e59&amp;digest=865178f9bc</link>
      <guid isPermaLink="true">http://hostname.com/Fetch?hash=2f21d4e59&amp;digest=865178f9bc</guid>
      <comments>Some comment</comments>
      <pubDate>Thu, 15 May 2012 00:16:18 +0000</pubDate>
      <description>Detailed description</description>
      <enclosure url="http://hostname.com/Fetch?hash=2f21d4e59&amp;id=865178f9bc" length="3423659" type="application/x-bittorrent"/>
    </item>
  </channel>
</rss>

Original comment by bendi...@gmail.com on 17 May 2012 at 11:51