google-code-export / feedparser

Automatically exported from code.google.com/p/feedparser

unicode content is utf8-encoded then sniffed for an XML encoding #378

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Hello there!

The parsing is only happening at a later stage in our system. Prior to that we 
are working with the Unicode representation of the feed XML. Unfortunately, 
feedparser re-encodes the Unicode string using UTF-8 before feeding it into a 
StringIO, but does not remember this choice of encoding, so that it tries to 
guess the encoding again at a later stage. If the feed was not originally 
UTF-8-encoded and contains an encoding specification that feedparser 
understands, then the subsequent decoding fails.

The feed below was originally ISO-8859-15-encoded. “Umsätze” is the 
correct spelling; “UmsÀtze” is not.

{{{
>>> import feedparser
>>> import requests
>>> feedparser.__version__
'5.1.2'
>>> resp = requests.get('http://www.ibusiness.de/export/rss.xml?format=rss20')
>>> text = resp.content.decode('iso-8859-15')
>>> print text[1000:1200]
08:56:08 +0200</pubDate>
    </item>
    <item>
      <title>BVH-Prognose: ECommerce-Umsätze steigen 2012 auf 27,5 Milliarden Euro </title>
      <description>
         <![CDATA[]]>
      </descriptio
>>> feed = feedparser.parse(text)
>>> print feed.entries[1]['title']
BVH-Prognose: ECommerce-UmsÀtze steigen 2012 auf 27,5 Milliarden Euro
}}}
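
The effect can also be reproduced without fetching the feed. The following is only a minimal sketch in Python 3 (the session above is Python 2). The sample document and the `naive_redecode` helper are made up for illustration; they imitate the behaviour described above and are not feedparser's actual code.

```python
import re

# Hypothetical sample: a Unicode string that still carries the encoding
# declaration of the original ISO-8859-15 document ("\u00e4" is "ä").
text = '<?xml version="1.0" encoding="iso-8859-15"?><title>Ums\u00e4tze</title>'


def naive_redecode(unicode_doc):
    """Imitate the reported behaviour (not feedparser's real code path):
    encode as UTF-8, then trust the XML declaration when decoding again."""
    data = unicode_doc.encode('utf-8')
    match = re.search(br'encoding="([^"]+)"', data)
    declared = match.group(1).decode('ascii') if match else 'utf-8'
    return data.decode(declared)


result = naive_redecode(text)
print(result)
```

The two UTF-8 bytes of “ä” (0xC3 0xA4) are read back as two separate ISO-8859-15 characters, which is the same damage as in the title above.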

Thank you!

Original issue reported on code.google.com by tel...@gmail.com on 18 Oct 2012 at 4:46

GoogleCodeExporter commented 9 years ago
Good catch. I'll work to fix this.

It's not possible to avoid the encode/decode if anything has to pass through 
sgmllib, because it contains a few code paths where Unicode coercion will fail. 
However, it should be possible to avoid damaging the content.
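
One way a caller might avoid the mismatch in the meantime is to make the XML declaration agree with the bytes before handing them over. This is only a sketch under that assumption, written for Python 3; `to_utf8_with_matching_declaration` is a hypothetical helper, not part of feedparser, and not necessarily how the fix will be implemented.

```python
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_DECL_ENCODING = re.compile(r'''^(\s*<\?xml[^>]*?encoding=)(["'])[^"']*\2''')


def to_utf8_with_matching_declaration(unicode_doc):
    """Hypothetical helper: rewrite the declared encoding to utf-8,
    then encode, so that declaration and bytes agree."""
    fixed = _DECL_ENCODING.sub(r'\g<1>\g<2>utf-8\g<2>', unicode_doc, count=1)
    return fixed.encode('utf-8')


# Made-up sample document ("\u00e4" is "ä").
text = '<?xml version="1.0" encoding="iso-8859-15"?><title>Ums\u00e4tze</title>'
data = to_utf8_with_matching_declaration(text)

# Decoding with the encoding the declaration now names restores the text.
roundtrip = data.decode('utf-8')
print(roundtrip)
```

A document without an XML declaration passes through unchanged apart from the UTF-8 encoding step.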

Original comment by kurtmckee on 19 Nov 2012 at 4:05