unicode content is utf8-encoded then sniffed for an XML encoding

Hello there!

In our system, parsing only happens at a later stage; before that, we work with 
the Unicode representation of the feed XML. Unfortunately, when given a Unicode 
string, feedparser re-encodes it as UTF-8 before feeding it into a StringIO but 
does not remember this choice of encoding, so it tries to guess the encoding 
again at a later stage. If the feed was not originally UTF-8-encoded and 
contains an encoding specification that feedparser understands, the subsequent 
decoding uses that declared encoding instead of UTF-8 and garbles the text.
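
The effect described above can be reproduced with the standard library alone. This is only a sketch of the two steps, not feedparser's actual code; the sample string is a made-up stand-in for a feed that still carries its original encoding declaration.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for the feed: a Unicode string whose XML
# declaration still names the original encoding.
text = u'<?xml version="1.0" encoding="iso-8859-15"?><title>Ums\u00e4tze</title>'

# Step 1: the Unicode string is re-encoded as UTF-8.
raw = text.encode('utf-8')

# Step 2: the bytes are later decoded with the encoding named in the
# XML declaration rather than with UTF-8.
garbled = raw.decode('iso-8859-15')

# The two UTF-8 bytes of U+00E4 (0xC3 0xA4) are now read as two
# separate ISO-8859-15 characters, U+00C3 and U+20AC.
```

After these two steps `garbled` no longer equals `text`: the single character U+00E4 has turned into the pair U+00C3 U+20AC.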

The feed below was originally ISO-8859-15-encoded. “Umsätze” is the 
correct spelling; “UmsÃ€tze” is not.

{{{
>>> import feedparser
>>> import requests
>>> feedparser.__version__
'5.1.2'
>>> resp = requests.get('http://www.ibusiness.de/export/rss.xml?format=rss20')
>>> text = resp.content.decode('iso-8859-15')
>>> print text[1000:1200]
08:56:08 +0200</pubDate>
    </item>
    <item>
      <title>BVH-Prognose: ECommerce-Umsätze steigen 2012 auf 27,5 Milliarden Euro </title>
      <description>
         <![CDATA[]]>
      </descriptio
>>> feed = feedparser.parse(text)
>>> print feed.entries[1]['title']
BVH-Prognose: ECommerce-UmsÃ€tze steigen 2012 auf 27,5 Milliarden Euro
}}}
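
One possible workaround on the caller's side is to make the declared encoding agree with the bytes before handing the document on. The sketch below is an assumption about how that could be done, not something feedparser provides: `to_utf8_bytes` is a hypothetical helper that rewrites the encoding in a leading XML declaration to `utf-8` and returns UTF-8 bytes.

```python
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_XML_DECL_ENCODING = re.compile(
    r'^(\s*<\?xml[^>]*?encoding\s*=\s*)(["\'])[A-Za-z0-9._-]+\2'
)


def to_utf8_bytes(text):
    """Hypothetical helper: return UTF-8 bytes whose XML declaration says utf-8.

    Text without an XML declaration (or without an encoding in it) is
    encoded unchanged.
    """
    fixed = _XML_DECL_ENCODING.sub(r'\g<1>\g<2>utf-8\g<2>', text, count=1)
    return fixed.encode('utf-8')
```

In the session above, the result of `to_utf8_bytes(text)` would be passed to the parser in place of `text`. Whether this avoids the problem in 5.1.2 has not been verified here; it only ensures that the bytes and the declaration no longer disagree.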

Thank you!

Original issue reported on code.google.com by tel...@gmail.com on 18 Oct 2012 at 4:46
