getting SAXParseException with python3

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
import feedparser
feed = 'http://www.reddit.com/.rss'
info = feedparser.parse(feed)

# debugging line
print( info )

#following line fails :(
#print( info.feed.title )

for entry in info.entries:
    print( entry.title )

What is the expected output? What do you see instead?
I expect to see the contents of the feed, but instead I only get this:
{'feed': {}, 'status': 200, 'version': '', 'encoding': 'iso-8859-1', 'bozo': 1, 
'headers': {'Content-Length': '5500', 'Content-Encoding': 'gzip', 'Set-Cookie': 
'reddit_first=%7B%22firsttime%22%3A%20%22first%22%7D; Domain=reddit.com; 
expires=Thu, 31 Dec 2037 23:59:59 GMT; Path=/', 'Vary': 'Accept-Encoding', 
'Server': "'; DROP TABLE servertypes; --", 'Connection': 'close', 'Date': 'Sun, 
22 May 2011 10:02:18 GMT', 'Content-Type': 'text/xml; charset=UTF-8'}, 'href': 
'http://www.reddit.com/.rss', 'namespaces': {}, 'entries': [], 
'bozo_exception': SAXParseException('not well-formed (invalid token)',)}
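For anyone trying to narrow this down: the response headers above include 'Content-Encoding': 'gzip', and one way to get this exact message from the SAX parser is to hand it bytes that are still compressed. Whether that is what happens inside feedparser here is an assumption, not something confirmed in this report. The sketch below uses only the standard library; FEED, TitleHandler and try_parse are made-up names for illustration.

```python
import gzip
import xml.sax

# A tiny stand-in feed; the real reddit feed is not needed for the demonstration.
FEED = (b"<?xml version='1.0' encoding='UTF-8'?>"
        b"<rss version='2.0'><channel><title>demo</title></channel></rss>")


class TitleHandler(xml.sax.ContentHandler):
    """Collects the text of <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''

    def startElement(self, name, attrs):
        if name == 'title':
            self.in_title = True

    def endElement(self, name):
        if name == 'title':
            self.in_title = False

    def characters(self, content):
        if self.in_title:
            self.title += content


def try_parse(data):
    """Return (title, None) on success or (None, error message) on failure."""
    handler = TitleHandler()
    try:
        xml.sax.parseString(data, handler)
    except xml.sax.SAXParseException as exc:
        return None, exc.getMessage()
    return handler.title, None


compressed = gzip.compress(FEED)
print(try_parse(FEED))                         # plain XML bytes
print(try_parse(compressed))                   # still-gzipped bytes
print(try_parse(gzip.decompress(compressed)))  # decompressed first
```

If the second call reports the same "not well-formed" message as the bozo_exception above, then checking whether the body was decompressed before parsing would be a reasonable next step.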

What version of the product are you using? On what operating system?
feedparser 5.0.1, converted to Python 3 with the included script, on Ubuntu 
10.04 with python3.

Please provide any additional information below.
This code works exactly as expected under Python 2. I only have problems with 
Python 3.
The workaround I found is to download the feed XML to a local file first 
and then have feedparser open that file, like so:

import feedparser
import urllib.request
feed = 'http://www.reddit.com/.rss'

# save the raw response bytes to a local file; both handles are closed on exit
with urllib.request.urlopen(feed) as u, open('feed.xml', 'wb') as fp:
    fp.write(u.read())

info = feedparser.parse('feed.xml')

# debugging line
#print( info )

#following line no longer fails :)
print( info.feed.title )

for entry in info.entries:
    print( entry.title )

Original issue reported on code.google.com by Yossi.Ra...@gmail.com on 22 May 2011 at 7:43

GoogleCodeExporter commented 9 years ago

I don't know if this is a related bug or not, but if I lowercase the keys in 
the http_headers dictionary in parse() like this:

http_headers = {k.lower(): v for k, v in result.get('headers', {}).items()}  # dict comp

then I still get the same error as before, but this time 'encoding': 
'iso-8859-1' becomes 'encoding': 'iso-8859-2'. Neither is the correct 
encoding, UTF-8.
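To show why the key case might matter, here is a small standalone sketch using only the standard library. It is not feedparser's code: normalize_headers and charset_from_headers are hypothetical helpers, and the header values are copied from the output in the original report. A plain dict lookup with a lowercase key misses 'Content-Type' and 'Content-Encoding' when the server sends them capitalized, so the declared charset and the gzip flag can both go unnoticed.

```python
from email.message import Message


def normalize_headers(headers):
    """Return a copy of the headers with lowercased keys."""
    return {k.lower(): v for k, v in headers.items()}


def charset_from_headers(headers, default=None):
    """Extract the charset parameter from Content-Type, ignoring key case."""
    content_type = normalize_headers(headers).get('content-type')
    if content_type is None:
        return default
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() parses the header parameters and lowercases the result
    return msg.get_content_charset(default)


# Header values taken from the report above
raw = {'Content-Encoding': 'gzip',
       'Content-Type': 'text/xml; charset=UTF-8'}

print(raw.get('content-encoding'))                     # lookup on the raw dict
print(normalize_headers(raw).get('content-encoding'))  # lookup after lowercasing
print(charset_from_headers(raw))                       # declared charset
```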

Original comment by Yossi.Ra...@gmail.com on 22 May 2011 at 8:06

GoogleCodeExporter commented 9 years ago

I'm not able to reproduce this using the feed you linked to using Python 3.0, 
3.1, or 3.2, with and without the ported sgmllib included in the source 
distribution. Would you download the latest source code, linked below, and see 
if you're still getting the error? If you are, please use wget or curl to grab 
the feed and upload it as an attachment to this bug report.

https://feedparser.googlecode.com/svn/trunk/feedparser/feedparser.py

Original comment by kurtmckee on 29 May 2011 at 6:16

Changed title: getting SAXParseException with python3

GoogleCodeExporter commented 9 years ago

Original comment by kurtmckee on 29 May 2011 at 6:18

Changed state: NeedInfo

GoogleCodeExporter commented 9 years ago

I just tested this with r554 and the bug appears to have been fixed.

I think I was using r377 (?) earlier, as that is what comes in the "official 
download" for 5.0.1.

Original comment by Yossi.Ra...@gmail.com on 29 May 2011 at 10:44

GoogleCodeExporter commented 9 years ago

Glad to hear it, thanks for the quick response!

Original comment by kurtmckee on 30 May 2011 at 8:39

Changed state: Fixed
