loose parser doesn't always return unicode strings

libo26 / feedparser

Automatically exported from code.google.com/p/feedparser

loose parser doesn't always return unicode strings #148

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

The dicts in "f.entries[].enclosures" may contain bytestrings if the feed
is bozo, since unlike tag contents, which are converted to unicode in
push(), the attributes of the enclosure tag are appended to the list
directly, as returned from the parser.

Via #131, this also affects the item ids.
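One way to see which values are affected is to walk a parsed result and list every value that is still a bytestring. The sketch below is written for Python 3, where a bytestring is `bytes`. The helper name `find_bytestrings` and the hand-built `sample` dict are illustrative assumptions; `sample` only imitates the shape of `f.entries[].enclosures` and is not real feedparser output.

```python
def find_bytestrings(value, path="f"):
    """Yield the path of every bytes value nested inside dicts, lists and tuples."""
    if isinstance(value, bytes):
        yield path
    elif isinstance(value, dict):
        for key, item in value.items():
            yield from find_bytestrings(item, "%s[%r]" % (path, key))
    elif isinstance(value, (list, tuple)):
        for index, item in enumerate(value):
            yield from find_bytestrings(item, "%s[%d]" % (path, index))


# Hypothetical stand-in for a parsed bozo feed (assumption, not feedparser output).
sample = {
    "entries": [
        {
            "title": "Episode 1",  # tag content: already text
            "id": b"http://example.com/1",  # taken from an attribute: still bytes
            "enclosures": [
                {"href": b"http://example.com/a.mp3", "type": "audio/mpeg"},
            ],
        }
    ]
}

offenders = sorted(find_bytestrings(sample))
for offender in offenders:
    print(offender)
```

Running the same walk over a real result from a bozo feed would show whether any attribute-derived values slipped through as bytestrings.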

Original issue reported on code.google.com by elsdoer...@gmail.com on 8 Dec 2008 at 2:40

GoogleCodeExporter commented 9 years ago

The same is true for the feed image url.

Original comment by elsdoer...@gmail.com on 12 Dec 2008 at 1:49

GoogleCodeExporter commented 9 years ago

Original comment by elsdoer...@gmail.com on 12 Dec 2008 at 2:44

Attachments:

148.diff

GoogleCodeExporter commented 9 years ago

Do you have a sample document that demonstrates this problem?

I spent a week sorting through inconsistent str/unicode handling while porting 
feedparser to Python 3, and I'd like to see if the changes I made will fix this 
problem when they're merged back into the project. (Issue 215 is where Python 3 
work is being tracked; you can star that issue to be notified as it progresses.)

Original comment by kurtmckee on 6 Dec 2010 at 10:45

GoogleCodeExporter commented 9 years ago

The patch I have posted includes test cases; that is, it changes a bunch of 
existing test cases from ./illformed to check for a proper unicode type.

FWIW, this doesn't just affect the enclosure data, as I originally reported, 
but a bunch of other values as well; apparently mostly when taken from 
attribute values.

The problem is that unlike tag contents, which are already normalized to 
unicode, certain attribute values are taken directly, as-is, from whatever is 
returned by the parser. Depending on which parser is used (loose, strict, ...), 
the behaviour can differ, and the loose parser seems to return bytestrings.

Original comment by elsdoer...@gmail.com on 7 Dec 2010 at 2:24

GoogleCodeExporter commented 9 years ago

I got fairly intimate with feedparser during Thanksgiving week, and you're 
right, it plays fast and loose with str/unicode and relies heavily on Python 2 
to implicitly convert between str and unicode. Python 3 never implicitly 
converts between bytes and str, and because every test passes in Python 3 
without modifying the unit tests, I'm optimistic that this may be fixed when 
those changes are merged into trunk.
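To illustrate the Python 3 behaviour mentioned here, the short sketch below mixes `str` and `bytes`. The variable names are illustrative only.

```python
# Python 3 refuses to mix text and bytes implicitly.
try:
    "feed: " + b"title"
    mixed_ok = True
except TypeError:
    mixed_ok = False

# The conversion has to be spelled out, with an explicit encoding.
explicit = "feed: " + b"title".decode("utf-8")

print(mixed_ok, explicit)
```

Code that passes under Python 3 therefore has to convert at explicit points, which is why a successful port tends to flush out this class of bug.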

Original comment by kurtmckee on 7 Dec 2010 at 6:53

GoogleCodeExporter commented 9 years ago

Marked as accepted. Once we merge in the Python 3 changes we can come back to 
this.

Original comment by adewale on 13 Dec 2010 at 1:44

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

I've checked into this now that the Python 3 changes are in, and this is indeed 
a problem. I modified the test cases and then ran them through the Python 
debugger, and it appears that this is in part related to line 1839, in which 
Python 2 interpreters re-encode the `unicode` to a UTF-8 `str` object.

I tried changing this behavior but ~250 tests started throwing errors. Happily 
they all belong to only two or three classes of errors, so after the next 
release I'll work to fix this.

Original comment by kurtmckee on 10 Jan 2011 at 1:45

GoogleCodeExporter commented 9 years ago

Original comment by kurtmckee on 15 Jan 2011 at 7:02

GoogleCodeExporter commented 9 years ago

Original comment by kurtmckee on 19 Jan 2011 at 4:53

Changed title: loose parser doesn't always return unicode strings
Added labels: Priority-Medium, Type-Defect

GoogleCodeExporter commented 9 years ago

This is fixed in r537.

Original comment by kurtmckee on 26 Apr 2011 at 6:35

Changed state: Fixed