Closed GoogleCodeExporter closed 9 years ago
The same is true for the feed image URL.
Original comment by elsdoer...@gmail.com
on 12 Dec 2008 at 1:49
Do you have a sample document that demonstrates this problem?
I spent a week sorting through inconsistent str/unicode handling while porting
feedparser to Python 3, and I'd like to see if the changes I made will fix this
problem when they're merged back into the project. (Issue 215 is where Python 3
work is being tracked; you can star that issue to be notified as it progresses.)
Original comment by kurtmckee
on 6 Dec 2010 at 10:45
The patch I have posted includes test cases; that is, it changes a bunch of
existing test cases from ./illformed to check for a proper unicode type.
FWIW, this doesn't just affect the enclosure data, as I originally reported,
but a bunch of other values as well; apparently mostly when taken from
attribute values.
The problem is that unlike tag contents, which are already normalized to
unicode, certain attribute values are taken directly, as-is, from whatever is
returned by the parser. Depending on what parser is used (loose, strict, ...),
the behaviour can differ, and the loose parser seems to return bytestrings.
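
A minimal sketch of the kind of normalization described above, assuming Python 3. The names `to_text` and `normalize_attrs` are hypothetical and are not feedparser functions; the UTF-8 default and the `errors="replace"` policy are also assumptions made for illustration.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings (hypothetical helper)."""
    if isinstance(value, bytes):
        # Assumption: replace undecodable bytes rather than raise,
        # since ill-formed feeds are the case under discussion.
        return value.decode(encoding, errors="replace")
    return value


def normalize_attrs(attrs, encoding="utf-8"):
    """Turn (name, value) pairs from a parser into a dict of text values."""
    return {to_text(k, encoding): to_text(v, encoding) for k, v in attrs}


# A parser might hand back a mix of byte strings and text.
attrs = [("url", b"http://example.com/a.mp3"), ("type", "audio/mpeg")]
normalized = normalize_attrs(attrs)

# The kind of type check the modified test cases would make.
assert all(isinstance(v, str) for v in normalized.values())
```

Applying a step like this at the point where attributes are read would make the result independent of which parser produced them.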
Original comment by elsdoer...@gmail.com
on 7 Dec 2010 at 2:24
I got fairly intimate with feedparser during Thanksgiving week, and you're
right, it plays fast and loose with str/unicode and relies heavily on Python to
automatically typecast between str and unicode. Python 3 will never typecast
between bytes and str, and because every test passes in Python 3 without
modifying the unit tests, I'm optimistic that this may be fixed when those
changes are merged into trunk.
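
To illustrate the point about Python 3 refusing to convert implicitly, here is a small sketch (the variable and function names are made up for the example). Mixing `bytes` and `str` raises an error instead of silently coercing, so code that relied on automatic conversion fails loudly.

```python
text = "caf\u00e9"           # str (text)
data = text.encode("utf-8")  # bytes


def concat_fails():
    """Return True if mixing str and bytes raises TypeError."""
    try:
        text + data
    except TypeError:
        return True
    return False


mixed_concat_raises = concat_fails()

# Conversion has to be explicit, in both directions.
round_trip = data.decode("utf-8")
```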
Original comment by kurtmckee
on 7 Dec 2010 at 6:53
Marked as accepted. Once we merge in the Python 3 changes we can come back to
this.
Original comment by adewale
on 13 Dec 2010 at 1:44
I've checked into this now that the Python 3 changes are in, and this is indeed
a problem. I modified the test cases and then ran them through the Python
debugger, and it appears that this is in part related to line 1839, in which
Python 2 interpreters re-encode the `unicode` to a UTF-8 `str` object.
I tried changing this behavior but ~250 tests started throwing errors. Happily
they all belong to only two or three classes of errors, so after the next
release I'll work to fix this.
Original comment by kurtmckee
on 10 Jan 2011 at 1:45
Original comment by kurtmckee
on 15 Jan 2011 at 7:02
Original comment by kurtmckee
on 19 Jan 2011 at 4:53
This is fixed in r537.
Original comment by kurtmckee
on 26 Apr 2011 at 6:35
Original issue reported on code.google.com by
elsdoer...@gmail.com
on 8 Dec 2008 at 2:40