pombreda / feedparser

Automatically exported from code.google.com/p/feedparser

base64 decoding is too aggressive #284

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Open the following example file: 
https://gist.github.com/1019783#file_zotero.xml using Python 2.7
2. Parse using feedparser.parse()
3. Attempt to decode the content[0]['value'] strings to UTF-8

What is the expected output? What do you see instead?
The second entry should be '{"name":"A Midsummer Night's Dream","parent":false}'
Instead, a UnicodeDecodeError is thrown:
'ascii', '\x9d\xa9\x9e\x00\xc8\x9d\xb2\xe9\xa6z\xb3b\x82\x1bl\x0e\xb7\x9a\x9a\x96\xabz{_j[\x1e', 0, 1, 'ordinal not in range(128)'

What version of the product are you using? On what operating system?
Version 5.0.1, OS X (10.6.7), Python 2.7

Please provide any additional information below.

Original issue reported on code.google.com by ursch...@gmail.com on 11 Jun 2011 at 5:22

GoogleCodeExporter commented 9 years ago
I've done some more testing, and it seems that the error occurs when the value 
of the JSON "name" field is longer than 8 characters:

{"name":"012345678","parent":false}

will fail

{"name":"01234567","parent":false}

will not
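
One possible reading of the 8-versus-9 boundary, offered as a sketch rather than a confirmed diagnosis: a lenient base64 decoder skips characters outside the base64 alphabet, so only the letters and digits count, and decoding can only succeed when that count is a multiple of 4. The helper name `base64_alphabet_length` is hypothetical and exists only for this illustration.

```python
import re

def base64_alphabet_length(text):
    """Count the characters a lenient base64 decoder would actually consume."""
    # Everything outside A-Z, a-z, 0-9, '+', '/' is dropped before counting.
    return len(re.sub(r"[^A-Za-z0-9+/]", "", text))

for sample in ('{"name":"012345678","parent":false}',
               '{"name":"01234567","parent":false}'):
    n = base64_alphabet_length(sample)
    # A count divisible by 4 needs no padding, so decoding would not fail.
    print(n, n % 4 == 0)
```

This prints `24 True` for the failing string and `23 False` for the passing one, which is consistent with the threshold observed above being about the total count of alphabet characters rather than the field length as such.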

Original comment by ursch...@gmail.com on 12 Jun 2011 at 12:46

GoogleCodeExporter commented 9 years ago
This is occurring because the content has a mimetype of `application/json`. 
That isn't a mimetype that feedparser recognizes, so it's attempting to decode 
the content string using base64. It happens that the exact string that you 
quoted above can be decoded using base64 without throwing an error. All of the 
other content strings pass through unscathed because they can't be decoded.
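
The collision can be reproduced with the standard library alone. This is a minimal sketch using Python 3's `base64.b64decode`, not feedparser's own decoding code; `lenient_b64decode` is a hypothetical helper name, and the input string assumes the entry's JSON is well formed.

```python
import base64
import binascii

def lenient_b64decode(text):
    """Return decoded bytes, or None if the stdlib decoder rejects the input."""
    try:
        # With the default validate=False, characters outside the base64
        # alphabet (braces, quotes, colons, spaces) are discarded first.
        return base64.b64decode(text)
    except binascii.Error:
        return None

collides = lenient_b64decode('{"name":"A Midsummer Night\'s Dream","parent":false}')
survives = lenient_b64decode('{"name":"01234567","parent":false}')
print(repr(collides))  # 27 bytes of binary data
print(survives)        # None: the decoder raised binascii.Error
```

The first call returns the same byte sequence that appears in the reported UnicodeDecodeError, beginning with `\x9d\xa9\x9e`. The second string leaves 23 alphabet characters after filtering, so the decoder raises and the content would be left as-is.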

When I have time I'll review the Atom spec more thoroughly to confirm what the 
correct behavior should be.

http://www.atomenabled.org/developers/syndication/atom-format-spec.php#rfc.section.4.1.3.1

Original comment by kurtmckee on 13 Jun 2011 at 12:34

GoogleCodeExporter commented 9 years ago

Original comment by kurtmckee on 6 Sep 2011 at 3:17

GoogleCodeExporter commented 9 years ago
Issue 288 has been merged into this issue.

Original comment by kurtmckee on 6 Sep 2011 at 3:19

GoogleCodeExporter commented 9 years ago
Issue 316 has been merged into this issue.

Original comment by kurtmckee on 12 Dec 2011 at 5:39

GoogleCodeExporter commented 9 years ago
Actually, I'd say you're doing the right thing already. According to the spec, 
application/json and application/x-csl+json (mentioned in issue 316) should be 
Base64-encoded, even if that's rather counterproductive for JSON. We'll fix 
this on our end.

More details here:

http://groups.google.com/group/zotero-dev/msg/882047943ea07ee2

Original comment by dstill...@zotero.org on 12 Dec 2011 at 9:02

GoogleCodeExporter commented 9 years ago
Thanks for the update! It does look like feedparser is following the spec, but 
feedparser also tries to gracefully allow for unexpected problems like this 
(when possible). I'm going to leave this open until I have an opportunity to 
see what options might be available to improve feedparser in this regard.
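
One option that might be worth evaluating, sketched here under assumptions and not taken from feedparser's code: only accept the decoded result when the content is strictly valid base64 once whitespace is removed. `maybe_b64decode` is a hypothetical name, and the sketch uses Python 3's `validate=True` flag.

```python
import base64
import binascii

def maybe_b64decode(text):
    """Decode text only if it is strictly valid base64; otherwise return it unchanged."""
    # Base64 payloads in feeds are often line-wrapped, so drop whitespace first.
    compact = "".join(text.split())
    try:
        # validate=True rejects any character outside the base64 alphabet
        # instead of silently discarding it.
        return base64.b64decode(compact, validate=True)
    except (binascii.Error, ValueError):
        # Not base64 (or not ASCII): pass the original content through.
        return text

print(maybe_b64decode('{"name":"012345678","parent":false}'))
print(maybe_b64decode("aGVs\nbG8="))
```

With this check the JSON string is returned untouched, while the wrapped payload decodes to `b'hello'`. The trade-off is that short plain-text content made up only of letters and digits with a length divisible by 4 would still be decoded, so this narrows the problem rather than eliminating it.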

Original comment by kurtmckee on 12 Dec 2011 at 5:10