vcard parser crashes on non-ascii characters

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
Python 2.7.1 (r271:86832, Nov 27 2010, 18:30:46) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import feedparser
>>> feedparser.__version__
'5.0.1'
>>> import BeautifulSoup
>>> BeautifulSoup.__version__
'3.2.0'
>>> s="""
... <?xml version="1.0" encoding="UTF-8"?>
... <feed xml:lang="en-US" xmlns="http://www.w3.org/2005/Atom">
... <entry>
...     <content type="html">
... &lt;div class=&quot;vcard&quot;&gt;
... &lt;span class='fn org'&gt;&#180;&lt;/span&gt;
... &lt;/div&gt;
...    </content>
...   </entry>
...   </feed>
... """
>>> feedparser.parse(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "feedparser.py", line 3822, in parse
    feedparser.feed(data.decode('utf-8', 'replace'))
  File "feedparser.py", line 1851, in feed
    sgmllib.SGMLParser.feed(self, data)
  File "C:\dev\python27\lib\sgmllib.py", line 104, in feed
    self.goahead(0)
  File "C:\dev\python27\lib\sgmllib.py", line 143, in goahead
    k = self.parse_endtag(i)
  File "C:\dev\python27\lib\sgmllib.py", line 320, in parse_endtag
    self.finish_endtag(tag)
  File "C:\dev\python27\lib\sgmllib.py", line 360, in finish_endtag
    self.unknown_endtag(tag)
  File "feedparser.py", line 657, in unknown_endtag
    method()
  File "feedparser.py", line 1647, in _end_content
    value = self.popContent('content')
  File "feedparser.py", line 961, in popContent
    value = self.pop(tag)
  File "feedparser.py", line 868, in pop
    mfresults = _parseMicroformats(output, self.baseuri, self.encoding)
  File "feedparser.py", line 2425, in _parseMicroformats
    p.vcard = p.findVCards(p.document)
  File "feedparser.py", line 2362, in findVCards
    sVCards += u'\n'.join(arLines) + u'\n'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 4: ordinal not in range(128)
>>> 

What is the expected output? What do you see instead?
Expected feed to be parsed. UnicodeDecodeError thrown instead.

What version of the product are you using? On what operating system?
Windows 7 x64; Python 2.7.1; feedparser 5.0.1; BeautifulSoup 3.2.0
Same behavior happened on Ubuntu 10.10; Python 2.6.6

Please provide any additional information below.
It appears that parsing vcard properties mixes unicode strings with 
bytestrings. In this case, the ORG property is a byte string containing 
non-ASCII characters.
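
To illustrate the failure mode, here is a minimal sketch that does not use
feedparser at all. It assumes the offending line is the UTF-8 encoding of
u'ORG:\xb4' (U+00B4 is the &#180; from the feed above); that exact value is my
guess, based on the "byte 0xc2 in position 4" in the traceback. The helper
name `implicit_ascii_decode` is made up for the example. It spells out the
ASCII decode that Python 2 performs implicitly when a byte string is joined
with unicode strings.

```python
# Assumed stand-in for the ORG line built inside findVCards: the UTF-8 bytes
# for u'ORG:\xb4'. This value is a guess, not taken from feedparser itself.
org_line = u'ORG:\xb4'.encode('utf-8')  # b'ORG:\xc2\xb4'


def implicit_ascii_decode(raw):
    """Hypothetical helper: the decode Python 2 applies implicitly when a
    byte string is combined with a unicode string."""
    return raw.decode('ascii')


try:
    implicit_ascii_decode(org_line)
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The first four bytes ("ORG:") are ASCII; the decode fails on the fifth.
print(error)
```

The error reported by this sketch points at the same byte and offset as the
traceback, which is consistent with the ORG line being the byte string in the
mix.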

The patch below worked around this issue for me:
@@ -2359,7 +2359,7 @@ class _MicroformatsParser:

             if arLines:
                 arLines = [u'BEGIN:vCard',u'VERSION:3.0'] + arLines + [u'END:vCard']
-                sVCards += u'\n'.join(arLines) + u'\n'
+                sVCards += u'\n'.join(unicode(arLines)) + u'\n'

         return sVCards.strip()

Minimal Atom feed file to reproduce is attached.

Original issue reported on code.google.com by lindsey....@gmail.com on 14 Mar 2011 at 7:21

Attachments:

events.atom

GoogleCodeExporter commented 9 years ago

This is indeed an issue, good catch! I tried your patch under Python 2.4.6 and 
Python 2.7.1, using BeautifulSoup 3.2.0 for each, and found that hella 
microformat unit tests break as a result. `arLines` is a list, so wrapping it 
in `unicode()` converts the whole list to a single string, and `join()` then 
runs over its characters rather than over the lines. I briefly explored other 
options but didn't come up with anything satisfying. I'm inclined to make this 
block on issue 148 (the fix for which will likely educate us on fixing this 
issue too), but if you review the patch in the meantime and upload a revised 
version I'll take another look at it!
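
For reference, a small sketch of the difference between converting the list
and converting each line. The list contents, the UTF-8 assumption, and the
`to_text` helper are all made up for illustration; this is one possible
direction for a revised patch, not necessarily what the eventual fix should
look like. `text_type` is `unicode` on Python 2 and `str` on Python 3, so the
snippet runs on both.

```python
text_type = type(u'')  # unicode on Python 2, str on Python 3


def to_text(value, encoding='utf-8'):
    """Hypothetical helper: decode byte strings, pass text through as-is."""
    if isinstance(value, bytes):
        return value.decode(encoding, 'replace')
    return value


# Assumed sample data: three text lines and one UTF-8 byte string.
ar_lines = [u'BEGIN:vCard', u'VERSION:3.0',
            u'ORG:\xb4'.encode('utf-8'), u'END:vCard']

# Converting the whole list yields the list's repr as one string, so join()
# puts a newline between every character of that repr.
wrapped = u'\n'.join(text_type(ar_lines))

# Converting each element keeps one vCard property per line.
per_line = u'\n'.join(to_text(line) for line in ar_lines)

print(repr(wrapped[:7]))
print(repr(per_line))
```

Decoding per element avoids the implicit ASCII decode, but it only gives the
right text if the byte strings really are in the assumed encoding.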

Original comment by kurtmckee on 17 Mar 2011 at 4:24

Changed title: vcard parser crashes on non-ascii characters
Changed state: Accepted
Added labels: Component-Parser, Type-Defect

GoogleCodeExporter commented 9 years ago

Fixed in r386.

Original comment by kurtmckee on 20 Apr 2011 at 9:46

Changed state: Fixed
