Parsing takes minutes (Raspberry Pi)

google-code-export / feedparser

Automatically exported from code.google.com/p/feedparser


What steps will reproduce the problem?
1. Run following python program with any of the feeds listed. Uncomment feed to test

#--------start source
import feedparser
print "Parse"
# this one is fast, takes about 8 seconds
#feed = feedparser.parse("http://www.engadget.com/rss.xml")
# next one is about 60 seconds
#feed = feedparser.parse("http://www.eevblog.com/feed/")
# these are the slow ones. Takes minutes to parse:
#feed = feedparser.parse("http://www.heise.de/newsticker/heise-atom.xml")
#feed = feedparser.parse("http://www.spiegel.de/schlagzeilen/tops/index.rss ")
print "Done Parse"
size = len(feed['entries'])
for i in range(0,size):
    print feed['entries'][i].title
#-------- end source

What is the expected output? What do you see instead?
Output is as expected, parsing takes more than 4 minutes per feed.

What version of the product are you using? On what operating system?
Raspberry Pi with standard image
Python 2.7.3
feedparser 5.1.3
BeautifulSoup 3.2.1 (and bs4 installed)

Please provide any additional information below.
Parsing is extremely slow. Seems not to be an encoding issue, as eevblog-feed is slow too. Not as slow as German feeds, but slow.
I have included an output from running "python -m cProfile myprogram.py" with the "www.heise.de"-feed. The following lines seem to use the most time:

1538704   48.838    0.000   52.868    0.000 codingstatemachine.py:40(next_state)
    284   31.191    0.110   56.701    0.200 mbcharsetprober.py:52(feed)
   1425   87.178    0.061  115.210    0.081 sbcharsetprober.py:63(feed)
     95   18.991    0.200   42.172    0.444 utf8prober.py:50(feed)

Running the above command did result in errors:

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/cProfile.py", line 199, in <module>
    main()
  File "/usr/lib/python2.7/cProfile.py", line 192, in main
    runctx(code, globs, None, options.outfile, options.sort)
  File "/usr/lib/python2.7/cProfile.py", line 49, in runctx
    prof = prof.runctx(statement, globals, locals)
  File "/usr/lib/python2.7/cProfile.py", line 140, in runctx
    exec cmd in globals, locals
  File "feedreader.py", line 14, in <module>
    print feed['entries'][i].title
UnicodeEncodeError: 'ascii' codec can't encode character u'\xdf' in position 19: ordinal not in range(128)

Perhaps this helps in figuring out what went wrong. I've searched for days now and still don't have a clue... Hopefully my description makes sense to anyone... On my Mac the program runs extremely fast.
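
The UnicodeEncodeError in the traceback is raised by the final print statement, not by the parser: the title contains u'\xdf' and the output stream is using the ASCII codec. A minimal sketch of one way around it, assuming it is acceptable to replace characters the console cannot show. The helper name `printable` is hypothetical, and the code is written for Python 3 although the report uses Python 2.7:

```python
import sys


def printable(text, encoding=None):
    """Return text with characters the target encoding cannot represent replaced by '?'."""
    # Fall back to ASCII when the stream reports no encoding; this mirrors
    # the situation in the traceback (assumption).
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, errors="replace").decode(encoding)


if __name__ == "__main__":
    # '\xdf' is the character named in the UnicodeEncodeError.
    print(printable("Stra\xdfe", encoding="ascii"))
```

Applied to the loop in the report, each title would be passed through such a helper before printing, so a single non-ASCII title no longer aborts the run.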

I just ran the tests to complete my description. There were 3 failures:
======================================================================
FAIL: test_001279 (__main__.TestStrictParser)
./tests/wellformed/mf_hcard/3-5-5-org-unicode.xml: hcard contains non-ascii 
character
----------------------------------------------------------------------
Traceback (most recent call last):
  File "feedparsertest.py", line 752, in <lambda>
    self.failUnlessEval(xmlfile, evalString)
  File "feedparsertest.py", line 166, in failUnlessEval
    raise self.failureException, failure
AssertionError: not eval(not bozo and entries[0]['vcard'] == 
u"BEGIN:vCard\nVERSION:3.0\nORG:\u00b4\nEND:vCard") 
WITH env({'bozo': 0,
 'encoding': u'utf-8',
 'entries': [{'content': [{'base': u'',
                           'language': None,
                           'type': u'text/html',
                           'value': u'<div class="vcard">\n<span class="org">\xb4</span>\n</div>'}],
              'summary': u'<div class="vcard">\n<span class="org">\xb4</span>\n</div>',
              'vcard': u'BEGIN:vCard\nVERSION:3.0\nORG:\xc2\xb4\nEND:vCard'}],
 'feed': {},
 'namespaces': {'': u'http://www.w3.org/2005/Atom'},
 'version': u'atom10'})

======================================================================
FAIL: test_001279 (__main__.TestLooseParser)
./tests/wellformed/mf_hcard/3-5-5-org-unicode.xml: hcard contains non-ascii 
character
----------------------------------------------------------------------
Traceback (most recent call last):
  File "feedparsertest.py", line 752, in <lambda>
    self.failUnlessEval(xmlfile, evalString)
  File "feedparsertest.py", line 166, in failUnlessEval
    raise self.failureException, failure
AssertionError: not eval(not bozo and entries[0]['vcard'] == 
u"BEGIN:vCard\nVERSION:3.0\nORG:\u00b4\nEND:vCard") 
WITH env({'bozo': 0,
 'encoding': u'utf-8',
 'entries': [{'content': [{'base': u'',
                           'language': None,
                           'type': u'text/html',
                           'value': u'<div class="vcard">\n<span class="org">\xb4</span>\n</div>'}],
              'summary': u'<div class="vcard">\n<span class="org">\xb4</span>\n</div>',
              'vcard': u'BEGIN:vCard\nVERSION:3.0\nORG:\xc2\xb4\nEND:vCard'}],
 'feed': {},
 'namespaces': {'': u'http://www.w3.org/2005/Atom'},
 'version': u'atom10'})

======================================================================
FAIL: test_000018 (__main__.TestMicroformats)
./tests/microformats/hcard/3-1-1-fn-unicode-char.xml: unicode character in 
microformat
----------------------------------------------------------------------
Traceback (most recent call last):
  File "feedparsertest.py", line 752, in <lambda>
    self.failUnlessEval(xmlfile, evalString)
  File "feedparsertest.py", line 166, in failUnlessEval
    raise self.failureException, failure
AssertionError: not eval(not bozo and entries[0].vcard == 
u'BEGIN:vCard\nVERSION:3.0\nFN:Tantek 
\xc7elik\nN:\xc7elik;Tantek\nURL:http://tantek.com/\nEND:vCard') 
WITH env({'bozo': 0,
 'encoding': u'utf-8',
 'entries': [{'content': [{'base': u'',
                           'language': None,
                           'type': u'text/html',
                           'value': u'<span class="vcard"><a class="url fn" href="http://tantek.com/">Tantek \xc7elik</a></span>'}],
              'summary': u'<span class="vcard"><a class="url fn" href="http://tantek.com/">Tantek \xc7elik</a></span>',
              'vcard': u'BEGIN:vCard\nVERSION:3.0\nFN:Tantek \u0102&Dagger\\;elik\nN:\u0102&Dagger\\;elik;Tantek\nURL:http://tantek.com/\nEND:vCard'}],
 'feed': {},
 'namespaces': {'content': u'http://purl.org/rss/1.0/modules/content/'},
 'version': u'rss20'})

----------------------------------------------------------------------
Ran 4384 tests in 314.721s

Original comment by drthomas...@googlemail.com on 16 Nov 2013 at 10:24
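
The mismatched values in these three failures look like UTF-8 bytes that were decoded with a single-byte codec. This is an inference from the strings in the output, not something stated in the report, and windows-1250 is only one codec that happens to reproduce them. A small Python 3 sketch of that reading:

```python
# Expected value in the first two failures: ORG:\u00b4 (acute accent).
expected_org = "\u00b4"
# Its UTF-8 encoding is the two bytes C2 B4. Decoding those bytes with a
# single-byte codec gives two characters, matching the actual '\xc2\xb4'.
observed_org = expected_org.encode("utf-8").decode("cp1250")

# Expected value in the third failure: \xc7 (C with cedilla).
expected_fn = "\xc7"
# Its UTF-8 bytes C3 87 decoded as windows-1250 give U+0102 followed by
# U+2021 (double dagger), which lines up with '\u0102&Dagger;' in the
# actual result once the dagger is written as an HTML entity.
observed_fn = expected_fn.encode("utf-8").decode("cp1250")

print(repr(observed_org))
print(repr(observed_fn))
```

If this reading is right, the failures point at how the microformat code decoded the document rather than at the feed parser's own encoding detection, which reported utf-8 in all three cases.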

Thanks for this information! I appreciate that you ran the unit tests, but don't fret about those test failures -- I've removed the microformat parsing completely and that'll be in the next release of feedparser. Quick question: have you tried disabling the HTML sanitization for additional speed comparisons? Also, uninstalling BeautifulSoup may really help speed things up. The microformat code was very slow. I have an rpi and may have an opportunity to test this in the future.
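
For the speed comparisons suggested above, it helps to time the parse step on its own, separately from the download. A rough Python 3 sketch, assuming the feed text is already in memory. `time_call` is a hypothetical helper, not part of feedparser, and `fake_parse` is a stand-in so the example runs without feedparser installed:

```python
import time


def time_call(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_parse(data):
    # Stand-in for feedparser.parse so the sketch is self-contained.
    return {"entries": data.split("<item>")[1:]}


if __name__ == "__main__":
    feed, seconds = time_call(fake_parse, "<rss><item>a</item><item>b</item></rss>")
    print("parsed %d entries in %.3f s" % (len(feed["entries"]), seconds))
```

With feedparser installed, the same helper could wrap `feedparser.parse` once with default settings and once with sanitization turned off, so the two timings are directly comparable. In the 5.x series this is, to my knowledge, controlled by the module-level `feedparser.SANITIZE_HTML` flag; treat that name as an assumption and check it against the installed version.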

Parsing takes minutes (Raspberry Pi) #419