pombreda / feedparser

Automatically exported from code.google.com/p/feedparser

parsing slow with NASA feeds #307

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
http://blogs.nasa.gov/cm/rssfeed/blog/whatonearth.rss?et_cw=850&et_ch=600
http://blogs.nasa.gov/cm/rssfeed/blog/icescape.rss?et_cw=850&et_ch=600

Not sure what goes wrong with these feeds. Parsing takes up to 8 minutes on a
1.6 GHz CPU. Today the entry counts are 63 and 115, respectively.
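
For reference, a minimal sketch of how the measurement could be reproduced (the timing harness itself is an assumption, not part of the original report; Python 2, matching the thread):

# Sketch: time a parse of one of the feeds. This harness is assumed,
# not part of the original report.
import time
import feedparser

url = 'http://blogs.nasa.gov/cm/rssfeed/blog/whatonearth.rss?et_cw=850&et_ch=600'
start = time.time()
result = feedparser.parse(url)  # feedparser fetches and parses in one call
print 'parsed %d entries in %.1f seconds' % (len(result.entries), time.time() - start)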

Original issue reported on code.google.com by noi...@gmail.com on 28 Nov 2011 at 5:15

GoogleCodeExporter commented 9 years ago
Wow, that doesn't sound right. I need additional information, and I'd like you 
to try a few things to help me figure out where the problem is.

Info I need:

* What version of Python are you running?
* What version of feedparser are you using?
* Do you have BeautifulSoup installed? What version?
* Do you have an XML parser installed?

What I'd like you to try:

* Download the latest code from the svn repository and try using that. I've 
included a link below.
* Try saving both feeds to your hard drive and then parsing the local files (a 
sketch follows at the end of this comment).

You can download the very latest code directly from:

https://feedparser.googlecode.com/svn/trunk/feedparser/feedparser.py

I suspect the problem is either that you have an old version of feedparser 
installed, or that there's an issue somewhere that's unrelated to feedparser. 
Unfortunately, I'm not able to reproduce the problem, but hopefully following 
the steps above will help me figure out what's going on!
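
A minimal sketch of the save-locally test (the filename is hypothetical; feedparser.parse() accepts a local filename as well as a URL):

# Sketch of the save-locally test; the filename is hypothetical.
import urllib
import feedparser

url = 'http://blogs.nasa.gov/cm/rssfeed/blog/whatonearth.rss?et_cw=850&et_ch=600'
urllib.urlretrieve(url, 'whatonearth.rss')    # save the feed to disk

result = feedparser.parse('whatonearth.rss')  # parse the local copy
print result.bozo, len(result.entries)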

Original comment by kurtmckee on 29 Nov 2011 at 3:10

GoogleCodeExporter commented 9 years ago
It's running inside the App Engine SDK on Ubuntu 10.10, with Python 2.5 from deadsnakes.

__version__ = "5.0.1" (feedparser)

Will BeautifulSoup speed it up? Is it advised anyway?

The other 200 feeds I tested parse within seconds.

Original comment by noi...@gmail.com on 30 Nov 2011 at 6:47

GoogleCodeExporter commented 9 years ago
I'm running Ubuntu 10.10, and I have Python 2.5 installed from source (but I 
doubt that will make a significant difference). The big thing that jumps out at 
me is the app engine SDK, but I don't know in what way that might affect things.

To try to figure this out, please download the very latest code from trunk to 
some directory on your hard drive. Then, while in that directory, start the 
Python 2.5 interpreter. For example, if you download feedparser.py to 
/home/noiv11/tmp, then when you import feedparser and print its location, it 
should match the location you downloaded it to:

>>> import feedparser
>>> print feedparser.__file__  # should match the location you downloaded the file to
>>> import urllib2
>>> url = 'http://blogs.nasa.gov/cm/rssfeed/blog/whatonearth.rss?et_cw=850&et_ch=600'
>>> feed = urllib2.urlopen(url)
>>> doc = feed.read()
>>> feed.close()
>>> a = feedparser.parse(doc)

I assume that it doesn't take 8 minutes to download the feed, but I want to 
eliminate that possibility by having urllib2 download the feed independent of 
feedparser. If you're still seeing absurd parsing times, I also want you to run 
the following code:

>>> feedparser.RESOLVE_RELATIVE_URIS = 0
>>> feedparser.SANITIZE_HTML = 0
>>> feedparser.PARSE_MICROFORMATS = 0
>>> feedparser.parse(doc)  # reuse the downloaded document so network time stays excluded

As for BeautifulSoup, it's only useful for microformat parsing. If it's 
installed, it can affect some feeds' parsing times, but even with it installed 
I didn't see an 8-minute parse time for those feeds.
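
A quick sketch for checking both points at once ('doc' is assumed to hold the feed bytes downloaded in the interpreter session above):

# Sketch: report whether BeautifulSoup is importable, then compare parse
# times with microformat parsing on and off. 'doc' is assumed to be the
# feed bytes from the session above.
import time
import feedparser

try:
    import BeautifulSoup
    print 'BeautifulSoup', BeautifulSoup.__version__
except ImportError:
    print 'BeautifulSoup is not installed'

for flag in (1, 0):
    feedparser.PARSE_MICROFORMATS = flag
    start = time.time()
    feedparser.parse(doc)
    print 'PARSE_MICROFORMATS=%d took %.2f seconds' % (flag, time.time() - start)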

Please let me know what happens when you download the latest code and run the 
commands above in the Python 2.5 shell!

Original comment by kurtmckee on 30 Nov 2011 at 7:50

GoogleCodeExporter commented 9 years ago
Have you had a chance to investigate the issue further?

Original comment by kurtmckee on 10 Dec 2011 at 9:15

GoogleCodeExporter commented 9 years ago
I updated the system to 11.10 and found 2 more feeds with the same issue. I 
know for sure that App Engine behaves normally in production. I'll use the 
recipe above to give you feedback soon.

Original comment by noi...@gmail.com on 10 Dec 2011 at 10:11

GoogleCodeExporter commented 9 years ago
You're right: with urllib2, parsing takes less than a second. But what's wrong 
with this:

from utils import feedparser
from google.appengine.api import urlfetch

content = urlfetch.fetch(url, method=urlfetch.GET, deadline=20,
                         allow_truncated=False).content
feed = feedparser.parse(content)

All I know is that the current GAE SDK (v1.6.0) has UTF-8 issues; I'll test 
again with the next update.
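
If the SDK's UTF-8 handling is the suspect, a minimal check (a sketch assuming the same fetch call as above) is whether urlfetch hands back raw bytes or an already-decoded unicode object:

# Sketch: see whether urlfetch returns raw bytes (str) or decoded unicode;
# a unicode document can push feedparser down a different, slower path.
content = urlfetch.fetch(url, method=urlfetch.GET, deadline=20,
                         allow_truncated=False).content
print type(content)       # expect <type 'str'> (raw bytes) on Python 2
print repr(content[:80])  # peek at the leading bytes, e.g. for a UTF-8 BOM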

Original comment by noi...@gmail.com on 12 Dec 2011 at 5:57

GoogleCodeExporter commented 9 years ago
> What's wrong with:

I couldn't begin to tell you; I don't use Google App Engine. I'm curious what 
kind of object `fetch()` returns, and whether the GAE API is slow for some 
reason. Could you add code like:

import time

pre = time.time()
content = urlfetch.fetch(...)  # whatever your code is
post = time.time()
print post - pre          # the fetch time, in seconds
print type(content)       # the type of the returned object
feed = feedparser.parse(content)
print time.time() - post  # the parse time, in seconds

It's possible that `content` has a slow `read()` interface if it's file-like, 
or perhaps there's another issue.
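
If content does turn out to be file-like, a minimal workaround sketch would be to drain it into a plain string before parsing:

# Sketch: if 'content' is file-like, read it fully into a string first so
# feedparser works on bytes instead of a possibly slow read() interface.
if hasattr(content, 'read'):
    content = content.read()
feed = feedparser.parse(content)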

Original comment by kurtmckee on 13 Dec 2011 at 7:21

GoogleCodeExporter commented 9 years ago
Have you had an opportunity to further confirm whether this is a Google App 
Engine issue?

Original comment by kurtmckee on 10 Jan 2012 at 7:58

GoogleCodeExporter commented 9 years ago
I'm closing this as invalid because it appears to be an issue with Google App 
Engine, not feedparser.

Original comment by kurtmckee on 4 Feb 2012 at 8:39