pombreda / feedparser

Automatically exported from code.google.com/p/feedparser

performance improvements #300

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
Attached is a diff that gives me 8-10% performance improvements.

Perhaps you can find it useful.

Original issue reported on code.google.com by mrj...@gmail.com on 17 Aug 2011 at 1:31

Attachments:

GoogleCodeExporter commented 9 years ago
Use this diff instead; there was one line that was off in the previous one.

Original comment by mrj...@gmail.com on 17 Aug 2011 at 1:36

Attachments:

GoogleCodeExporter commented 9 years ago
After just glancing at -- but not yet running -- the changes, a lot of the 
patch looks pretty good for performance improvements. For the benefit of 
others, it looks like the patch (a rough sketch of these idioms follows the list):

 * replaces most lists with sets, since they're just used for membership tests
 * moves methods out of classes to avoid method resolution overhead
 * ditches silly string formatting code that uses `locals()`
 * replaces `has_key()` membership tests with `in` tests

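Here's a rough before/after sketch of those idioms (the names are illustrative, 
not taken from the patch itself):

```python
# membership test against a set instead of a list
known_versions = set(['rss20', 'atom10', 'atom03'])  # was a plain list
if 'atom10' in known_versions:
    pass

# plain string interpolation instead of formatting via locals()
element, prefix = 'title', 'dc'
msg = 'unknown element %s with prefix %s' % (element, prefix)
# was: 'unknown element %(element)s with prefix %(prefix)s' % locals()

# `in` instead of has_key()
attrs = {'href': 'http://example.org/'}
if 'href' in attrs:  # was: attrs.has_key('href')
    pass
```
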
Removing the `locals()` calls was already on my todo list, but I'm really 
interested in the list/set switch. I also have a few thoughts pertaining to 
compatibility:

`any()` was only introduced in Python 2.5. I expect it would be better than 
using the current `filter()` code, but it will be necessary to conditionally 
create a pure-Python fallback for Python 2.4. It's a simple addition, though, 
so I'm delighted with the more readable code!

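Something like this minimal fallback is what I have in mind (just a sketch; the 
exact guard could be written differently):

```python
# Define any() only when the builtin is missing (i.e. on Python 2.4).
try:
    any
except NameError:
    def any(iterable):
        for element in iterable:
            if element:
                return True
        return False
```
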
If I recall correctly, the `has_key()` garbage is necessary for one or two of 
the older supported Python versions' HTTP headers or XML attributes objects; 
the objects are dict-alikes that don't actually support the `in` tests. This 
absurdity was fixed in later Python versions, but older Pythons require 
`has_key()` membership tests (and calls to `keys()` for that matter). That's my 
memory of why I didn't replace the `has_key()` calls, anyway.

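Purely as a hypothetical illustration of that concern (the header name is made 
up, and this is Python 2 spelling since `has_key()` is gone in Python 3):

```python
# On a real dict both spellings work; the worry was about old dict-alike
# header/attribute objects that implemented has_key() but not __contains__,
# so only the first form was safe everywhere.
headers = {'content-type': 'application/atom+xml'}

if headers.has_key('content-type'):  # old, universally supported spelling
    pass
if 'content-type' in headers:        # requires __contains__ support
    pass
```
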
Finally, I'm not immediately sold on moving the `unique()`, `normalize()`, and 
`toISO8601()` methods into the global namespace, but I won't argue if it gives 
performance-conscious developers ideas about how to further speed up feedparser!

I'm grateful for the patch, and I'm looking forward to testing it!

Original comment by kurtmckee on 17 Aug 2011 at 6:39

GoogleCodeExporter commented 9 years ago
Great, thanks for the quick response.  I agree with your comments about older 
Python versions.

One question for you, though -- on the "polluting the namespace" issue, is that 
to handle the `from feedparser import *` case? If so, can't you just define 
`__all__ = ["parse"]`?

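Hypothetically, the effect would be something like this (the helper below is a 
stand-in, not real feedparser code):

```python
# With __all__ defined, `from feedparser import *` exposes only parse(),
# so module-level helpers don't leak into the importer's namespace.
__all__ = ["parse"]

def parse(source):
    """Stand-in for feedparser.parse()."""
    return {}

def toISO8601(date):
    # not picked up by a star import because it isn't listed in __all__
    return date
```
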
Original comment by mrj...@gmail.com on 17 Aug 2011 at 7:01

GoogleCodeExporter commented 9 years ago
I've incrementally applied most of your patch, culminating in r568.

Of note, I skipped the addition of `any()`, and didn't move the 
microformat-related functions into the global namespace. My reasoning was that 
the microformat code relies on BeautifulSoup, which isn't Python 3 compatible 
and probably will never be. The microformat code is also a major source of 
crashes and unicode problems, and I want to replace it. Consequently, it seems 
better to keep those functions with the class that will eventually be revamped 
or completely replaced.

I also discovered that the `has_key()` concerns I noted above were unfounded. I 
guess my memory is of something else.

I do have a question for you, though! How are you measuring performance 
improvements? I run the unit tests across Python 2.4 through Python 3.2 and see 
wild variations depending on whether I'm watching a video, visiting a 
JavaScript-heavy webpage, or stopping my music player. cProfile reveals 
algorithmic issues, but it isn't very useful for measuring overall performance 
improvements.

And finally, in the commit messages I've credited your Google username; if you 
provide your real name I'd be delighted to give you better credit!

Original comment by kurtmckee on 28 Aug 2011 at 5:58

GoogleCodeExporter commented 9 years ago
Hi Kurt,

Thanks for applying the changes.  Have you looked at the `locals()` change as well?

I was inspired by a post about the performance of version 5.0 
(http://blog.yjl.im/2011/02/python-universal-feed-parser-5-speed.html).  For 
testing, I used the Blogger test files he uses as well as a few feeds like 
TechCrunch's.  I also made sure the unit tests didn't run slower (although they 
aren't really set up for performance testing).

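For what it's worth, the timing can be done with something as simple as this 
(the file name and iteration count below are just placeholders, not my actual 
test setup):

```python
# Time feedparser.parse() over a saved feed; run against both the patched and
# unpatched trees and compare the totals.
import timeit

setup = (
    "import feedparser\n"
    "data = open('blogger_sample.xml', 'rb').read()"
)
print(timeit.timeit("feedparser.parse(data)", setup=setup, number=100))
```
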
If you'd like to credit my real name, you can mention "John Benediktsson".

By the way, I've gotten a 45% speedup using lxml for parsing microformats 
instead of BeautifulSoup.  It passes all the unit tests, too.  I'm tinkering 
with a full lxml port of feedparser that should take about 5-10% of the time 
the current one does, although I'm struggling with one possible bug in lxml 
right now.  Are you interested?

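To give a flavor of the direction, here's the sort of thing I mean (the rel-tag 
query is only an illustration, not the actual code):

```python
# Scan for rel="tag" links with lxml instead of BeautifulSoup; the real
# microformat code covers far more than this.
import lxml.html

def find_rel_tags(html):
    doc = lxml.html.fromstring(html)
    return [a.get('href') for a in doc.xpath('//a[@rel="tag"]')]

print(find_rel_tags('<p><a rel="tag" href="http://example.org/python">python</a></p>'))
```
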
Best,
John.

Original comment by mrj...@gmail.com on 28 Aug 2011 at 6:06

GoogleCodeExporter commented 9 years ago
I definitely applied the locals() changes. :)

Thanks for linking me to that blog entry. The author's Python 2.5 vs 2.6 
numbers indicate that he very likely has BeautifulSoup in his Python 2.6 path, 
which is probably why he's seeing quadrupled run times; I'll post a better 
analysis on my own blog one of these days. The ponderous feed files will be 
great to run through cProfile, too.

Performance is important to me, second only to readable, maintainable code, so 
yes, I'm interested in your lxml work -- keep them patches a-comin'!

Original comment by kurtmckee on 28 Aug 2011 at 7:33