Performance optimize and memory usage optimize

guyskk commented 4 years ago

I find that feedparser is very slow when parse large feeds, and it's also not fast when parse small feeds.

For example (about 5 MB): https://aotu.io/atom.xml feedparser cost 15 seconds to parse the feed, while another parser in golang (gofeed) only cost 100ms.

Another example (about 300KB): http://ohmymedia.com/feed/ feedparser cost 400ms while gofeed cost less than 10ms.

So I think there's much room for improvement. I did some analysis using pyinstrument, it shows resolve_relative_uris and _sanitize_html cost most of the time. If we replace them with lxml or other C implementation it would be very fast.

The memory usage is not very efficient too, string copy, encode and decode operations cost lot's of memory. I think it have some room for improvement but I didn't deep analysis it yet.

I'm the author of RSSAnt, a RSS reader web app, and use feedparser to parse feeds. The performance is very critical for me, and I'm glad to implement performance optimization for feedparser.

Do you have any suggestions?

guyskk commented 4 years ago

Set sanitize_html and resolve_relative_uris to False, then use lxml to process feedparser result is another option, and by this way I only process the needed fields, not all fields.

kurtmckee commented 4 years ago

@guyskk I agree that there is a lot of room for improvement, and I'm really grateful for the analysis you've already done. I'd like to draw from your experience with RSSAnt, and improve feedparser so that it's meeting your needs better.

I want to update this ticket so that it has a clear objective and can be driven to closure. I'm seeing several suggestions here, including:

Migrate resolve_relative_uris to C
Migrate _sanitize_html to C code (note that I would like to strip this custom code and use an external package)
Decrease encode/decode calls
Decrease string copies

Are there any other things that you can suggest besides these? After determining goals, let's establish priorities and possibly open additional tickets if needed.

Thanks for investigating this and reporting the results! I look forward to working with you!

guyskk commented 4 years ago

Thank you Kurt! Currently no other suggestions in my mind.
And I'm glad to build a RSS dataset so that we can benchmark/profile on it.

thigg commented 3 years ago

It would be really good to have this split up in small issues. With smaller, better specified packages people (like me) could pick up work.

thigg commented 3 years ago

I only spent a short time investigating, but my profiler showed that lot time was spent in sgmllib which is deprecated and hasn't seen updates since 2010. @kurtmckee do you think it could be beneficial to replace it?

thigg commented 2 years ago

Is there any plan to progress this issue? Did anyone do experiments with lxml? Does it make parsing faster?

mlissner commented 1 year ago

We're in the process of parsing about three million RSS feeds from federal courts, many of which are over 1MB in size.

We may take a look at making feedparser faster since it's currently our bottleneck in this project. @kurtmckee I assume such work is still welcome?

kurtmckee commented 1 year ago

Yes, please!

I'm currently working to migrate the test suite to pytest and make sure code coverage is getting checked. It's slow going, but I'm very open to performance improvements!

buhtz commented 1 year ago

Why pytest? I'm shocked but your opinion has value for me. So I ask to learn. :)

IMHO the only good thing about pytest is its commandline tool with nice colored output that does run my unittest-like tests. The problem with pytest-like tests is that they are hard to understand because they hide to much and doing to much things explicit. This is always the "pro" argument on conferences and blog posts: You need less lines of code to write your tests. This isn't a pro but a contra argument. I would argue if someone things there are to many lines in a (unit) test then the test is of low quality.

But it is just my opinion as a less experienced none-professional developer. So I ask to learn. So I think myself: If Kurt migrate to pytest there are very good reasons. I just don't see them. 😄

kurtmckee commented 1 year ago

There are a number of people subscribed to this thread, so -- without turning this into a pytest discussion thread! -- I'll summarize that:

it's possible to run tests in parallel
it's possible to randomize module and test execution order, which can shake out unexpected intradependencies in the test suite
it's possible to parametrize a test, which moves for loops inside a test function to outside the test function, and reports the results on all of the tested values, not just the first one that fails
pytest's dependency injection model makes it easy to clearly state "this test requires such-and-such mock or setup" (such as mocked HTTP responses)

These are some of the reasons I use pytest. I don't want this thread to become a discussion about pytest so I'll respond more in-depth to you privately.

kurtmckee / feedparser

Performance optimize and memory usage optimize #209