Open guyskk opened 4 years ago
Another option is to set sanitize_html and resolve_relative_uris to False and then use lxml to process the feedparser result. This way I only process the fields I need, not all fields.
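Roughly, that idea could look like the sketch below. It assumes the sanitize_html and resolve_relative_uris keyword arguments of feedparser.parse() (as in feedparser 6). The helper names are hypothetical, and urljoin from the standard library stands in for the lxml step so the example has no extra dependencies; HTML content fields would still need something like lxml.

```python
from urllib.parse import urljoin


def parse_without_postprocessing(source):
    """Parse a feed with sanitizing and relative-URI resolution switched off."""
    # feedparser is third-party; imported lazily so the helper below
    # can be used (and tested) without it installed.
    import feedparser

    return feedparser.parse(source, sanitize_html=False, resolve_relative_uris=False)


def resolve_needed_fields(entry, base_url, fields=("link",)):
    """Return a copy of `entry` with only the named URL fields made absolute.

    Hypothetical helper: it touches just the fields the caller asks for,
    instead of walking every field of every entry.
    """
    resolved = dict(entry)
    for name in fields:
        value = resolved.get(name)
        if value:
            resolved[name] = urljoin(base_url, value)
    return resolved
```

The second helper only needs a dict-like entry, so it can be applied to the entries I actually display and skipped for the rest.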
@guyskk I agree that there is a lot of room for improvement, and I'm really grateful for the analysis you've already done. I'd like to draw from your experience with RSSAnt, and improve feedparser so that it's meeting your needs better.
I want to update this ticket so that it has a clear objective and can be driven to closure. I'm seeing several suggestions here, including:
- Moving resolve_relative_uris to C
- Moving _sanitize_html to C code (note that I would like to strip this custom code and use an external package)
Are there any other things that you can suggest besides these? After determining goals, let's establish priorities and possibly open additional tickets if needed.
Thanks for investigating this and reporting the results! I look forward to working with you!
Thank you Kurt! I have no other suggestions at the moment.
I'd also be glad to build an RSS dataset so that we can benchmark and profile against it.
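A benchmark over such a dataset could be as small as the sketch below. The function names and the directory layout (one *.xml file per feed) are assumptions for illustration; `parse` is any callable that accepts the raw feed bytes, for example feedparser.parse.

```python
import time
from pathlib import Path


def load_dataset(directory):
    """Read every *.xml file in `directory` into a {filename: bytes} mapping."""
    return {path.name: path.read_bytes() for path in sorted(Path(directory).glob("*.xml"))}


def time_parser(parse, payloads, repeats=3):
    """Return the best wall-clock time in seconds for parsing each payload.

    Taking the minimum over a few repeats reduces noise from other
    processes on the machine.
    """
    timings = {}
    for name, data in payloads.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            parse(data)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return timings
```

Running time_parser before and after a change, on the same payloads, gives a per-feed comparison rather than a single aggregate number.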
It would be really good to have this split up into small issues. With smaller, better-specified packages of work, people (like me) could pick things up.
I only spent a short time investigating, but my profiler showed that a lot of time was spent in sgmllib, which is deprecated and hasn't seen updates since 2010. @kurtmckee do you think it could be beneficial to replace it?
Is there any plan to progress this issue? Did anyone do experiments with lxml? Does it make parsing faster?
We're in the process of parsing about three million RSS feeds from federal courts, many of which are over 1MB in size.
We may take a look at making feedparser faster since it's currently our bottleneck in this project. @kurtmckee I assume such work is still welcome?
Yes, please!
I'm currently working to migrate the test suite to pytest and make sure code coverage is getting checked. It's slow going, but I'm very open to performance improvements!
Why pytest? I'm shocked but your opinion has value for me. So I ask to learn. :)
IMHO the only good thing about pytest is its command-line tool, which runs my unittest-style tests and gives nice colored output. The problem with pytest-style tests is that they are hard to understand because they hide too much and do too many things implicitly. This is always the "pro" argument at conferences and in blog posts: you need fewer lines of code to write your tests. This isn't a pro but a contra argument. I would argue that if someone thinks there are too many lines in a (unit) test, then the test is of low quality.
But that is just my opinion as a less experienced, non-professional developer, so I ask to learn. I tell myself: if Kurt migrates to pytest, there must be very good reasons. I just don't see them. 😄
There are a number of people subscribed to this thread, so -- without turning this into a pytest discussion thread! -- I'll summarize that:
for loops inside a test function to outside the test function, and reports the results on all of the tested values, not just the first one that fails
These are some of the reasons I use pytest. I don't want this thread to become a discussion about pytest so I'll respond more in-depth to you privately.
I find that feedparser is very slow when parsing large feeds, and it's not fast when parsing small feeds either.
For example (about 5 MB):
https://aotu.io/atom.xml
feedparser takes 15 seconds to parse the feed, while another parser written in Go (gofeed) takes only 100ms.
Another example (about 300KB):
http://ohmymedia.com/feed/
feedparser takes 400ms while gofeed takes less than 10ms.
So I think there's much room for improvement. I did some analysis using pyinstrument, and it shows that resolve_relative_uris and _sanitize_html take most of the time. If we replace them with lxml or another C implementation, it would be very fast.
Memory usage is not very efficient either: string copy, encode, and decode operations cost lots of memory. I think there is room for improvement here too, but I haven't analyzed it in depth yet.
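For anyone who wants to reproduce the hot-spot analysis, here is a minimal sketch. It uses cProfile from the standard library as a stand-in for pyinstrument, and the helper names are hypothetical.

```python
import cProfile
import pstats


def profile_call(func, *args, **kwargs):
    """Run func(*args, **kwargs) under cProfile; return (result, stats)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    return result, pstats.Stats(profiler)


def top_functions(stats, limit=5):
    """Return [(function_name, cumulative_seconds)], slowest first.

    stats.stats maps (file, line, name) to (cc, nc, tt, ct, callers),
    where ct is the cumulative time spent in the function and its callees.
    """
    rows = sorted(stats.stats.items(), key=lambda item: item[1][3], reverse=True)
    return [
        (name, ct)
        for (_file, _line, name), (_cc, _nc, _tt, ct, _callers) in rows[:limit]
    ]


# Example use, assuming feedparser is installed and `data` holds a feed:
# result, stats = profile_call(feedparser.parse, data)
# print(top_functions(stats, limit=10))
```

Sorting by cumulative time is what surfaces wrapper functions such as the two named above, since their cost includes everything they call.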
I'm the author of RSSAnt, an RSS reader web app, and I use feedparser to parse feeds. Performance is critical for me, and I'd be glad to implement performance optimizations for feedparser.
Do you have any suggestions?