mediacloud / rss-fetcher

Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.
https://search.mediacloud.org/directory
Apache License 2.0

save_stories_from_feed performance improvement #22

Open opme opened 1 year ago

opme commented 1 year ago

I was reading save_stories_from_feed in tasks.py, and it appears to make one database call per feed entry to check for duplicates.

The per-entry normalized_url_exists check could be replaced by a single database query that checks all feed entries at once.

There could be a function called getValidFeedEntries that applies the existing logic in save_stories_from_feed for skipping invalid entries.

Then a single database call could identify which entries are duplicates, followed by one bulk insert and one commit.
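
To make the idea concrete, here is a rough sketch of the flow I have in mind. It is not the project's actual code: it uses the standard-library sqlite3 module so it runs on its own, and the `stories` table, its columns, the dict-shaped entries, and the helpers `get_valid_feed_entries`, `normalize_url`, `find_existing_urls`, and `save_stories_bulk` are all hypothetical stand-ins for whatever tasks.py really uses.

```python
import sqlite3


def get_valid_feed_entries(entries):
    """Keep entries with a non-empty link (stand-in for the real skip rules)."""
    return [e for e in entries if e.get("link")]


def normalize_url(url):
    """Placeholder normalization; the real project has its own logic."""
    return url.strip().lower().rstrip("/")


def find_existing_urls(conn, normalized_urls):
    """Look up all candidate URLs with one SELECT ... IN (...) query."""
    if not normalized_urls:
        return set()
    placeholders = ",".join("?" for _ in normalized_urls)
    rows = conn.execute(
        "SELECT normalized_url FROM stories "
        f"WHERE normalized_url IN ({placeholders})",
        list(normalized_urls),
    )
    return {row[0] for row in rows}


def save_stories_bulk(conn, feed_id, entries):
    """Filter, dedupe, and insert a feed's entries using one query and one commit."""
    valid = get_valid_feed_entries(entries)

    # Collapse duplicates inside the feed itself, keeping the first occurrence.
    by_url = {}
    for entry in valid:
        by_url.setdefault(normalize_url(entry["link"]), entry)

    # Single round trip to find which URLs are already stored.
    existing = find_existing_urls(conn, list(by_url))

    new_rows = [
        (feed_id, entry["link"], norm)
        for norm, entry in by_url.items()
        if norm not in existing
    ]

    # Bulk insert, then a single commit for the whole feed.
    conn.executemany(
        "INSERT INTO stories (feed_id, url, normalized_url) VALUES (?, ?, ?)",
        new_rows,
    )
    conn.commit()
    return len(new_rows)
```

One caveat with this approach: databases cap the number of bound parameters per statement, so a very large feed might need the IN (...) lookup split into chunks. A unique constraint on the normalized URL would probably still be worth having as a safety net against concurrent inserts.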

If that sounds reasonable, I can give it a try. This looks likely to become the eventual bottleneck of this implementation?