mediacloud / rss-fetcher

Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.
https://search.mediacloud.org/directory
Apache License 2.0

initial commit for perf improvement in tasks save_stories_from_feed #23

Open opme opened 1 year ago

opme commented 1 year ago

Improve performance when saving stories from a feed. In the case where 500 stories need duplicate checking, this reduces the SQL statements sent to the database from 1000 to 2. Interested in feedback on this PR. So far it has been tested against the small dokku feed set; I have the full 110k feeds running on a cloud server and will deploy there as well.
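
To make the intent concrete, here is a minimal sketch of the two-query approach, assuming a SQLAlchemy `Story` model with `url`, `normalized_title_hash`, and `sources_id` columns and an ORM `Session`; the names are illustrative and not necessarily what this PR actually uses:

```python
# Illustrative sketch only: collapse the per-story duplicate checks into two queries.
# `Story`, `session`, and the column names are assumptions, not the project's exact API.
from sqlalchemy import select

def filter_new_entries(session, sources_id, entries):
    """entries: list of dicts with 'url' and 'normalized_title_hash' keys."""
    urls = [e["url"] for e in entries]
    title_hashes = [e["normalized_title_hash"] for e in entries]

    # Query 1: every already-seen URL, fetched in a single round trip.
    seen_urls = set(session.scalars(select(Story.url).where(Story.url.in_(urls))))

    # Query 2: every already-seen title hash for this source.
    seen_titles = set(
        session.scalars(
            select(Story.normalized_title_hash).where(
                Story.sources_id == sources_id,
                Story.normalized_title_hash.in_(title_hashes),
            )
        )
    )

    # Keep only entries that are new by both URL and title.
    return [
        e
        for e in entries
        if e["url"] not in seen_urls
        and e["normalized_title_hash"] not in seen_titles
    ]
```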

Todo:

  * add an index to Stories.sources_id (a migration sketch follows below)
  * batch insert of new stories into the stories table?
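
For the first todo item, an alembic migration along these lines would add the index; the revision identifiers are placeholders and this is only a sketch, not the project's actual migration:

```python
"""add index on stories.sources_id (illustrative sketch)"""
from alembic import op

# placeholder revision identifiers
revision = "xxxxxxxxxxxx"
down_revision = "yyyyyyyyyyyy"

def upgrade():
    op.create_index("ix_stories_sources_id", "stories", ["sources_id"])

def downgrade():
    op.drop_index("ix_stories_sources_id", table_name="stories")
```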

opme commented 1 year ago

Running the full 110K feeds now with this pull request. Database CPU and network usage are reduced.
The setup is a 2-CPU app server with 16 workers and a 1-CPU postgres server. I still want to reduce the spikes.

[screenshots: perfdb12hour, perfnetwork_new_old3]

The feed table has the most reads:

[screenshot: perfreads]

Current CPU on the app server. I want to get it more even, without the hills and valleys: it maxes out the 2 CPUs for some time, then falls to almost zero.

[screenshot: app server CPU]

philbudne commented 1 year ago

A few thoughts:

  1. I think we check sources_id in the titles query because we want to allow the same title across different sources, but we want to avoid duplicate URLs in the output.
  2. Our current environment is (old) dedicated hardware with lots of free CPU cycles. Depending on your workload and available memory, it might make sense to keep an in-memory (redis?) cache for titles and URLs (if you don't have lots of memory, redis will end up paging and causing I/O); a rough sketch of that idea follows after this list.

     I'm not convinced an SQL stories table is the best solution: it's a write-mostly table, and (ideally) it should contain all URLs/titles ever seen (I put in table archiving so that backups don't become more and more onerous over time). My original thought was to make stories a partitioned table (with one table per month), but I couldn't find any ready-to-use implementations of partitioned table management in alembic (never mind the fun of migrating back and forth). At least with table partitions we'd only need to back up the current month (past months would be read-only).

     And I can't help thinking a non-SQL store might suit our needs as well (or better), and we may need to implement a duplicate-URL test in other contexts, which means implementing the solution as a library or service might make sense.
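
To make the cache idea above concrete, here is a rough sketch using redis-py; the key name, the hashing, and the choice of a plain Redis set are all invented for illustration, not a settled design:

```python
# Rough sketch of a Redis-backed duplicate-URL check, as floated above.
# The key name and normalization scheme are invented for illustration.
import hashlib

import redis

r = redis.Redis()

def is_new_url(url: str) -> bool:
    """Return True the first time a URL is seen, False afterwards.

    SADD returns the number of members actually added, so 1 means "new".
    """
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return r.sadd("rss-fetcher:seen-urls", digest) == 1
```

A set that has to remember every URL/title ever seen keeps growing, of course, which is exactly the memory (and paging) concern noted above.
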
philbudne commented 1 year ago
  1. The reason we don't do batch inserts into the stories table is that the table has a unique constraint on the url column, so (at the very least) we'd need to do the insert with "on conflict ignore"; a sketch of such an insert follows below. I don't have enough past history with the project, postgresql, or RDBs in general to be sure that querying for matches first has any benefit.
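
For what it's worth, a batched insert that tolerates the unique constraint on url can be sketched with SQLAlchemy's PostgreSQL dialect as below; `Story`, `session`, and the column values are stand-ins, and whether this actually beats querying for matches first is the open question:

```python
# Sketch of a batched insert that skips rows violating the unique url constraint.
# `Story` and `session` are assumed stand-ins for the project's model and session.
from sqlalchemy.dialects.postgresql import insert

def insert_stories(session, rows):
    """rows: list of dicts of column values for the stories table."""
    stmt = insert(Story).values(rows).on_conflict_do_nothing(index_elements=["url"])
    session.execute(stmt)
    session.commit()
```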

Lastly, let me warn you that the latest revision of the code (v0.14) has a new implementation of RSS fetch scheduling; it's in its infancy and not at ALL well tuned, either in terms of database access or algorithmic complexity (I can point to many places that are likely to have O(n^2) worst-case behavior)!