mediacloud / rss-fetcher

Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.
https://search.mediacloud.org/directory
Apache License 2.0
5 stars 5 forks source link

Index stories table by sources_id? #26

Open philbudne opened 8 months ago

philbudne commented 8 months ago

web search (mcweb) feed watcher makes copious use of /api/sources/NNN/stories/fetched-by-day endpoint, which generates queries like:

SELECT CAST(stories.fetched_at AS DATE) AS day, count(*) AS stories
FROM stories
WHERE stories.sources_id = $1::BIGINT GROUP BY CAST(stories.fetched_at AS DATE) ORDER BY day

the stories table currently has the following indices:

Indexes:
    "stories_pkey" PRIMARY KEY, btree (id)
    "stories_domain" btree (domain)
    "stories_feed_id" btree (feed_id)
    "stories_fetched_at" btree (fetched_at)
    "stories_published_at" btree (published_at)
    "unique_story_title" btree (normalized_title_hash, sources_id)
    "unique_story_url" UNIQUE, btree (normalized_url)

So each query has to scan the entire table, which has on the order of 45 million rows.

philbudne commented 8 months ago

From logs:

2023-10-17 02:52:28,017 | INFO | uvicorn.access | 172.19.0.7:53918 - "GET /api/sources/282703/stories/fetched-by-day HTTP/1.1" 200
2023-10-17 02:52:32,683 | INFO | server.util | endpoint: server.sources.fetch_source_stories_fetched_by_day, status: OK, duration: 4.6495983600616455 sec
2023-10-17 02:52:32,684 | INFO | uvicorn.access | 172.19.0.7:53918 - "GET /api/sources/282705/stories/fetched-by-day HTTP/1.1" 200
2023-10-17 02:52:37,626 | INFO | server.util | endpoint: server.sources.fetch_source_stories_fetched_by_day, status: OK, duration: 4.933398485183716 sec
2023-10-17 02:52:37,633 | INFO | uvicorn.access | 172.19.0.7:53918 - "GET /api/sources/282706/stories/fetched-by-day HTTP/1.1" 200
2023-10-17 02:52:42,604 | INFO | server.util | endpoint: server.sources.fetch_source_stories_fetched_by_day, status: OK, duration: 4.952731132507324 sec
2023-10-17 02:52:42,605 | INFO | uvicorn.access | 172.19.0.7:53918 - "GET /api/sources/20563/stories/fetched-by-day HTTP/1.1" 200
2023-10-17 02:52:47,474 | INFO | server.util | endpoint: server.sources.fetch_source_stories_fetched_by_day, status: OK, duration: 4.851433038711548 sec
2023-10-17 02:52:47,477 | INFO | uvicorn.access | 172.19.0.7:53918 - "GET /api/sources/20569/stories/fetched-by-day HTTP/1.1" 200
2023-10-17 02:52:52,322 | INFO | server.util | endpoint: server.sources.fetch_source_stories_fetched_by_day, status: OK, duration: 4.832204580307007 sec
2023-10-17 02:52:52,323 | INFO | uvicorn.access | 172.19.0.7:53918 - "GET /api/sources/282715/stories/fetched-by-day HTTP/1.1" 200
2023-10-17 02:52:57,192 | INFO | server.util | endpoint: server.sources.fetch_source_stories_fetched_by_day, status: OK, duration: 4.851716041564941 sec
2023-10-17 02:52:57,194 | INFO | uvicorn.access | 172.19.0.7:53918 - "GET /api/sources/282723/stories/fetched-by-day HTTP/1.1" 200
2023-10-17 02:53:02,104 | INFO | server.util | endpoint: server.sources.fetch_source_stories_fetched_by_day, status: OK, duration: 4.892181158065796 sec
2023-10-17 02:53:02,105 | INFO | uvicorn.access | 172.19.0.7:53918 - "GET /api/sources/20579/stories/fetched-by-day HTTP/1.1" 200
2023-10-17 02:53:06,869 | INFO | server.util | endpoint: server.sources.fetch_source_stories_fetched_by_day, status: OK, duration: 4.757479190826416 sec
2023-10-17 02:53:06,870 | INFO | uvicorn.access | 172.19.0.7:53918 - "GET /api/sources/20580/stories/fetched-by-day HTTP/1.1" 200
2023-10-17 02:53:11,481 | INFO | server.util | endpoint: server.sources.fetch_source_stories_fetched_by_day, status: OK, duration: 4.602452278137207 sec
2023-10-17 02:53:11,482 | INFO | uvicorn.access | 172.19.0.7:53918 - "GET /api/sources/20581/stories/fetched-by-day HTTP/1.1" 200
2023-10-17 02:53:16,318 | INFO | server.util | endpoint: server.sources.fetch_source_stories_fetched_by_day, status: OK, duration: 4.82752537727356 sec
philbudne commented 8 months ago

The feed watcher can issue 18K queries, at 5 seconds each, than can multiply out to 25 hours!!!