mediacloud / rss-fetcher

Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.
https://search.mediacloud.org/directory
Apache License 2.0
5 stars 5 forks source link

generated daily RSS file URL counts don't match stories table daily counts #35

Closed rahulbot closed 4 months ago

rahulbot commented 4 months ago

In investigating #34 @philbudne found that while the daily RSS file generated for 2024-02-04 found 348,013 URLs to include (log msg), a number of queries show that there are 375,464 URLs in the database with a fetched_at date of that day. What explains this gap? Are there URLs in rss-fetcher that aren't making it to the daily file for some reason?

philbudne commented 4 months ago

Slightly cheering data point: Checked the most recently generated rss file vs the database and got the same count.

rss_fetcher=# select count(1) from stories where fetched_at >= '2024-02-07' and fetched_at < '2024-02-08';
 count  
--------
 556492
(1 row)

rss_fetcher=# 
\q
Connection to tarbell.angwin closed.
pbudne@tarbell:/space/dokku/data/storage/rss-fetcher-storage/logs$ grep Found gen_rss.fetcher.1.log
2024-02-08 00:27:12,904 | INFO | gen_rss |    Found 556492 stories

And the date/time in the log file matches the date on the RSS file (viewed as UTC):

pbudne@tarbell:/space/dokku/data/storage/rss-fetcher-storage/rss-output-files$ TZ=UTC ls -lt mc-2024-02-07.rss.gz 
-rw-r--r-- 1 rbharti 32767 50903946 Feb  8 00:27 mc-2024-02-07.rss.gz
rahulbot commented 4 months ago

We need to understand how much of a problem this is, or is not. A proposed data analysis task: audit the last 60-ish days of daily RSS files and compare them the RSS fetcher database to produce a report about how many URLs were left out of the RSS files (by day). The expected number is 0 -- ie. every URL in the database stories table for a given day should be in that day's RSS file. The report is that for at least one day (2024-02-04) it was a noticeable number.

@nullpxl this is a task we could use your help on. You can download the public RSS files from https://mediacloud-public.s3.amazonaws.com/backup-daily-rss/mc-{fetch_date}.rss.gz (where fetch_date is "YYYY-MM-DD" I think). @philbudne could you DM him a recent backup of the rss-fetcher Postgres DB? First task could be to just compare counts by day, highlighting how many days were non-zero and a total. Second task could be generating RSS files (in the same fashion) that include those URLs that were missed, so we can later figure out how to add them to the "hopper" for fetching. Just let know if you need more background context.

NullPxl commented 4 months ago

Oddly I'm actually getting the same number of URLs from the past 60 RSS feed backups and the DB csv export (when split by day). Working through getting any links that might differ; unfortunately the RSS links seem to have html encodings for things like ampersands while the URLs in the db do not. image

NullPxl commented 4 months ago

After getting rid of all differences caused by html encoding and whitespace, the days seem to all have the same URLs.

I believe there are a few likely areas that could be causing this discrepancy. A logic issue at the 'date grouping' stage at some point of processing (for example, I used df.groupby(df['fetched_at'].dt.date) to split up the csv from the db backup, maybe this makes faulty assumptions?), something weird with the PostgreSQL db query that showed the mismatch, or a 'counting' issue, e.g., I noticed that some URLs have additional newlines in the middle of them so maybe it's possible PostgreSQL double counts those.

rahulbot commented 4 months ago

In case it helps, one of the drivers of this is the npr.org stories that are "missing". As noted on #34, story id 543540153 is in the database but not in mc-2024-02-03.rss. Can you replicate this finding about this particular story with the data you have in hand?

rss_fetcher=# select * from stories where title like '%basketball fan frenzy over Iowa%';
-[ RECORD 1 ]---------+--------------------------------------------------------------------------------------------
id                    | 543540153
feed_id               | 1992002
url                   | https://www.npr.org/2024/02/03/1228858826/caitlin-clark-iowa-basketball-ncaa-frenzy
guid                  | https://www.npr.org/2024/02/03/1228858826/caitlin-clark-iowa-basketball-ncaa-frenzy
published_at          | 2024-02-03 20:31:58
fetched_at            | 2024-02-03 21:42:25.545889
domain                | npr.org
title                 | Why there's a basketball fan frenzy over Iowa's Caitlin Clark
normalized_url        | http://npr.org/2024/02/03/1228858826/caitlin-clark-iowa-basketball-ncaa-frenzy
normalized_title      | why there's a basketball fan frenzy over iowa's caitlin clark
normalized_title_hash | 923f6ed6cc7fd5f5ef1a3da119d92d7e
sources_id            | 1096
NullPxl commented 4 months ago

Just to quickly check, I re-downloaded mc-2024-02-03.rss and it seems that that npr story mentioned above is present in the file (it was also in the file I had downloaded previously). 1

Also, from the text at the top of this issue:

the daily RSS file generated for 2024-02-04 found 348,013 URLs to include (log msg), a number of queries show that there are 375,464 URLs in the database with a fetched_at date of that day

Is it possible there was a miscommunication here? From my data, '348,013' is the URL count for 2024-02-04, while '375,464' is the URL count from 2024-02-03 2

philbudne commented 4 months ago

Yes, it's ENTIRELY possible I counted the wrong file to compare to the SQL query (confusing the date the file was generated vs the date in the file name).

I looked at the counts in response to the question about a missing npr.org URL, which (as rahul pointed out) is https://github.com/mediacloud/rss-fetcher/issues/34

And it DOES look like the four entries found in the DB query are all present in the RSS file:

@.***:/space/dokku/data/storage/rss-fetcher-storage/rss-output-files$ zgrep 'basketball fan frenzy over Iowa' mc-2024-02-03.rss.gz

https://kjzz.org/content/1870234/why-theres-basketball-fan-frenzy-over-iowas-caitlin-clarkSat, 03 Feb 2024 20:31:00 -0000kjzz.orgWhy there's a basketball fan frenzy over Iowa's Caitlin Clark https://www.kios.org/2024-02-03/why-theres-a-basketball-fan-frenzy-over-iowas-caitlin-clarkSat, 03 Feb 2024 20:31:58 -0000kios.orgWhy there's a basketball fan frenzy over Iowa's Caitlin Clark https://upstract.com/x/1e1aefc8fd48fa7f?ref=rssSat, 03 Feb 2024 20:44:17 -0000upstract.comWhy there's a basketball fan frenzy over Iowa's Caitlin Clark https://www.npr.org/2024/02/03/1228858826/caitlin-clark-iowa-basketball-ncaa-frenzySat, 03 Feb 2024 20:31:58 -0000npr.orgWhy there's a basketball fan frenzy over Iowa's Caitlin Clark

But only three are present in the WARC files (entries generated after the article is successfully entered into ES):

@.**:/srv/data/docker/indexer/worker_data/archiver/2024/02/05$ ls | xargs -n5 -P30 zgrep -B20 -A20 -ai 'normalized_article_title.basketball fan frenzy' | grep final_url mc-20240205055830-194-8c2594d29883.warc.gz- "final_url": "https://www.kios.org/2024-02-03/why-theres-a-basketball-fan-frenzy-over-iowas-caitlin-clark", mc-20240205000515-132-8c2594d29883.warc.gz- "final_url": "https://upstract.com/x/1e1aefc8fd48fa7f?ref=rss", mc-20240205020532-164-8c2594d29883.warc.gz- "final_url": "https://kjzz.org/content/1870234/why-theres-basketball-fan-frenzy-over-iowas-caitlin-clark",

It looks like the fetch of the NPR article failed. Details in #34.

rahulbot commented 4 months ago

Ok. Sounds like this was date confusion in the end, and not a real error. Good learning here on how to trace URLs through our system and where we to best log 👍🏽. Closing.