support story "auditing" trails?

rahulbot commented 9 months ago

Capturing an idea raised (by @philbudne) about whether it would be useful to have a trail that allowed us to audit where a story came from. I understood the idea to have two goals:

support debugging when we have to dig into quarantines or archive files
provide more transparency (a key goal of the larger project)

Straw-man proposal: Here's a sketch to think through whether this is helpful or not. I could imagine a Story.data_source field that could hold this information. How would we populate that?

For RSS ingest, this would be some reference to the feed it was discovered in.
For historical ingest this could be the download_id and CSV file.
For existing content without this value it could be None.
For sitemap ingest, this could be the sitemap file name.

I'd imagine archiving this along with all the other Story object properties if we adopted it. Is this a useful feature to consider?
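To make the proposal above easier to discuss, here is a minimal sketch of how a Story.data_source value might be populated per ingest path. The `Story` class here is a stand-in, not the real story-indexer class, and the helper names and the string format are hypothetical; only the field name and the four cases come from the proposal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Story:
    """Stand-in for the real Story class; only the proposed field is modeled."""
    url: str
    # None covers existing content that has no recorded origin.
    data_source: Optional[str] = None


# Hypothetical helpers: one per ingest path named in the proposal.
def rss_source(feed_id: int) -> str:
    return f"rss:feed_id={feed_id}"


def historical_source(downloads_id: int, csv_file: str) -> str:
    return f"historical:downloads_id={downloads_id};csv={csv_file}"


def sitemap_source(sitemap_file: str) -> str:
    return f"sitemap:{sitemap_file}"


stories = [
    Story("https://example.com/a", rss_source(1234)),
    Story("https://example.com/b", historical_source(5678, "stories.csv")),
    Story("https://example.com/c", sitemap_source("sitemap-news.xml")),
    Story("https://example.com/d"),  # pre-existing content, origin unknown
]

for story in stories:
    print(story.url, story.data_source)
```

A single string keeps archiving simple; a structured value (kind plus ids) would be easier to query, at the cost of a schema to maintain.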

philbudne commented 9 months ago

Capturing an idea raised (by @philbudne) about whether it would be useful to have a trail that allowed us to audit where a story came from. I understood the idea to have two goals:

support debugging when we have to dig into quarantines or archive files

provide more transparency (a key goal of the larger project)

My interest probably came from watching fetcher logs while debugging my queue based fetcher, and wondering "where did THAT URL come from" (often due to a feed that includes links outside the feed domain).

Another experience that planted the seed was comparing RSS files from the legacy system and trying to track missing stories back to the original RSS feed, and see why we were discarding that data.

Straw-man proposal: Here's a sketch to think through whether this is helpful or not. I could imagine a Story.data_source field that could hold this information. How would we populate that?

For RSS ingest, this would be some reference to the feed it was discovered in.

For historical ingest this could be the download_id and CSV file.

For existing content without this value it could be None.

For sitemap ingest, this could be the sitemap file name.

I'd imagine archiving this along with all the other Story object properties if we adopted it. Is this a useful feature to consider?

I don't doubt it would have SOME use to developers (especially for weird/unexpected stuff/crap that ends up in quarantine), but I'm not sure that's reason enough (and why I hadn't brought the topic back up). BUT now is a better time to ask the question than in a month!

I can easily imagine a researcher asking questions about the origin of a story, so it's as much a question for researchers about how often they looked at the feed id info that was available in the old system.

The following may be rambling, and no doubt redundant, but I don't have enough room in my brain to hold it all and Rahul's strawman points, to deal with them in a comprehensive and comprehensible form!

As for what data might be available:

Current feed-fetcher RSS files contain no such data, but I had pitched the idea of adding some. The rss-fetcher stories table has the following available: feed_id and sources_id. feed_id can be joined against the feeds table to get a feed URL.
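A rough sketch of the join described above, using an in-memory SQLite database so it runs standalone. Only feed_id and sources_id on the stories table are taken from the description; the other column names (feeds.id, feeds.url, stories.url) and the use of SQLite are assumptions for illustration, not the actual rss-fetcher schema.

```python
import sqlite3

# Hypothetical schema: only stories.feed_id and stories.sources_id
# are taken from the discussion; everything else is assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE feeds (id INTEGER PRIMARY KEY, sources_id INTEGER, url TEXT);
    CREATE TABLE stories (id INTEGER PRIMARY KEY, feed_id INTEGER,
                          sources_id INTEGER, url TEXT);
    INSERT INTO feeds VALUES (10, 1, 'https://example.com/rss.xml');
    INSERT INTO stories VALUES (100, 10, 1, 'https://example.com/article');
""")


def feed_url_for_story(conn, story_url):
    """Return the URL of the feed a story was found in, or None."""
    row = conn.execute(
        "SELECT feeds.url FROM stories "
        "JOIN feeds ON feeds.id = stories.feed_id "
        "WHERE stories.url = ?",
        (story_url,),
    ).fetchone()
    return row[0] if row else None


print(feed_url_for_story(conn, "https://example.com/article"))
```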

The RSS spec (https://www.rssboard.org/rss-specification#ltsourcegtSubelementOfLtitemgt) has a <source> tag, with a url attribute, that could be populated with the feed URL. We could add private properties. One reason I didn't leap ahead with this was concern about causing indigestion to IA.
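For illustration, this is roughly what emitting an item with a <source> element could look like using the standard library. The function name and the sample values are made up, and this is not how rss-fetcher currently writes its files; it only shows the element and its url attribute as the RSS 2.0 spec defines them.

```python
import xml.etree.ElementTree as ET


def make_item(link: str, title: str, feed_url: str, feed_name: str) -> ET.Element:
    """Build an RSS <item> whose <source> points back at the originating feed."""
    item = ET.Element("item")
    ET.SubElement(item, "link").text = link
    ET.SubElement(item, "title").text = title
    # RSS 2.0: <source url="...">channel title</source>
    source = ET.SubElement(item, "source", url=feed_url)
    source.text = feed_name
    return item


item = make_item(
    "https://example.com/article",
    "Example article",
    "https://example.com/rss.xml",
    "Example Feed",
)
xml_text = ET.tostring(item, encoding="unicode")
print(xml_text)
```

Because <source> is part of the spec, a conforming consumer should ignore or accept it, whereas private (namespaced) properties are the part more likely to surprise a downstream reader.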

The batch-fetcher currently populates RSSEntry.fetch_date with the date portion of the rss-fetcher generated rss file downloaded from S3, so we already have that bit of information.

The "historical" CSV files have the following fields that are not currently being saved anywhere: stories_id, media_id, feeds_id. The name of the CSV file isn't saved either. I believe "media_id" is the legacy system term for what we now call "sources_id". We wouldn't have ready access to the URL of the original feed. We used the legacy sources and feeds ids as the basis for the new system, BUT many sources and feeds were eliminated as duplicates (so we wouldn't always be able to translate old ids).

CSV fields currently being saved by hist-queuer.py:

downloads_id in Story.RSSEntry.link
url in Story.HTTPMetadata.url
I have pending changes to convert collect_date to a timestamp to be stored in Story.HTTPMetadata.download_timestamp.
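As a sketch of the mapping listed above (including the pending timestamp conversion), the snippet below turns one CSV row into the three destination fields. The column order, the collect_date format ("YYYY-MM-DD HH:MM:SS"), the UTC assumption, and the flat dict of field names are all assumptions for illustration; this is not the hist-queuer.py code.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical sample; column names follow the discussion, values are made up.
SAMPLE = """collect_date,stories_id,media_id,downloads_id,feeds_id,url
2023-01-02 03:04:05,111,22,333,44,https://example.com/article
"""


def row_to_fields(row: dict) -> dict:
    """Map a historical CSV row to the Story fields named above."""
    # Assumed format and timezone for collect_date.
    collected = datetime.strptime(
        row["collect_date"], "%Y-%m-%d %H:%M:%S"
    ).replace(tzinfo=timezone.utc)
    return {
        "RSSEntry.link": row["downloads_id"],
        "HTTPMetadata.url": row["url"],
        "HTTPMetadata.download_timestamp": collected.timestamp(),
    }


rows = [row_to_fields(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
print(rows[0])
```

Note that stories_id, media_id, and feeds_id are read but dropped here, which mirrors the gap described above.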

Also (conceivably) available from S3 is a LastModified date, but that would take another RPC call, and I wasn't planning on going there. (The LastModified date IS retrieved when a downloads_id is in the range that has been used in two different database epochs.)

Since RSSEntry.fetch_date is essentially an excerpt of the name of the source file, it could be populated from the CSV file name.
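A small sketch of pulling a date out of a source file name, whichever kind of file it is. The actual naming of the rss-fetcher RSS files and the historical CSV files isn't given here, so the YYYY-MM-DD pattern and the example file names are assumptions.

```python
import re
from typing import Optional

# Assumes the file name embeds a date as YYYY-MM-DD somewhere.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def fetch_date_from_filename(name: str) -> Optional[str]:
    """Return the first YYYY-MM-DD found in a file name, or None."""
    match = DATE_RE.search(name)
    return match.group(0) if match else None


print(fetch_date_from_filename("stories-2023-01-02.csv"))
print(fetch_date_from_filename("no-date-here.csv"))
```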

Right now RSSEntry.fetch_date unambiguously indicates how the story was introduced (based on date range).

BUT, if sitemap processing is added, and is distinct from rss-fetcher, we might want a field that we can populate with an indication of how we got the URL (sitemap, RSS, historical). It's hard to say whether it will be distinct: the basic polling part seems essentially identical, with the caveat that we would probably let the poll interval for unchanging sitemap pages (i.e., pages with articles posted to the site in a past month) grow longer than we currently allow for RSS files.

The final, and REALLY BIG, question is whether researchers would like to have self-serve access to the information and, if so, whether to store it (without indexing) in ES. Otherwise we would have to search through WARC files to extract the data!

I don't believe RSSEntry.fetch_date is being used downstream, so we may be free to (ab)use it to indicate the source.

rss-fetcher RSS files, historical CSV files, and sitemap pages all have filenames (and fetch_date is currently an excerpt of the RSS file name). WARC files, of course, also have file names, but they also have fetch_date fields, and I think the best way to treat WARC files is as backups that we restore transparently (with no indication of where we restored the data from).

Historical CSV files (and future rss-fetcher RSS files, and sitemap pages) can also give us feed_id and source_id. Stories from WARC files generated to date cannot.

rahulbot commented 9 months ago

Note from meeting discussion: this data about where a story came from is not needed in ES.

philbudne commented 8 months ago

Closing: changes merged in https://github.com/mediacloud/story-indexer/pull/234 (in use for 2023 historical processing)

mediacloud / story-indexer

support story "auditing" trails? #216