mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

assess redirect volume in historical data #251

Closed rahulbot closed 8 months ago

rahulbot commented 9 months ago

Background While ingesting 2023 data (#168) @philbudne noted un-resolved URLs in the CSV we've generated from the historical archive to support reindexing. After some more investigating, we collectively believe that the fully resolved URL was not stored in an accessible way in the legacy database. This was OK in that system, because stories mapped to feeds, and feeds mapped to sources, and we searched by source_id. In the new system we search by canonical_domain, which requires us to know the fully resolved URL to support accurate searching.

Concern Since we don't seem to have the fully resolved URL, we need to see how much of an issue this is in the historical data. The current overall volume of historically ingested stories suggests this isn't a huge deal, and our gut says this would increase data un-associated with sources in our system, not mis-associate stories with incorrect sources. BUT we need some assessment to understand the scale of this issue so we know how concerned to be.

Task With a one-day sample of historical data that is fairly recent (Nov?), take all the URLs from the legacy system and follow redirects to fully resolve them. I think you can just issue an HTTP HEAD request instead of a GET to make this quicker. This will be a few hundred thousand stories, so it probably needs some simple parallel processing (I use multiprocessing.Pool for quick solutions, but your mileage may vary). The goal is to add a resolved_url column to the source data. Then with that we'd like to assess how many of the rows have differences between original URLs and resolved URLs. For instance, how many rows have different canonical_domains (extracted via mediacloud-metadata)?
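
A minimal sketch of the task above, not the script that was actually run. Assumptions: the rows are dicts with a `url` column (the real CSV header may differ), a thread pool (`multiprocessing.pool.ThreadPool`) stands in for `multiprocessing.Pool` because the work is network-bound, and `naive_domain` is a hypothetical placeholder for the canonical domain extraction in mediacloud-metadata, which should be used for any real count.

```python
import urllib.request
from multiprocessing.pool import ThreadPool
from urllib.parse import urlsplit


def resolve_url(url, timeout=10):
    """Return the final URL after redirects, or '' if the request fails."""
    # Send a HEAD request and let urllib follow any redirects;
    # geturl() reports the URL of the response that was finally retrieved.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.geturl()
    except Exception:
        # Dead links, timeouts, and servers that reject HEAD all end up here.
        return ""


def naive_domain(url):
    """Hypothetical stand-in for canonical domain extraction: bare hostname."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def add_resolved_urls(rows, resolver=resolve_url, workers=32):
    """Return copies of rows with a resolved_url column added."""
    urls = [row["url"] for row in rows]  # 'url' column name is an assumption
    with ThreadPool(workers) as pool:
        resolved = pool.map(resolver, urls)  # results keep input order
    return [dict(row, resolved_url=final) for row, final in zip(rows, resolved)]


def summarize(rows, domain_of=naive_domain):
    """Count rows whose resolved URL is missing, same-domain, or different-domain."""
    counts = {"unresolved": 0, "same_domain": 0, "different_domain": 0}
    for row in rows:
        if not row["resolved_url"]:
            counts["unresolved"] += 1
        elif domain_of(row["url"]) == domain_of(row["resolved_url"]):
            counts["same_domain"] += 1
        else:
            counts["different_domain"] += 1
    return counts
```

Rows read with `csv.DictReader` can be passed straight to `add_resolved_urls`, and the output written back with `csv.DictWriter`. Passing a different `resolver` or `domain_of` makes it easy to dry-run the counting logic without touching the network.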

@philbudne can you please attach, or point to, an appropriate CSV file to use? (and add any clarifying/correcting notes)

NullPxl commented 9 months ago

I ran a script on a random sample of 50k URLs from 2023_12-05. Of those, around 1800 (4%) resulted in a redirect to a different canonical domain. A little under 50% of the 1800 different-domain-redirects were from news.google.com. CSV of results sent on Slack.

philbudne commented 9 months ago

Another possible issue arising from historical data CSV files not having the final/resolved URL is duplicate stories:

  1. Historical stories (not picked up by new system) with multiple URLs that redirected to same final URL.
  2. Historical stories picked up by new system, but stored keyed by final URL.
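
The first case in the list above could be measured on a resolved sample roughly as follows. This is a sketch only: it assumes each row is a dict with `url` and `resolved_url` columns (hypothetical names), and it treats an empty `resolved_url` as "could not be resolved".

```python
from collections import defaultdict


def find_duplicates(rows):
    """Map each final URL to its original URLs when more than one redirects to it."""
    by_final = defaultdict(list)
    for row in rows:
        if row["resolved_url"]:  # skip rows that could not be resolved
            by_final[row["resolved_url"]].append(row["url"])
    return {final: urls for final, urls in by_final.items() if len(urls) > 1}
```

The second case needs a lookup against the new index by final URL, which is outside what a CSV-only check can show.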

rahulbot commented 8 months ago

Closing. We decided to just deal with this as is, based on the volume of redirects.