Status: Closed (rahulbot closed this 8 months ago)
I ran a script on a random sample of 50k URLs from 2023_12-05. Of those, around 1800 (4%) redirected to a different canonical domain. A little under 50% of those 1800 different-domain redirects came from news.google.com. CSV of results sent on Slack.
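For anyone wanting to reproduce this kind of tally from a results CSV, here is a minimal sketch. It is not the script that was actually run. It assumes the canonical domains have already been extracted into `(original_domain, resolved_domain)` pairs; the function name and the shape of the returned dictionary are hypothetical.

```python
from collections import Counter


def redirect_breakdown(domain_pairs, top_n=5):
    """Summarize how often resolving a URL changed its domain.

    domain_pairs: iterable of (original_domain, resolved_domain) tuples;
    resolved_domain may be None or "" when the URL could not be resolved.
    """
    total = 0
    changed_by_origin = Counter()
    for original, resolved in domain_pairs:
        total += 1
        # Only count rows that resolved AND landed on a different domain.
        if resolved and resolved != original:
            changed_by_origin[original] += 1
    changed = sum(changed_by_origin.values())
    return {
        "total": total,
        "changed": changed,
        "changed_share": changed / total if total else 0.0,
        # Original domains responsible for the most cross-domain redirects.
        "top_origins": changed_by_origin.most_common(top_n),
    }
```

The `top_origins` list is what would surface an aggregator such as news.google.com as the main source of cross-domain redirects.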
Another possible issue arising from historical data CSV files not having the final/resolved URL is duplicate stories:
Closing. We decided to just deal with this as is, based on the volume of redirects.
Background
While ingesting 2023 data (#168), @philbudne noted unresolved URLs in the CSV we generated from the historical archive to support reindexing. After some more investigation, we collectively believe that the fully resolved URL was not stored in an accessible way in the legacy database. This was OK in that system, because `stories` mapped to `feeds`, `feeds` mapped to `sources`, and we searched by `source_id`. In the new system we search by `canonical_domain`, which requires us to know the fully resolved URL to support accurate searching.

Concern
Since we don't seem to have the fully resolved URL, we need to see how much of an issue this is in the historical data. The current overall volume of historically ingested stories suggests this isn't a huge deal, and our gut says this would increase the amount of data un-associated with sources in our system, rather than mis-associate stories with incorrect sources. BUT we need some assessment to understand the scale of this issue so we know how concerned to be.
Task
With a one-day sample of historical data that is fairly recent (Nov?), take all the URLs from the legacy system and follow redirect headers to fully resolve them. I think you can just make an HTTP `HEAD` request instead of a `GET` to make this quicker. This will be a few hundred thousand stories, so it probably needs some simple parallel processing (I use `multiprocessing.Pool` for quick solutions, but your mileage may vary). The goal is to add a `resolved_url` column to the source data. With that, we'd like to assess how many of the rows differ between original URLs and resolved URLs. For instance, how many rows have different `canonical_domain`s (extracted via mediacloud-metadata)?

@philbudne can you please attach, or point to, an appropriate CSV file to use? (and add any clarifying/correcting notes)