Closed by rahulbot 2 years ago
Hypothesis from Phil - this may matter less in RSS feeds than in spidering. 🤷🏽♂️
Added a urls.normalized_url method to the mediacloud-metadata library (v0.5.0) that performs this logic. Pass it a URL and it returns the normalized URL we can save and use.
Our current system applies a heuristic for creating a "normalized URL" for any given story, and then uses that to look for duplicates (background description). This is important because many stories are published under almost the same URL. We need to port this logic over to this backup fetcher, so that it:
normalized_url
(source code)normalized_url
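For anyone picking this up, here is a rough sketch of the general idea: normalize each URL, then use the normalized form as the duplicate key. This is not the heuristic from the current system or from mediacloud-metadata. The function names (sketch_normalize_url, is_duplicate), the tracking-parameter list, and the specific rules (drop scheme differences, "www.", trailing slash, fragment, tracking params) are all assumptions for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of tracking parameters to ignore; the real heuristic may differ.
TRACKING_PARAMS = {
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    "fbclid", "gclid",
}


def sketch_normalize_url(url: str) -> str:
    """Return an illustrative normalized form of a story URL.

    Assumed rules (not taken from the real system): lowercase the host,
    strip a leading "www." and any port, treat http and https as the same,
    drop the fragment and trailing slash, remove tracking parameters, and
    sort the remaining query parameters.
    """
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()  # hostname excludes any port
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    )
    # Always emit "http" so scheme differences do not create distinct keys.
    return urlunsplit(("http", host, path, urlencode(kept), ""))


def is_duplicate(url: str, seen: set) -> bool:
    """Check a URL against previously seen normalized URLs, recording it if new."""
    key = sketch_normalize_url(url)
    if key in seen:
        return True
    seen.add(key)
    return False
```

In the fetcher, the normalized value would be saved alongside the original URL so the lookup can be a simple equality check on that column rather than re-normalizing every stored story.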