deduplication: add/check URL normalization before storing

mediacloud / rss-fetcher

Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.

https://search.mediacloud.org/directory

Apache License 2.0


deduplication: add/check URL normalization before storing #5

Closed: rahulbot closed this issue 2 years ago

rahulbot commented 2 years ago

Our current system applies a heuristic to create a "normalized URL" for each story, then uses it to detect duplicates (background description). This matters because many stories are published under nearly identical URLs. We need to port this logic to this backup fetcher, so that it:

Creates and saves a normalized_url for each story (source code)
Saves future stories only if their normalized_url is unique
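
To make the two steps above concrete, here is a rough sketch of the idea. It is not the Media Cloud heuristic: `simple_normalize`, `filter_new_stories`, and the `utm_` prefix list are hypothetical names and rules chosen for illustration, and the real normalization logic is likely more involved.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of tracking-parameter prefixes; the real heuristic may differ.
TRACKING_PREFIXES = ("utm_",)


def simple_normalize(url: str) -> str:
    """Simplified stand-in for the normalization heuristic (an assumption)."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Drop tracking parameters and sort the rest so parameter order is irrelevant.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    )
    path = parts.path.rstrip("/")
    # Ignore scheme and fragment: http/https and #anchors usually name the same story.
    return urlunsplit(("http", host, path, urlencode(query), ""))


def filter_new_stories(urls, seen):
    """Return URLs whose normalized form is not yet in `seen`; updates `seen`."""
    fresh = []
    for url in urls:
        key = simple_normalize(url)
        if key not in seen:
            seen.add(key)
            fresh.append(url)
    return fresh
```

With this sketch, `https://www.Example.com/story/?utm_source=rss&id=7#top` and `http://example.com/story?id=7` map to the same key, so only the first one seen would be kept.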

rahulbot commented 2 years ago

Hypothesis from Phil - this may matter less in RSS feeds than in spidering. 🤷🏽‍♂️

rahulbot commented 2 years ago

Added a urls.normalized_url method to the mediacloud-metadata library (v0.5.0) that performs this logic. Pass it a URL and it returns the normalized URL, which we can save and use for deduplication.
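
One possible way to wire such a function into the save step is to let the database enforce uniqueness. The sketch below is an assumption, not the fetcher's actual code: the `stories` table, its columns, and the `open_store` / `save_if_unique` helpers are hypothetical. The normalizer is passed in as a parameter because the thread does not show the library's exact import path; in the fetcher it would be the mediacloud-metadata function mentioned above.

```python
import sqlite3
from typing import Callable


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with a hypothetical schema for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE stories ("
        " url TEXT NOT NULL,"
        " normalized_url TEXT NOT NULL UNIQUE)"
    )
    return conn


def save_if_unique(
    conn: sqlite3.Connection, url: str, normalize: Callable[[str], str]
) -> bool:
    """Insert the story unless its normalized URL already exists.

    Returns True if a row was written, False if it was skipped as a duplicate.
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO stories (url, normalized_url) VALUES (?, ?)",
        (url, normalize(url)),
    )
    conn.commit()
    # rowcount is 0 when the UNIQUE constraint caused the insert to be ignored.
    return cur.rowcount == 1
```

Putting the constraint in the schema means concurrent fetch workers cannot both save the same story, which an in-memory "seen" check alone would not guarantee.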