mediacloud / rss-fetcher

Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.
https://search.mediacloud.org/directory
Apache License 2.0

deduplication: add/check normalized title hash #6

Closed rahulbot closed 2 years ago

rahulbot commented 2 years ago

Our current system applies a heuristic for creating a "normalized title" for any given story, and then uses that to look for duplicates (background description). This is important because many stories are published under almost the same title at different URLs. We need to port this logic over to this backup fetcher, so that it:

  1. creates and saves a normalized_title for each story (source code)
  2. saves future stories only if they have a unique normalized_title

This is currently implemented as a database function, but it could be ported to Python code applied to each story.
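
For illustration, a minimal sketch of what the Python-side check could look like. The `simple_normalize` function here is a hypothetical stand-in (lowercase, strip punctuation, collapse whitespace), not the actual Media Cloud heuristic, and the in-memory set stands in for whatever table or index the fetcher would really query. It also treats uniqueness as global, which is an assumption.

```python
import re
import unicodedata


def simple_normalize(title: str) -> str:
    """Hypothetical stand-in for the real normalization heuristic."""
    text = unicodedata.normalize("NFKC", title).lower()
    # Replace anything that is not a word character or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return " ".join(text.split())


class TitleDeduper:
    """Tracks normalized titles already seen (in memory, for illustration only)."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def should_save(self, title: str) -> bool:
        """Return True the first time a normalized title appears, False afterwards."""
        key = simple_normalize(title)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

With this sketch, two stories titled "Breaking: Fire Downtown!" and "breaking  fire downtown" would count as duplicates, so only the first would be saved.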

rahulbot commented 2 years ago

I added a titles.normalized_title method that performs this string logic to the mediacloud-metadata library (v0.5.0). Pass it a title string and the media_name, and it returns the normalized title, which we can hash and use for comparison.