deduplication: add/check normalized title hash

mediacloud / rss-fetcher

Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.

Apache License 2.0

Our current system applies a heuristic to create a "normalized title" for each story, and then uses it to look for duplicates (background description). This matters because many stories are published under almost the same title at different URLs. We need to port this logic over to this backup fetcher, so that it:

Creates and saves normalized_title (source code)
Only saves future stories if they have a unique normalized_title

This is currently implemented as a database function, but it could be ported to Python code applied to each story.
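A rough sketch of what the Python port could look like is below. It is not the existing database function: the normalization rules here (Unicode folding, lowercasing, stripping punctuation, collapsing whitespace) are assumptions standing in for the real heuristic, and the names normalize_title, title_hash, and TitleDeduplicator are hypothetical. The in-memory set is only a stand-in for whatever uniqueness check the fetcher's database would do.

```python
import hashlib
import re
import unicodedata
from typing import Set

# Assumed normalization rules; the real heuristic may differ.
_PUNCT_RE = re.compile(r"[^\w\s]", re.UNICODE)
_SPACE_RE = re.compile(r"\s+")


def normalize_title(title: str) -> str:
    """Return a simplified form of the title for duplicate comparison."""
    text = unicodedata.normalize("NFKC", title).casefold()
    text = _PUNCT_RE.sub(" ", text)          # punctuation becomes a space
    return _SPACE_RE.sub(" ", text).strip()  # collapse runs of whitespace


def title_hash(title: str) -> str:
    """Hash the normalized title so it can be stored and indexed compactly."""
    normalized = normalize_title(title)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class TitleDeduplicator:
    """Hypothetical in-memory stand-in for a database uniqueness check."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def add_if_unique(self, title: str) -> bool:
        """Record the title and return True only if its hash is new."""
        digest = title_hash(title)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True
```

With this sketch, a story titled "Mayor resigns" would be saved, and a later story titled "MAYOR RESIGNS!!" at a different URL would be skipped because both reduce to the same normalized title. In the fetcher itself, the check would more likely be a lookup against a stored normalized_title hash column than a Python set.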

deduplication: add/check normalized title hash #6