add media_name so we can do better deduplication

mediacloud / rss-fetcher

Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.

Apache License 2.0


We have the basics of title-based deduplication in place (#6). One step of that process looks for the media name in the title and removes it, producing a better normalized title for deduplication. To support this, the fetcher needs the name of each media source, which it does not currently have.
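To make the idea concrete, here is a minimal sketch of stripping a media name from a title before normalizing it. It is not the metadata library's implementation: `strip_media_name` and `normalize_for_dedup` are hypothetical names, and the separator set (`-`, `|`, `:`) is an assumption about how sources typically append their name.

```python
import re


def strip_media_name(title: str, media_name: str) -> str:
    """Remove a leading or trailing media name (plus separator) from a title.

    Hypothetical helper for illustration only.
    """
    cleaned = title.strip()
    name = media_name.strip()
    if not name:
        return cleaned
    escaped = re.escape(name)
    sep = r"\s*[-|:]\s*"  # assumed separators between title and media name
    # Match "Media Name: Title" at the start or "Title - Media Name" at the end.
    pattern = rf"^(?:{escaped}{sep})|(?:{sep}{escaped})$"
    return re.sub(pattern, "", cleaned, flags=re.IGNORECASE).strip()


def normalize_for_dedup(title: str, media_name: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace after stripping the name."""
    stripped = strip_media_name(title, media_name)
    no_punct = re.sub(r"[^\w\s]", "", stripped.lower())
    return re.sub(r"\s+", " ", no_punct).strip()
```

With this sketch, "Storm hits coast - Example News" and "Example News: Storm hits coast" both reduce to the same key when the media name is "Example News", which is the property deduplication relies on.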

The task here is to add media_name (to the feeds table?) so that it can be passed along when creating the Story db object. That creation goes through Story.from_rss_entry(), which calls the titles.normalize_title() utility method provided by the metadata library.
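One possible shape for that plumbing is sketched below, purely as an illustration. The `Feed` and `Story` classes here are plain dataclasses standing in for the real db models, the `from_rss_entry` signature is assumed rather than taken from the repository, and `placeholder_normalizer` is a hypothetical stand-in for the metadata library call, whose actual signature is not shown in this issue.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional


@dataclass
class Feed:
    """Stand-in for a row in the feeds table (not the real model)."""
    id: int
    url: str
    media_name: Optional[str] = None  # proposed new column


def placeholder_normalizer(title: str, media_name: Optional[str]) -> str:
    """Hypothetical stand-in for the metadata library's title normalization."""
    text = title.strip().lower()
    if media_name:
        suffix = " - " + media_name.strip().lower()
        if text.endswith(suffix):
            text = text[: -len(suffix)]
    return " ".join(text.split())


@dataclass
class Story:
    """Stand-in for the Story db object (not the real model)."""
    feed_id: int
    title: str
    normalized_title: str

    @classmethod
    def from_rss_entry(
        cls,
        feed: Feed,
        entry: Mapping[str, str],
        normalizer: Callable[[str, Optional[str]], str] = placeholder_normalizer,
    ) -> "Story":
        # The feed carries media_name, so the normalizer can remove it from the title.
        title = entry.get("title", "")
        return cls(
            feed_id=feed.id,
            title=title,
            normalized_title=normalizer(title, feed.media_name),
        )
```

Keeping media_name optional means feeds that have not been backfilled yet still produce a normalized title, just without the media name removed.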


add media_name so we can do better deduplication #11