mediacloud / rss-fetcher

Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.
https://search.mediacloud.org/directory
Apache License 2.0

add media_name so we can do better deduplication #11

Closed rahulbot closed 1 year ago

rahulbot commented 2 years ago

We've got the basics of title-based deduplication in place (#6), but one piece of it looks for the media name in the title and removes it, producing a better normalized title for deduplication. To support this, the fetcher needs the name of each media source, which it doesn't have right now.

The task here is to add media_name (to the feeds table?) so that it can be passed in when creating the Story db object. Story.from_rss_entry() in turn calls the titles.normalize_title() utility method provided by the metadata library.
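
For illustration only, here is a rough sketch of what stripping a media name from a title before deduplication might look like. The function name normalize_title_sketch and its signature are hypothetical; this is not the metadata library's titles.normalize_title(), whose actual behavior should be checked in its source.

```python
import re
from typing import Optional


def normalize_title_sketch(title: str, media_name: Optional[str] = None) -> str:
    """Hypothetical stand-in for a title normalizer (not the real library call)."""
    text = title.strip().lower()
    if media_name:
        name = media_name.strip().lower()
        # Assumption: the site name appears as a prefix followed by ":", "|" or "-".
        prefix = r"^" + re.escape(name) + r"\s*[:|\-]\s*"
        text = re.sub(prefix, "", text)
    # Replace remaining punctuation with spaces, then collapse whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


with_name = normalize_title_sketch("Example News: Flood hits town", "Example News")
without_name = normalize_title_sketch("Example News: Flood hits town")
```

With a media_name supplied, the prefix is dropped before comparison; without it, the site name stays in the normalized title.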

rahulbot commented 1 year ago

I took another look at the code behind this and am not convinced that adding media_name to title deduplication does much good. If I'm reading the original code correctly, then if you pass in a media_name, and the first part of the title is the media_name, it returns the whole title instead of that first part. This doesn't seem particularly relevant for title deduplication, because if a single site has two URLs with the same title, it is likely to prefix them both with the site name. Plus our site name data is garbage. I don't think we need this logic added.
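
To make the second point concrete, here is a small sketch assuming deduplication compares titles within a single source. The helper simple_key and the example titles are made up for illustration and are not taken from the fetcher or the metadata library.

```python
def simple_key(title, media_name=None):
    """Hypothetical dedup key: lowercase, optionally drop a leading site name."""
    text = title.lower().strip()
    if media_name and text.startswith(media_name.lower()):
        # Drop the site name and any separator characters that follow it.
        text = text[len(media_name):].lstrip(" :|-")
    return " ".join(text.split())


# Two URLs from the same site with the same title, both prefixed by the site name.
titles = ["Example News: Flood hits town", "Example News: Flood hits town"]

keys_with_name = {simple_key(t, "Example News") for t in titles}
keys_without_name = {simple_key(t) for t in titles}

# Either way the two stories collapse to one key, so stripping changes nothing here.
same_outcome = len(keys_with_name) == len(keys_without_name) == 1
```

Under that assumption, the two titles match whether or not the prefix is stripped, which is why the extra media_name step adds little for same-site duplicates.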