rahulbot closed 1 year ago
I took another look at the code behind this and am not convinced that adding media_name to title deduplication does much good. If I'm reading the original code correctly, when you pass in a media_name and the first part of the title is that media_name, it returns the whole title instead of just that first part. That doesn't seem particularly relevant for title deduplication, because if a single site has two URLs with the same title, it is likely to prefix both of them with the site name. Plus our site name data is garbage. I don't think we need to add this logic.
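To make that reading concrete, here is a rough sketch of the behavior as I understand it. This is not the metadata library's code: normalize_title_sketch, the separator pattern, and the lowercasing are all assumptions made for illustration, and the real function may apply other rules that this ignores.

```python
import re
from typing import Optional

# Assumed separators between a site name and a headline: " | ", " - ", ": ".
_SEPARATOR = re.compile(r"\s+[|-]\s+|:\s+")


def normalize_title_sketch(title: str, media_name: Optional[str] = None) -> str:
    """Hypothetical stand-in for the behavior described in this comment."""
    # Lowercase and collapse runs of whitespace.
    cleaned = " ".join(title.lower().split())
    # Take the text before the first separator as the "first part".
    first_part = _SEPARATOR.split(cleaned, maxsplit=1)[0]
    # If the first part is just the media name, fall back to the whole title.
    if media_name is not None and first_part == media_name.strip().lower():
        return cleaned
    return first_part


# Two URLs on the same site carrying the same title, both prefixed by the site name.
a = "Example News | Storm hits coast"
b = "Example News |  Storm Hits Coast"

# The pair compares equal whether or not media_name is supplied.
same_without_name = normalize_title_sketch(a) == normalize_title_sketch(b)
same_with_name = (
    normalize_title_sketch(a, "Example News")
    == normalize_title_sketch(b, "Example News")
)
print(same_without_name, same_with_name)
```

In this sketch, same-site pairs with identical titles match with or without media_name, which is the case I am describing above.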
We've got the basics of title-based deduplication in place (#6), but one piece of the title-based deduplication looks for the media-name in the title and removes it, to produce a better normalized title that can be used for de-duplication. To support this we need this fetcher to have the name for each media source, which it doesn't right now.
The task here is to add in media_name (to the feeds table?) so that it can then be used in the call to create the Story db object, which via Story.from_rss_entry() makes a call to the titles.normalize_title() utility method provided by the metadata library.