rahulbot closed this issue 2 years ago
I added a titles.normalized_title method that performs this string logic to the mediacloud-metadata library (v0.5.0). Pass it a title string and the media_name, and it returns the normalized title, which we can hash and use to compare stories.
Our current system applies a heuristic for creating a "normalized title" for any given story, and then uses that to look for duplicates (background description). This is important because many stories are published under almost the same title at different URLs. We need to port this logic over to this backup fetcher, so that it:
This is currently implemented as a database function, but could be ported over to Python code applied to each story.
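A rough sketch of what a per-story Python port could look like is below. It is not the database function's logic and not the mediacloud-metadata implementation: the function names (`sketch_normalized_title`, `title_hash`), the specific normalization steps (lowercasing, dropping a media-name segment, stripping punctuation), and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import re
import unicodedata


def sketch_normalized_title(title: str, media_name: str = "") -> str:
    """Hypothetical normalizer for illustration; NOT the real heuristic."""
    text = unicodedata.normalize("NFKC", title).lower()
    name = media_name.strip().lower()
    if name:
        # Assumption: sources often append " - Media Name" or " | Media Name";
        # split on those separators and drop any segment equal to the name.
        parts = re.split(r"\s+[-|:]\s+", text)
        text = " ".join(p for p in parts if p.strip() != name)
    # Replace punctuation with spaces, then collapse runs of whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def title_hash(title: str, media_name: str = "") -> str:
    """Hash the normalized title so stories can be compared cheaply."""
    normalized = sketch_normalized_title(title, media_name)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Two near-identical titles published at different URLs collapse to one hash.
seen = set()
for story_title in ["Big Storm Hits Coast - Example News", "big storm hits coast!"]:
    h = title_hash(story_title, "Example News")
    print(story_title, "-> duplicate" if h in seen else "-> new")
    seen.add(h)
```

Storing the hash alongside each story would let the fetcher check for duplicates with a single lookup instead of comparing title strings pairwise.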