Open hroberts opened 6 years ago
Related to #396
Either this wasn't actually a priority, or it is done. @hroberts can you tell me which?
Still needs to be done. I manually fixed the data for the particular topic when this proved more difficult than I thought it would be.
Story deduping is not finding all of the duplicates it should. For example, after deduping, the following stories remained in the nyt:
I think the problem is that the current deduping code only looks for dups of title parts that appear as a complete title at least once. In the above example, the substantive title never appears on its own, so the alternatives cannot be matched.
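To make the suspected failure mode concrete, here is a minimal Python sketch. It is not the actual deduping code: the separator pattern, the function names, the `min_length` guard, and the example titles are all made up for illustration. `dedup_groups_current` mimics the behavior described above (a title part only counts if it also appears as some story's complete title), while `dedup_groups_shared` shows one possible alternative that groups stories by any sufficiently long part they share.

```python
import re
from collections import defaultdict

# Hypothetical separators between title parts: " - ", "|", or ": ".
SEP = re.compile(r"\s+-\s+|\s*\|\s*|:\s+")


def normalize(text):
    """Collapse whitespace and lowercase so comparisons are lenient."""
    return re.sub(r"\s+", " ", text).strip().lower()


def parts(title):
    """Split a title on the separators and normalize each non-empty part."""
    return [normalize(p) for p in SEP.split(title) if p.strip()]


def dedup_groups_current(titles):
    """Group titles by a part only if that part is also a complete title."""
    full_titles = {normalize(t) for t in titles}
    groups = defaultdict(set)
    for title in titles:
        for part in parts(title):
            if part in full_titles:
                groups[part].add(title)
    return {p: g for p, g in groups.items() if len(g) > 1}


def dedup_groups_shared(titles, min_length=20):
    """Group titles by any part of at least min_length characters that two
    or more titles share. min_length is an assumed guard against grouping
    on short parts such as section names."""
    groups = defaultdict(set)
    for title in titles:
        for part in parts(title):
            if len(part) >= min_length:
                groups[part].add(title)
    return {p: g for p, g in groups.items() if len(g) > 1}


# Made-up titles: the substantive part never appears as a title on its own.
example_titles = [
    "Opinion | A Made-Up Headline About the Election",
    "A Made-Up Headline About the Election - The New York Times",
]
print(dedup_groups_current(example_titles))
print(dedup_groups_shared(example_titles))
```

With these two titles the first function finds no groups, because neither part is a complete title anywhere, while the second groups both stories under the shared part. A length threshold is only one way to avoid false matches on parts like the publication name; a real fix would probably need something more careful.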
This is a pressing issue for the election work, because we need this deduped data for the 2017 twitter topics ASAP.