mediacloud / backend

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.
http://www.mediacloud.org
GNU Affero General Public License v3.0

fix topic story deduping #346

Open hroberts opened 6 years ago

hroberts commented 6 years ago

story deduping is not finding all of the duplicates it should. for example, after deduping, the following stories remained in the NYT:

Opinion | When the World Is Led by a Child - The New York Times
Op-Ed Columnist: When the World Is Led by a Child
Opinion | When the World Is Led by a Child

I think the problem is that the current deduping code only looks for dups of title parts that appear as a complete title at least once. In the above example, the substantive title never appears on its own, so the alternatives cannot be matched.
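A minimal sketch of one possible fix, not the existing backend code: index every title part and group stories that share any sufficiently long part, whether or not that part ever appears as a complete title. The separator pattern, the `MIN_PART_LENGTH` threshold, the `ignored_parts` parameter, and the function names are all hypothetical choices made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical separators: " | ", " - ", and ": ". The real code may split differently.
SEPARATORS = re.compile(r"\s+[|\-]\s+|:\s+")

# Hypothetical threshold so short labels such as "Opinion" are not used for matching.
MIN_PART_LENGTH = 16


def title_parts(title, ignored_parts=frozenset()):
    """Split a title into normalized parts that are long enough to match on."""
    parts = [p.strip().lower() for p in SEPARATORS.split(title)]
    return [p for p in parts if len(p) >= MIN_PART_LENGTH and p not in ignored_parts]


def group_duplicates(titles, ignored_parts=frozenset()):
    """Group titles that share any title part, even one that never stands alone."""
    by_part = defaultdict(list)
    for title in titles:
        for part in set(title_parts(title, ignored_parts)):
            by_part[part].append(title)
    return [group for group in by_part.values() if len(group) > 1]


titles = [
    "Opinion | When the World Is Led by a Child - The New York Times",
    "Op-Ed Columnist: When the World Is Led by a Child",
    "Opinion | When the World Is Led by a Child",
]
# The media name is excluded so that it cannot link unrelated stories.
groups = group_duplicates(titles, ignored_parts={"the new york times"})
```

With the three titles above, all of them land in a single group keyed on "when the world is led by a child". A real implementation would still need to limit matching to stories within the same media source and decide how to treat parts such as section names that are long enough to pass the threshold.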

This is a pressing issue for the election work, because we need this deduped data for the 2017 twitter topics ASAP.

rahulbot commented 6 years ago

Related to #396

rahulbot commented 5 years ago

Either this wasn't actually a priority, or it is done. @hroberts can you tell me which?

hroberts commented 5 years ago

still needs to be done. I manually fixed the data for the particular topic when this proved more difficult than I thought it would be.