mediacloud / sous-chef

Configurable Data Analytics Pipeline

Consider approaches to sentence-based deduplication #18

Open rahulbot opened 4 months ago

rahulbot commented 4 months ago

As documented in https://github.com/mediacloud/story-indexer/issues/278, we're seeing instances of headlines appearing at the tail end of stories and polluting results. We should consider the original idea of moving this stage of deduplication (which we used to do in the legacy system) to a sous-chef feature. The idea would be to do something like (a) tokenize each story by sentence, (b) remove duplicate sentences from stories after their first appearance, and (c) remove stories from the corpus that no longer match the query after sentence deduplication. This is non-trivial and will take some design work.
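A rough sketch of what steps (a) through (c) could look like, to make the design discussion more concrete. Everything named here is hypothetical and not part of sous-chef: the function names, the `{"id", "text"}` story shape, and the `query_matches` callback are placeholders. It also assumes "first appearance" means first across the whole corpus in iteration order (the scope could just as well be per source or per time window), and it uses a naive regex splitter where a real implementation would likely want a proper sentence tokenizer.

```python
import re

# Naive sentence boundary: whitespace after ., ! or ?, or any run of newlines.
# Assumption: good enough for a sketch; abbreviations like "Dr." will be split wrongly.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+|\n+")


def split_sentences(text):
    """Step (a): split a story's text into trimmed, non-empty sentences."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s and s.strip()]


def normalize(sentence):
    """Collapse whitespace and lowercase so trivial variants compare equal."""
    return re.sub(r"\s+", " ", sentence).strip().lower()


def dedup_sentences(stories, query_matches):
    """Steps (b) and (c) over an ordered list of stories.

    stories: list of dicts with at least a "text" key (hypothetical shape).
    query_matches: callable(text) -> bool, standing in for re-running the query.
    """
    seen = set()  # normalized sentences already emitted by an earlier story
    kept = []
    for story in stories:
        unique = []
        for sentence in split_sentences(story["text"]):
            key = normalize(sentence)
            if key in seen:
                continue  # step (b): drop every appearance after the first
            seen.add(key)
            unique.append(sentence)
        text = " ".join(unique)
        # Step (c): keep the story only if the deduplicated text still matches.
        if query_matches(text):
            kept.append({**story, "text": text})
    return kept


# Example: story 2 only matched because of a headline repeated at its tail.
stories = [
    {"id": 1, "text": "Mayor wins election. Voters turned out in record numbers."},
    {"id": 2, "text": "Local bakery opens downtown. Mayor wins election."},
    {"id": 3, "text": "Mayor wins election."},
]
result = dedup_sentences(stories, lambda t: "election" in t.lower())
```

In this example only story 1 survives: story 2 loses its trailing headline and no longer matches, and story 3 becomes empty. Open design questions this sketch glosses over include the dedup scope, how sentences are normalized, and whether the `seen` set can be held in memory for a large corpus.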