Closed pannous closed 13 years ago
Do you have an example? Do you mean exact duplicates or near duplicates?
I thought I had reduced the number of duplicates with the spam filter ...
for reference only: the duplicate filter could be done at query time (results grouping) or at index time (like our spam filter works)
we need two new fields: duplicate_hash_s and duplicates_i
we check the duplicate_hash_s before indexing and set duplicates_i accordingly. when querying we can decrease the duplicate count filter for more aggressive duplicate removal.
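A rough sketch of how the index-time check could look, for illustration only. It uses Python and an in-memory list instead of the real index. `TinyIndex`, `duplicate_hash` and `max_duplicates` are made-up names, and the hash here is just a normalized MD5 placeholder, not the real one. It also assumes that duplicates_i counts how many tweets with the same hash were indexed earlier.

```python
import hashlib
from collections import defaultdict


def duplicate_hash(text):
    # Placeholder hash (assumption): lowercase, collapse whitespace, MD5.
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


class TinyIndex:
    """In-memory stand-in for the search index (hypothetical)."""

    def __init__(self):
        self.docs = []
        self.seen = defaultdict(int)  # hash -> number of docs indexed so far

    def add(self, text):
        h = duplicate_hash(text)
        doc = {
            "text": text,
            "duplicate_hash_s": h,
            # Number of earlier docs that share this hash.
            "duplicates_i": self.seen[h],
        }
        self.seen[h] += 1
        self.docs.append(doc)
        return doc

    def search(self, max_duplicates):
        # A lower max_duplicates means more aggressive duplicate removal.
        return [d for d in self.docs if d["duplicates_i"] <= max_duplicates]
```

With `max_duplicates=0` only the first tweet per hash survives; raising it lets more copies through.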
the hash can be calculated as stated in the TermCreateCommand:
using a technique from TextProfileSignature:
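For readers who don't know TextProfileSignature: below is a simplified Python sketch loosely modelled on the idea (token frequency profile, quantized, then hashed). It is not the Solr code and not the code from TermCreateCommand. The function name, the defaults `min_token_len=2` and `quant_rate=0.01`, and the tie-breaking by token are assumptions made for this example.

```python
import hashlib
import re
from collections import Counter


def profile_signature(text, min_token_len=2, quant_rate=0.01):
    # Lowercase alphanumeric tokens, dropping very short ones.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()

    # Quantization step derived from the most frequent token.
    max_freq = max(counts.values())
    quant = round(max_freq * quant_rate)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1

    # Round frequencies down to a multiple of quant; drop rare tokens.
    profile = []
    for token, freq in counts.items():
        quantized = (freq // quant) * quant
        if quantized >= quant:
            profile.append((token, quantized))

    # Sort by descending frequency, then by token, so word order is ignored.
    profile.sort(key=lambda item: (-item[1], item[0]))
    payload = "\n".join(f"{tok} {freq}" for tok, freq in profile)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```

Because word order, case and punctuation are ignored, small edits often produce the same signature. On texts as short as tweets this can be quite coarse, so the parameters would need tuning.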
great idea. when viewing a specific user's tweets, we could disable the duplicate filter. (otherwise we could just mark them grey / collapse them or hide them)
Just implemented and deployed this idea. Please test the "Duplicates without" link.
PS: tweets are not reindexed, so it will take a week until all tweets have this possibility
ok, works reasonably well after adjusting the Jaccard index. please file a new issue for bugs
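For reference, the Jaccard index of two tweets is the size of the intersection of their term sets divided by the size of the union. A minimal Python sketch follows; the tokenization, the function names and the `threshold=0.7` default are assumptions for illustration, not the values used in the deployed filter.

```python
import re


def jaccard(a, b):
    # Compare the sets of lowercase word tokens of both texts.
    set_a = set(re.findall(r"\w+", a.lower()))
    set_b = set(re.findall(r"\w+", b.lower()))
    if not set_a and not set_b:
        return 1.0  # two empty texts count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def is_near_duplicate(a, b, threshold=0.7):
    # Higher threshold: fewer tweets are treated as duplicates.
    return jaccard(a, b) >= threshold
```

For example, "RT check this out" and "check this out" share 3 of 4 distinct terms, so their index is 0.75 and they count as near duplicates at a threshold of 0.7.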
Duplicate Text filter would be great^100
otherwise you read the same tweets 100 times :{ (when paginating)