Duplicate Text filter would be great^100

karussell / Jetwick

[not maintained] Custom Twitter Search via ElasticSearch&Wicket


Duplicate Text filter would be great^100 #2

Closed: pannous closed this issue 13 years ago

pannous commented 13 years ago

Duplicate Text filter would be great^100

otherwise you read the same tweets 100 times :{ (when paginating)

karussell commented 13 years ago

Do you have an example? Do you mean exact duplicates or near duplicates?

I thought I had reduced the number of duplicates with the spam filter ...

for reference only: a duplicate filter could be applied at query time (result grouping) or at index time (the way our spam filter works)

karussell commented 13 years ago

we need two new fields: duplicate_hash_s and duplicates_i

we check duplicate_hash_s before indexing and set duplicates_i accordingly. when querying, we can decrease the duplicate count threshold in the filter for more aggressive duplicate removal.
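A minimal sketch of that flow, not the actual Jetwick code. It assumes that duplicates_i holds the number of earlier tweets with the same hash, and it uses an in-memory dict (`seen_counts`) in place of the search index. The helper names `placeholder_hash`, `prepare_for_indexing` and `filter_duplicates` are hypothetical, and the hash is only a stand-in.

```python
import hashlib

# Hypothetical stand-in for the search index: hash -> number of tweets seen so far.
seen_counts = {}

def placeholder_hash(text):
    # Stand-in only: MD5 of the lowercased, whitespace-normalized text.
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def prepare_for_indexing(text):
    """Build the document to index, with both duplicate fields set."""
    h = placeholder_hash(text)
    duplicates = seen_counts.get(h, 0)  # assumption: count of earlier tweets with this hash
    seen_counts[h] = duplicates + 1
    return {"text": text, "duplicate_hash_s": h, "duplicates_i": duplicates}

def filter_duplicates(docs, max_duplicates=0):
    """Query-time filter: keep docs whose duplicates_i is at most max_duplicates."""
    return [d for d in docs if d["duplicates_i"] <= max_duplicates]
```

With this reading, `max_duplicates=0` keeps only the first tweet for each hash, and a higher value lets more repeats through.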

the hash can be calculated as stated in the TermCreateCommand, using a technique from TextProfileSignature:

create a list of tokens and their frequencies, separated by spaces, in order of decreasing frequency (and additionally sorted by token, because I think the frequency is one for nearly all tokens).
This list is then submitted to an MD5 hash calculation.
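The description above could be sketched roughly like this. It is an illustration under assumptions, not the TermCreateCommand or TextProfileSignature code: the tokenizer (a simple `\w+` regex on lowercased text), the `token freq` layout of the list, and the function name `text_signature` are all made up for the example, and any extra normalization the Solr class performs is left out.

```python
import hashlib
import re
from collections import Counter

def text_signature(text):
    """MD5 over a frequency-ordered token profile of the text."""
    # Assumed tokenizer: lowercase, then runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    # Decreasing frequency; ties broken alphabetically so the order is stable
    # even when nearly every token occurs exactly once.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Assumed layout: "token freq token freq ...", separated by spaces.
    profile = " ".join(f"{tok} {freq}" for tok, freq in ordered)
    return hashlib.md5(profile.encode("utf-8")).hexdigest()
```

Because the profile ignores word order, case and punctuation, two tweets that differ only in those respects get the same signature.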

pannous commented 13 years ago

great idea. when viewing a specific user's tweets, we could disable the duplicate filter. (otherwise we could just mark them grey, collapse them or hide them)

karussell commented 13 years ago

http://wiki.apache.org/solr/Deduplication

karussell commented 13 years ago

Just implemented and deployed this idea. Please test the "Duplicates without" link.

PS: tweets are not reindexed, so it will take a week until all tweets have this possibility

karussell commented 13 years ago

ok, works reasonably well after adjusting the Jaccard index threshold. please file a new issue for bugs
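For reference, the Jaccard index of two token sets is the size of their intersection divided by the size of their union. A small sketch, assuming whitespace tokenization; the function names and the threshold of 0.7 are hypothetical, since the thread does not say which value was used or how the comparison is wired in.

```python
def jaccard(a, b):
    """Jaccard index of the lowercased, whitespace-split token sets of a and b."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)

def is_near_duplicate(a, b, threshold=0.7):
    # threshold is an illustrative value, not the one used in Jetwick
    return jaccard(a, b) >= threshold
```

A higher threshold treats fewer tweets as near duplicates, a lower one treats more of them as duplicates.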