Duplicate Filtering - Githubissues

brianherbert commented 12 years ago

Should Swift be filtering duplicate tweets (from different authors)?

If so, it's not working on UTF-8 tweets.

ekala commented 12 years ago

At the moment, the only duplicates that the Twitter app filters out are retweets. The app simply grabs tweets via Twitter's streaming API and submits them for semantic and media extraction; it's a pipeline of sorts.

After processing, the drops get posted to the DB via an API endpoint that resides on the web app. At this point we generate an MD5 hash for the drop. The hash is computed over a concatenation of the following:

ID of the drop's author at source. In the case of tweets, it would be the id tied to your handle/screen name.
Channel associated with the drop, e.g. Twitter, RSS etc.
ID of the drop at source i.e. the id_str entity. See https://dev.twitter.com/docs/tweet-entities for more information
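
As a rough illustration of the hash described above, here is a minimal Python sketch. The function name, the field order and the absence of a separator are assumptions made for the example; the thread does not say how the web app actually concatenates the fields.

```python
import hashlib


def drop_identity_hash(author_id: str, channel: str, drop_id: str) -> str:
    """Hypothetical helper: MD5 over author ID + channel + drop ID."""
    # Concatenate the three fields in the order listed above. The real
    # order and any separator used by the web app are not stated here.
    payload = f"{author_id}{channel}{drop_id}"
    # Encode explicitly as UTF-8 so non-ASCII values hash consistently.
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


# Two drops with the same text but different authors get different hashes,
# because the drop content is not part of the hashed input.
first = drop_identity_hash("1001", "twitter", "555000111")
second = drop_identity_hash("2002", "twitter", "555000222")
print(first == second)
```

Since the content itself is never hashed, identical tweets from different authors would not be caught by this check, which matches the behaviour reported in this issue.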

That said, I think we may have to devise a way for users to define their own duplication filters in addition to simply maintaining a hash of the actual drop content.

Thoughts?

brianherbert commented 12 years ago

I think if the drop contents are exactly the same (maybe different author) then it should be hidden. Maybe there can be something like a spam folder (duplicate folder?) where an admin could bring individual drops back into the fold. I dunno. Anything to keep a screen full of the same content from showing up would be ideal.
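
One way the idea above could look, as a sketch only: hash the drop content (ignoring the author) and move repeats into a separate "duplicates" list that an admin could review. The function names and the `content` dictionary key are hypothetical, and the Unicode normalization step is an assumption aimed at the UTF-8 case mentioned at the top of this issue, not something the project is known to do.

```python
import hashlib
import unicodedata


def content_hash(text: str) -> str:
    """Hypothetical helper: MD5 over the normalized drop content."""
    # NFC normalization makes visually identical strings byte-identical
    # before hashing (e.g. a precomposed "é" vs "e" + combining accent).
    normalized = unicodedata.normalize("NFC", text).strip()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def split_duplicates(drops):
    """Return (visible, duplicates), keeping the first drop per content hash."""
    seen = set()
    visible, duplicates = [], []
    for drop in drops:
        digest = content_hash(drop["content"])
        if digest in seen:
            # Same content already shown; park it for admin review.
            duplicates.append(drop)
        else:
            seen.add(digest)
            visible.append(drop)
    return visible, duplicates


drops = [
    {"author": "a", "content": "Flooding near the caf\u00e9"},
    {"author": "b", "content": "Flooding near the cafe\u0301"},
    {"author": "c", "content": "Road closed downtown"},
]
visible, duplicates = split_duplicates(drops)
print(len(visible), len(duplicates))
```

This only catches exact matches after normalization; near-duplicates (added hashtags, shortened links) would still slip through.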

69mb commented 12 years ago

Not a core issue.

It would be nice to group similar drops, but we cannot support this with the current backend and UI. Perhaps in a future iteration.

ushahidi / SwiftRiver-Core

Duplicate Filtering #15