talkatv / talkatv

An open source commenting system
GNU Affero General Public License v3.0

Spam protection #13

Open joar opened 12 years ago

joar commented 12 years ago

via ActivitySpam/spamicity/Akismet/etc.

It might be a good idea to send any requests to these services asynchronously.
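
A minimal sketch of what "asynchronous" could mean here, assuming the comment is stored immediately and the check runs on a background thread. `check_spam`, `save_comment` and the in-memory `store` are hypothetical names for illustration, not part of talkatv or of any of the services above; the stub does no network I/O.

```python
from concurrent.futures import ThreadPoolExecutor

# One shared pool; the size is an arbitrary choice for this sketch.
_executor = ThreadPoolExecutor(max_workers=4)


def check_spam(text):
    """Hypothetical stand-in for a blocking call to a remote spam service.

    A real implementation would send `text` to the service and parse the
    reply; this stub only looks for a fixed phrase.
    """
    return "buy cheap" in text.lower()


def save_comment(store, comment_id, text):
    """Store the comment right away and schedule the spam check."""
    store[comment_id] = {"text": text, "status": "pending"}

    def worker():
        is_spam = check_spam(text)
        store[comment_id]["status"] = "spam" if is_spam else "ok"

    # submit() returns immediately; the check runs on a pool thread.
    return _executor.submit(worker)
```

The request handler can return as soon as `save_comment` has queued the job; the comment stays "pending" until the worker updates it.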

gllmhyt commented 12 years ago

Some sources, afaik: Spamhaus, DShield, Project Honeypot, BlogSpam.net

melpomene commented 11 years ago

I have forked the project with the intention to add some sort of comment spam filter. https://github.com/melpomene/talkatv

melpomene commented 11 years ago

Defence measures

The primary defence would be a machine learning algorithm (probably an SVM or Bayesian filtering) that looks at the content of the message and, depending on that content, decides whether to flag it as spam. It could also look at the user agent, IP address and other information about the agent.
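
To make the Bayesian option concrete, here is a rough sketch of a word-based naive Bayes filter with add-one smoothing, using only the standard library. It is an illustration under my own assumptions (the class name, the tokenizer and the "spam"/"ham" labels are made up for this example), not the filter from the fork, and it ignores user agent and IP.

```python
import math
import re
from collections import Counter


class NaiveBayesFilter:
    """Tiny multinomial naive Bayes classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Lowercase and split into simple word-like tokens.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        self.word_counts[label].update(self.tokenize(text))
        self.doc_counts[label] += 1

    def _log_score(self, tokens, label):
        counts = self.word_counts[label]
        total = sum(counts.values())
        vocab = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        vocab = max(vocab, 1)  # avoid dividing by zero before any training
        # Smoothed log prior for the class.
        docs = sum(self.doc_counts.values())
        score = math.log((self.doc_counts[label] + 1) / (docs + 2))
        # Add the smoothed log likelihood of each token.
        for token in tokens:
            score += math.log((counts[token] + 1) / (total + vocab))
        return score

    def is_spam(self, text):
        tokens = self.tokenize(text)
        return self._log_score(tokens, "spam") > self._log_score(tokens, "ham")
```

Usage would be along the lines of calling `train(text, "spam")` or `train(text, "ham")` for each labelled comment, then `is_spam(text)` on new ones. How well it works depends entirely on the training data.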

Another warning flag could be that the time between opening the page and posting the comment is far too short, along with other behaviour patterns like this.
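
A sketch of the timing check, assuming the server records when the page was served. The function name and the 5-second threshold are placeholders I picked for illustration; the threshold would need tuning against real traffic.

```python
import time

MIN_SECONDS = 5.0  # assumed threshold, not a measured value


def submitted_too_fast(page_opened_at, posted_at=None, min_seconds=MIN_SECONDS):
    """Return True if the comment arrived suspiciously soon after page load.

    Both timestamps are seconds since the epoch. If `posted_at` is omitted,
    the current time is used.
    """
    if posted_at is None:
        posted_at = time.time()
    return (posted_at - page_opened_at) < min_seconds
```

The page-open timestamp should come from the server (for example in a signed form field), since a value supplied by the client can be forged. The result is better treated as one signal among several than as a verdict on its own.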

The biggest problem right now is getting hold of good data to train the SVM algorithm with. We need a database/list of comments that are each marked as spam or not spam. If you find any good sources for this (I'm sure there are some), please feel free to notify me.

joar commented 11 years ago

Sounds great. If memory serves me right, @evanp (https://github.com/evanp/) implemented a Bayesian spam filter for StatusNet. It might be an idea to check with him for data. He also has an enormous database of entries, both spam and not spam, thanks to identi.ca.

melpomene commented 11 years ago

Great! I'll drop him an email.

evanp commented 11 years ago

Hey, guys. So, I've been working on a project in the space, https://github.com/evanp/activityspam . There's a live site at https://spamicity.info/ with a pretty simple API https://spamicity.info/api . The spamicity.info site was trained with about 10,000 accounts from identi.ca -- known spammers and known-good accounts. It's since been trained with about 50,000 posts on identi.ca (via the API), which has made it pretty reliable.

I'd really recommend using spamicity.info if you're looking for a spam filter. I'd much rather see the existing code improved, if you have time to work on it.

melpomene commented 11 years ago

The project should definitely try it out! I have already written the basic Bayes filter, and since I am writing it just for fun I will keep tinkering with it either way.

The implementation I am writing now is in Python, so it could easily run alongside the server. This would allow the data to stay on the "comment" server, and it would allow each server's spam filter to be tweaked individually. It might also be that there is a difference in the nature of the spam on blogs and on identi.ca.

Either way, I am lacking data right now. I am thinking about writing a crawler and gathering some comments which I could then test against spamicity. This could be used as reference material. I am also experimenting with using the spam filter to stop threats and abusive comments (in Swedish: http://spamsamlaren.kejsarmakten.se/). Not sure if it will ever be useful, but it is already showing some interesting results.