mysociety / alaveteli

Provide a Freedom of Information request system for your jurisdiction
https://alaveteli.org

Clean out spam user profiles #3180

Closed crowbot closed 2 years ago

crowbot commented 8 years ago

Depends on #3146, #3147, or some other way of stopping the inflow of new spam; otherwise this becomes a perpetual maintenance task.

lizconlan commented 8 years ago

Hmm, it's not simple to programmatically sift spam from non-spam with certainty. Would it be worth trying to assign some kind of score? I've found a few things which seem to pretty much guarantee a spammer at the moment, but they cover quite a small percentage.

I would also suggest some obfuscation of our exact rules and methods to avoid accidentally making a list of quirks for future spammers to avoid.
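To make the scoring idea concrete, here's a minimal sketch of a weighted heuristic score. The signal names and weights are purely illustrative assumptions, not the real rules (which, per the point above, shouldn't be published):

```ruby
# Illustrative weighted spam score; the signals and weights are made up
# for this sketch and are not the heuristics actually in use.
SPAM_SIGNALS = [
  [->(u) { u[:about_me].to_s.match?(%r{https?://}i) }, 3], # link in the profile
  [->(u) { u[:email].to_s.end_with?('.ru') },          1], # pattern from the example below
  [->(u) { u[:name].to_s.match?(/\d{2,}\z/) },         1]  # trailing digit run in the name
].freeze

def spam_score(user)
  SPAM_SIGNALS.sum { |check, weight| check.call(user) ? weight : 0 }
end

user = { name: 'MelindaGates11',
         email: 'MelindaGates11@mail.ru',
         about_me: 'http://XN--12C1CC9AL4E0G7A.COM' }
spam_score(user)
# => 5
```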

garethrees commented 8 years ago

I've found a few things which seem to pretty much guarantee a spammer

Does it look like https://github.com/mysociety/alaveteli/issues/3146 slowed the inflow of these obvious spammers?

I would also suggest some obfuscation of our exact rules and methods

Send an email to the group to start the discussion behind a login.

Would it be worth trying to assign some kind of score?

I'd be inclined to use something like rakismet. It looks easy enough to hack it to use independently (rather than via an include). In any case, we'd want some interactivity in the rake task before we actually ban/delete a user, so for now I think using your initial heuristics and then a confirmation step would be a good start.
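A minimal sketch of what that interactive confirmation step could look like, assuming a hypothetical `suspicious?` heuristic and plain hashes standing in for User records:

```ruby
# Flag candidates with a cheap heuristic, then confirm each ban on the
# console before acting. `suspicious?` and the hash-based "users" are
# illustrative stand-ins, not Alaveteli's real models.
def suspicious?(user)
  user[:about_me].to_s.match?(%r{https?://}i)
end

def review(users, input: $stdin, output: $stdout)
  users.select { |u| suspicious?(u) }.each_with_object([]) do |user, banned|
    output.puts "Ban #{user[:name]}? [y/N]"
    banned << user[:name] if input.gets.to_s.strip.downcase == 'y'
  end
end
```

In the real rake task this would wrap ActiveRecord users and apply the ban, rather than just collecting names.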

garethrees commented 8 years ago

So here's what I was going to do, but Akismet requires an IP address for the user. Sure, we could fake it, but I'd imagine that would skew the results.

Rakismet.key = ENV['AKISMET_API_KEY']
Rakismet.url = 'http://10.10.10.30:3000/'
Rakismet.host = 'rest.akismet.com'
Rakismet.validate_key

data = {
  :comment_author => 'MelindaGates11',
  :comment_author_email => 'MelindaGates11@mail.ru',
  :comment_content => "http://XN--12C1CC9AL4E0G7A.COM\r\n\r\nHow to dislodge the lash The birds live and then need the eyelid area with a minimum thickness of 3 mm, when the birds are the gel fibre Lotus girls legs similar to a long cable to balance it on the fly. This is where the innovation is to import another type of characteristics of the domain renfin is made from stainless steel",
  :comment_date_gmt => '2016-05-18T13:10:28+01:00',
  :blog_lang => 'en,cy',
  :blog_charset => 'UTF-8'
}

Rakismet.akismet_call('comment-check', data)
# => "Missing required field: user_ip."

data[:user_ip] = '127.0.0.1'
Rakismet.akismet_call('comment-check', data)
# => "false"

# Googled "Russian IP" and copied the first thing I saw
data[:user_ip] = '95.108.142.138'
Rakismet.akismet_call('comment-check', data)
# => "false"

# Using their test comment author made the call return "true" (i.e. spam)
data[:comment_author] = 'viagra-test-123'
Rakismet.akismet_call('comment-check', data)
# => "true"
lizconlan commented 8 years ago

That seems to imply that it's putting more weight on :comment_author than is helpful :cry:

does anything interesting happen if we pull out the included link as :comment_author_url?

lizconlan commented 8 years ago

does anything interesting happen if we pull out the included link as :comment_author_url?

A quick test suggests that no, it doesn't change anything :(

lizconlan commented 8 years ago

Aha...

Rakismet.key = ENV['AKISMET_API_KEY']
Rakismet.url = 'http://10.10.10.30:3000/'
Rakismet.host = 'rest.akismet.com'
Rakismet.validate_key

data = {
  :comment_author => 'MelindaGates11',
  :comment_author_email => 'MelindaGates11@mail.ru',
  :comment_content => "http://XN--12C1CC9AL4E0G7A.COM\r\n\r\nHow to dislodge the lash The birds live and then need the eyelid area with a minimum thickness of 3 mm, when the birds are the gel fibre Lotus girls legs similar to a long cable to balance it on the fly. This is where the innovation is to import another type of characteristics of the domain renfin is made from stainless steel",
  :comment_date_gmt => '2016-05-18T13:10:28+01:00',
  :blog_lang => 'en,cy',
  :blog_charset => 'UTF-8',
  :user_ip => '127.0.0.1'
}

data[:comment_author_url] = "http://XN--12C1CC9AL4E0G7A.COM"
data[:comment_type] = "trackback"

Rakismet.akismet_call('comment-check', data)
# => "true"

Edit: this works without :comment_content set (so we can pass less data around), but it may then return true all the time instead.

garethrees commented 8 years ago

👍 Nice. Maybe it's just that particular account. Shouldn't be too bad to parse out the URLs.
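For reference, one quick way to pull the URLs out of the about_me text before passing one as :comment_author_url; the regexp here is a deliberately simple assumption:

```ruby
# Extract anything that looks like an http(s) URL from profile text.
about_me = "http://XN--12C1CC9AL4E0G7A.COM\r\n\r\nHow to dislodge the lash..."
urls = about_me.scan(%r{https?://\S+})
# => ["http://XN--12C1CC9AL4E0G7A.COM"]
```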

garethrees commented 8 years ago

Looks like this is quite a recent problem:

SELECT date_trunc('month', users_with_counter_cache.created_at) AS signup_month, COUNT(*)
FROM (SELECT *, (SELECT COUNT(*) FROM info_requests WHERE user_id = users.id) AS info_requests_count
      FROM users) AS users_with_counter_cache
WHERE users_with_counter_cache.about_me LIKE '%http%' AND ban_text = ''
GROUP BY signup_month
ORDER BY signup_month ASC;
signup_month count
2007-12-01 00:00:00 1
2008-01-01 00:00:00 1
2008-02-01 00:00:00 1
2008-03-01 00:00:00 2
2008-05-01 00:00:00 3
2008-06-01 00:00:00 1
2008-07-01 00:00:00 2
2008-08-01 00:00:00 2
2008-09-01 00:00:00 3
2008-10-01 00:00:00 2
2008-11-01 00:00:00 2
2008-12-01 00:00:00 1
2009-01-01 00:00:00 1
2009-04-01 00:00:00 3
2009-05-01 00:00:00 2
2009-06-01 00:00:00 4
2009-07-01 00:00:00 2
2009-08-01 00:00:00 1
2009-10-01 00:00:00 1
2009-11-01 00:00:00 1
2010-02-01 00:00:00 4
2010-04-01 00:00:00 1
2010-05-01 00:00:00 2
2010-06-01 00:00:00 3
2010-07-01 00:00:00 4
2010-08-01 00:00:00 3
2010-09-01 00:00:00 3
2010-10-01 00:00:00 4
2010-11-01 00:00:00 4
2010-12-01 00:00:00 4
2011-01-01 00:00:00 3
2011-02-01 00:00:00 2
2011-03-01 00:00:00 1
2011-04-01 00:00:00 2
2011-05-01 00:00:00 1
2011-06-01 00:00:00 3
2011-07-01 00:00:00 1
2011-08-01 00:00:00 4
2011-09-01 00:00:00 4
2011-10-01 00:00:00 1
2011-11-01 00:00:00 3
2011-12-01 00:00:00 3
2012-01-01 00:00:00 2
2012-02-01 00:00:00 4
2012-03-01 00:00:00 1
2012-04-01 00:00:00 6
2012-05-01 00:00:00 2
2012-06-01 00:00:00 6
2012-07-01 00:00:00 79
2012-08-01 00:00:00 8
2012-09-01 00:00:00 20
2012-10-01 00:00:00 21
2012-11-01 00:00:00 88
2012-12-01 00:00:00 19
2013-01-01 00:00:00 83
2013-02-01 00:00:00 23
2013-03-01 00:00:00 15
2013-04-01 00:00:00 14
2013-05-01 00:00:00 20
2013-06-01 00:00:00 25
2013-07-01 00:00:00 16
2013-08-01 00:00:00 20
2013-09-01 00:00:00 18
2013-10-01 00:00:00 15
2013-11-01 00:00:00 13
2013-12-01 00:00:00 2
2014-01-01 00:00:00 12
2014-02-01 00:00:00 19
2014-03-01 00:00:00 40
2014-04-01 00:00:00 59
2014-05-01 00:00:00 32
2014-06-01 00:00:00 219
2014-07-01 00:00:00 63
2014-08-01 00:00:00 56
2014-09-01 00:00:00 128
2014-10-01 00:00:00 77
2014-11-01 00:00:00 72
2014-12-01 00:00:00 69
2015-01-01 00:00:00 73
2015-02-01 00:00:00 225
2015-03-01 00:00:00 81
2015-04-01 00:00:00 63
2015-05-01 00:00:00 340
2015-06-01 00:00:00 255
2015-07-01 00:00:00 504
2015-08-01 00:00:00 745
2015-09-01 00:00:00 622
2015-10-01 00:00:00 557
2015-11-01 00:00:00 609
2015-12-01 00:00:00 713
2016-01-01 00:00:00 821
2016-02-01 00:00:00 967
2016-03-01 00:00:00 957
2016-04-01 00:00:00 778
2016-05-01 00:00:00 519
garethrees commented 8 years ago

http://snook.ca/archives/other/effective_blog_comment_spam_blocker

garethrees commented 8 years ago

Spam words:

garethrees commented 8 years ago

Deleted around 1,500 in an hour; there are about 9,000 left, so around 6.5 hours of work remaining to delete them all.

garethrees commented 8 years ago

Figure out what spam score constitutes an automatic ban

garethrees commented 8 years ago

I've banned everyone with 5 or above. That leaves just over 1000 users left to check.

# => {0=>38, 4=>1037}

Some users with a score of 4 were genuine, so it's best to do these by hand (bundle exec rake cleanup:spam_users). We need to merge and deploy #3308 before we do this, to get the extra tweaks.
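The policy above amounts to a simple triage on the score hash; a sketch using the thresholds described (the method name and constants are hypothetical):

```ruby
# Scores of 5+ are banned automatically; 4 goes to a manual queue;
# anything lower is left alone. Mirrors the { user_id => score } shape.
AUTO_BAN_THRESHOLD  = 5
MANUAL_REVIEW_SCORE = 4

def triage(scores)
  scores.group_by do |_id, score|
    if    score >= AUTO_BAN_THRESHOLD  then :ban
    elsif score >= MANUAL_REVIEW_SCORE then :review
    else  :ok
    end
  end.transform_values { |pairs| pairs.map(&:first) }
end

triage(101 => 8, 102 => 4, 103 => 0)
# => {:ban=>[101], :review=>[102], :ok=>[103]}
```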

garethrees commented 8 years ago

For reference, here's what I was running in the console to give me some additional insight over the rake task:

spam_scorer = UserSpamScorer.new

results = {}
User.includes(:info_requests).
  where("info_requests.user_id IS NULL AND about_me LIKE '%http%' AND ban_text = '' AND confirmed_not_spam = False").
  order("users.created_at DESC").
  find_each do |user|
    results[user.id] = spam_scorer.score(user)
  end

# Create a hash like { spam_score => count_of_user_ids }
counts = {}

results.values.each do |score|
  counts[score] = 0
end

results.each do |id,score|
  counts[score] += 1
end

# Create a hash like { spam_score => [user_ids] }
grouped = {}

results.values.each do |score|
  grouped[score] = []
end

results.each do |id, score|
  grouped[score] << id
end

# Manually set key to the spam score, e.g:
# key = 8

# Ban all users with a spam score of 8
grouped[key].each do |user_id|
  User.find(user_id).update_attributes!(:ban_text => 'Banned for spamming')
end
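As a side note, the counts/grouped bookkeeping in that console snippet can be done in a single pass with default-valued hashes, with no pre-seeding loops; a small sketch with illustrative data:

```ruby
results = { 1 => 0, 2 => 4, 3 => 4, 4 => 8 } # { user_id => score }, illustrative

counts  = Hash.new(0)                   # missing keys count from zero
grouped = Hash.new { |h, k| h[k] = [] } # missing keys start as empty arrays

results.each do |id, score|
  counts[score] += 1
  grouped[score] << id
end

counts  # => {0=>1, 4=>2, 8=>1}
grouped # => {0=>[1], 4=>[2, 3], 8=>[4]}
```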
crowbot commented 8 years ago

Have checked the remaining users. We'll discuss at the next sprint planning meeting how we want to apply checks proactively to new user accounts.

crowbot commented 8 years ago

Will need a final revisit once we have proactive spamminess checks (https://github.com/mysociety/alaveteli/issues/3301) in place.

garethrees commented 8 years ago

Cleaned up ~600 spam accounts. We should revisit this in 2 weeks, then 3-6 months to check that the prevention techniques are working.

RichardTaylor commented 5 years ago

There are currently lots of spam user profiles, e.g. search users for "www" to find those with links in their profiles; all of the newer ones I've seen are spammy. They are still being created.

garethrees commented 3 years ago

This is starting to negatively affect us. We've been added to an adblocker list (no reason given, but spam user profiles are my suspicion). I've made a pull request to hopefully get us removed. I don't want to link to the PR from GitHub because I don't really want to draw attention to this issue through its auto-referencing.

garethrees commented 3 years ago

This is starting to negatively affect us

Another issue is that our SEO is probably being negatively impacted. Here's a story of a targeted campaign. Our issue doesn't seem as targeted, of course, but it wouldn't surprise me if that's part of the reason these spam networks exist. https://github.com/mysociety/alaveteli/issues/5461 would be good for this, so that spam doesn't get indexed externally, though that obviously requires us to actually identify the spam accounts (https://github.com/mysociety/alaveteli/issues/4628, https://github.com/mysociety/alaveteli/issues/5599, https://github.com/mysociety/alaveteli/issues/5304).

schlos commented 3 years ago

Hi Gareth, please correct me if I misunderstood, but looking at the root cause:

- bots are creating spam accounts because they could use publicly visible profile data for their advert or spreading their message

Is that root cause assumption correct? If yes, could we deter bots from creating spam accounts not by actively fighting them, but rather by removing the factor(s) they are using on a website for their own gain:

Hopefully, when bot authors see that they get no results from creating spam accounts on Alaveteli sites, they will stop doing that (or at least reduce their activities).

garethrees commented 3 years ago

bots are creating spam accounts because they could use publicly visible profile data for their advert or spreading their message

Pretty much.

remove factor(s) they are using on a website for their own gain

We ideally want users to use their profile pages to help build more of a community within the site, but it's getting to the point where allowing public "about me" text isn't worth the effort. I guess we could display profiles only to logged-in users.
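That visibility rule could be as small as a single predicate; a sketch in plain Ruby (in the Rails app this would be enforced in the profile controller; `confirmed_not_spam` is the existing flag, everything else is hypothetical):

```ruby
# Anonymous visitors would only see profiles of users already vetted as
# not-spam; any signed-in viewer sees everything. Hypothetical sketch.
def profile_visible?(viewer:, owner:)
  return true if viewer        # signed-in users can browse profiles
  !!owner[:confirmed_not_spam] # anonymous: only vetted accounts
end

profile_visible?(viewer: nil, owner: { confirmed_not_spam: false })      # => false
profile_visible?(viewer: { id: 1 }, owner: { confirmed_not_spam: false }) # => true
```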

garethrees commented 2 years ago

This is an ongoing issue rather than a one-time task, so I'm closing it in favour of issues that stop the problem or make it easier to handle the inevitable routine checking.