publiclab / plots2

a collaborative knowledge-exchange platform in Rails; we welcome first-time contributors! :balloon:
https://publiclab.org
GNU General Public License v3.0
961 stars 1.83k forks source link

spam tracker 3000 #4

Closed btbonval closed 6 years ago

btbonval commented 11 years ago

In order to determine how much spam is being received at any time, and how much is being handled by the community (admin/moderators), we should have some way to track historical spam even after it is deleted.

This might be as simple as something like a table that has a date that a node id and revision id were deleted, and possibly cache some information such as when the spam was created, who created it, and the last IP address of the creator. Usually this would be done relationally, but since the user, node, and revision might be deleted, this table should store copies of information and not pointers.

btbonval commented 11 years ago

This will also be used to calculate the efficacy of spam fighting tools, such as this proposed tool: https://github.com/jywarren/plots2/issues/195

btbonval commented 11 years ago

Quick and dirty way to check how many spam users are generated, it is safe to assume that PL does not legitimately generate over a thousand legit users a month.

mysql> SELECT MONTH(created_at) AS day, count(id) AS new_users FROM rusers GROUP BY MONTH(created_at) ORDER BY created_at;
+------+-----------+
| day  | new_users |
+------+-----------+
|    3 |         1 |
|    4 |        47 |
|    5 |        51 |
|    6 |       122 |
|    7 |       408 |
|    8 |     54602 |
|    9 |     63303 |
|   10 |     55204 |
|   11 |      9349 |
+------+-----------+
btbonval commented 11 years ago

Surprisingly the spam users appear to connect more than once. Less than 22% of users logged in only 1 time (or less), whereas I suspect spam users represent 80% or more of the legit users. 70% of users logged in only once or twice.

mysql> SELECT count(login_count) FROM rusers WHERE login_count < 2;
+--------------------+
| count(login_count) |
+--------------------+
|              41590 |
+--------------------+
1 row in set (0.10 sec)

mysql> SELECT count(login_count) FROM rusers WHERE login_count < 3;
+--------------------+
| count(login_count) |
+--------------------+
|             127715 |
+--------------------+
1 row in set (0.09 sec)

mysql> SELECT count(id) FROM rusers;
+-----------+
| count(id) |
+-----------+
|    183109 |
+-----------+
1 row in set (0.09 sec)
btbonval commented 11 years ago

Given my assumption that real users are now an outlier, an arithmetic average should tell us something about spammer behavior. The average PublicLab.org user logs in 2.3 times.

mysql> SELECT avg(login_count) FROM rusers;
+------------------+
| avg(login_count) |
+------------------+
|           2.3742 |
+------------------+
btbonval commented 11 years ago

Some spam usernames must get recycled or something. The list of people who have connected more than 50 times is mostly spam users! Out of curiosity, I checked to see how many are not yet banned. Only 1 was banned.

mysql> SELECT username FROM rusers WHERE login_count > 50;
+---------------+
| username      |
+---------------+
| warren        |
| liz           |
| btbonval      |
| donblair      |
| Benediktagwd  |
| felony61screw |
| dramawalk3    |
| mallet6harp   |
| coinarrow32   |
| mesyjxsoht    |
| quillcry8     |
| cnuberft      |
| drainwrist21  |
| adubbelve     |
| degree7cave   |
| candlebarge62 |
| packetwine17  |
| debtepoxy9    |
| iron0liquor   |
+---------------+
19 rows in set (0.10 sec)

mysql> SELECT username FROM rusers, users WHERE username = name AND login_count > 50 AND status != 0;
+---------------+
| username      |
+---------------+
| warren        |
| liz           |
| btbonval      |
| donblair      |
| Benediktagwd  |
| felony61screw |
| dramawalk3    |
| mallet6harp   |
| coinarrow32   |
| mesyjxsoht    |
| quillcry8     |
| cnuberft      |
| drainwrist21  |
| degree7cave   |
| candlebarge62 |
| packetwine17  |
| debtepoxy9    |
| iron0liquor   |
+---------------+
18 rows in set (0.11 sec)
btbonval commented 11 years ago

Fun way to grab a bunch of spam-like usernames. Usually the format is {noun}{noun}{number} or {noun}{number}{noun}. Conveniently, few if any legit users have numbers in their usernames.

I dropped some banhammer on 50+ users (after reading through for any legit ones) using this WHERE clause:

SELECT users.uid, username FROM rusers, users WHERE username = name AND login_count > 25 AND status != 0 AND name RLIKE '[^0-9]+[0-9]{1,2}[^0-9]*';

login_count > 25 suggests that the spam name is reused a lot. It also brought the list of names down to something manageable to read and verify.

Maybe I banned some valid users? They must have at least one non-numeric character, then precisely one or two numerals, and then maybe some more non-numeric characters afterwards. Unsure why these usernames are so formulaic, but it helps.

btbonval commented 10 years ago

pushed spam prevention to master. did not yet deploy. current spam stats:

mysql> SELECT MONTH(created_at) AS day, count(id) AS new_users FROM rusers GROUP BY MONTH(created_at) ORDER BY created_at;
+------+-----------+
| day  | new_users |
+------+-----------+
|    3 |         1 |
|    4 |        47 |
|    5 |        51 |
|    6 |       122 |
|    7 |       408 |
|    8 |     54602 |
|    9 |     63303 |
|   10 |     55204 |
|   11 |     50884 |
|   12 |     25567 |
+------+-----------+
btbonval commented 10 years ago

spam prevention is deployed.

halfway through the month, we have about 25,000 new users already. Given the usual ~50,000 per month, if this anti-spam measure isn't effective, we'll see 50,000 at the end of the month as per usual.

If this measure is effective, I want to see at least an order magnitude drop in new users. Instead of 50,000, I want to see 5,000. Since this is a half month, I want no more than 2,500 new users for the rest of the month (added to the 25,000 already created).

So if its effective, we'll have less than 28,000 new users in December. A little higher is worth keeping for another full month, a lot higher means it doesn't really help and should be removed.

btbonval commented 10 years ago

Figured out how to check new users in the old users table:

mysql> SELECT YEAR(FROM_UNIXTIME(created)) AS year, MONTH(FROM_UNIXTIME(created)) AS month, COUNT(uid) AS users FROM users GROUP BY year, month ORDER BY year, month;
+------+-------+-------+
| year | month | users |
+------+-------+-------+
| 1970 |     1 |     1 |
| 2010 |    11 |     2 |
| 2010 |    12 |    12 |
| 2011 |     1 |    18 |
| 2011 |     2 |    49 |
| 2011 |     3 |    25 |
| 2011 |     4 |    15 |
| 2011 |     5 |    16 |
| 2011 |     6 |    27 |
| 2011 |     7 |    31 |
| 2011 |     8 |    65 |
| 2011 |     9 |    22 |
| 2011 |    10 |    37 |
| 2011 |    11 |    44 |
| 2011 |    12 |    33 |
| 2012 |     1 |    67 |
| 2012 |     2 |   103 |
| 2012 |     3 |   179 |
| 2012 |     4 |   250 |
| 2012 |     5 |   241 |
| 2012 |     6 | 19239 |
| 2012 |     7 | 19559 |
| 2012 |     8 |  2238 |
| 2012 |     9 |  1898 |
| 2012 |    10 |   950 |
| 2012 |    11 |  1285 |
| 2012 |    12 |   841 |
| 2013 |     1 |  2174 |
| 2013 |     2 |  2457 |
| 2013 |     3 |  3863 |
| 2013 |     4 |  4881 |
| 2013 |     5 | 36550 |
| 2013 |     6 | 51555 |
| 2013 |     7 | 21635 |
| 2013 |     8 | 54581 |
| 2013 |     9 | 63290 |
| 2013 |    10 | 55195 |
| 2013 |    11 | 50860 |
| 2013 |    12 | 25564 |
+------+-------+-------+
39 rows in set (0.79 sec)

As long as the users count is the same as the rusers count for any given month, the old site is not creating users. There is a little bit of strangeness though: Starting in about August it looks like the numbers track each other pretty closely, but the users table is always shy of the rusers table (looks like its off by no more than 20 users).

This is odd because new users are created in both rusers and users. The users table shouldn't be falling behind. Oh well, different problem and not worth investigating for now.

btbonval commented 10 years ago

Current status after adding some spam intervention:

+------+-----------+
| day  | new_users |
+------+-----------+
|    3 |         1 |
|    4 |        47 |
|    5 |        51 |
|    6 |       122 |
|    7 |       408 |
|    8 |     54602 |
|    9 |     63303 |
|   10 |     55204 |
|   11 |     50884 |
|   12 |     25801 |
|    1 |       365 |
+------+-----------+

Halfway through December we had 25564, and by the end we had only 25801 (237 users added). January is about done, and we've had 365 users.

Soooooo that's WAY less than 50,000! Two orders of magnitude! We've had two or three emails asking about the signup page, as a result Jeff and I have enhanced the anti-spam bits to be a little easier to understand.

jywarren commented 10 years ago

Yay!!!!!!!!!

btbonval commented 10 years ago

Into the wee early days of March, we can see Jan and Feb are fairly consistent at ~500 new users per month.

...
| 2013 |    10 | 55192 |
| 2013 |    11 | 50851 |
| 2013 |    12 | 25789 |
| 2014 |     1 |   439 |
| 2014 |     2 |   596 |
| 2014 |     3 |    42 |
+------+-------+-------+
btbonval commented 10 years ago

web@ has been receiving emails from various companies complaining of being link poisoned and having their SEO damaged by our site. The nice thing is that each company which emails us provides a very solid lead for finding and removing spam users.

Today an email came in identifying 10 user profiles with spam links. Using the search term through profiles, I found 18 users which linked that that URL or a similar Facebook URL. All 18 were obviously spam users. I deleted the profiles, then I deleted the users in both user tables.

I need to find a way to properly delete users so that all tables created for a user are removed. Usually this is taken care of in relational databases with foreign keys and cascades, but we aren't so lucky with either the Drupal drippings or the current Rails code... or maybe it is the MySQL tables themselves ignoring the foreign keys. Regardless, I have no doubt there are a number of tables which need to be scrubbed of content from users which no longer exist.

btbonval commented 10 years ago

Let me clarify: properly batch delete users. I'm not about to go through 18 users, profile by profile, to delete them.

jywarren commented 10 years ago

Hi bryan, rails (active record) keeps track of dependent=>destroy relationships, and is supposed to maintain those as long as you do deletions through the rails console. So if you type

rails console

Then

u = DrupalUser.find blablabla
u.each do |user|
  user.delete
end

All dependent records will also be destroyed.

btbonval commented 10 years ago

Thanks for the example code, Jeff. I'll have to figure out how to run a rails console in the context of our publiclab.org database. I'm guessing I can just switch to your user and run it in the home directory. Do I need to call rvm or is that not a thing on plots2 system?

jywarren commented 10 years ago

you should be able to run the shell command "rails console" as any user, from the root directory of the project. That'll dump you into the console.

btbonval commented 10 years ago

Haven't run this in awhile. Looks like we're fluctuating around 1000 new users a month on average now.

...
| 2014 |     1 |   439 |
| 2014 |     2 |   596 |
| 2014 |     3 |   829 |
| 2014 |     4 |  1057 |
| 2014 |     5 |  1367 |
| 2014 |     6 |   910 |
| 2014 |     7 |   884 |
| 2014 |     8 |  1489 |
| 2014 |     9 |  1505 |
| 2014 |    10 |   871 |
+------+-------+-------+
btbonval commented 10 years ago

Something changes in Feb/March that moved us from ~500 new users per month to ~1000 new users per month. Possibly another bump July to August that moved us from ~1000 new users per month to ~1500 new users per month, but that trend isn't as obvious yet.

btbonval commented 10 years ago

I realized I have not been keeping track of my spam user net outside of the database. Here's the current view to detect unbanned spammers based on key URLs found in their profile:

CREATE OR REPLACE VIEW `spammers` AS select `users`.`uid` from (`profile_values` join `users`) where ((`users`.`uid` = `profile_values`.`uid`) and (`users`.`status` = 1) and ((`profile_values`.`value` like '%online casino%') or (`profile_values`.`value` like '%M88%') or (`profile_values`.`value` like '%webeden%') or (`profile_values`.`value` like '%thyself%') or (`profile_values`.`value` like '%vilago21%') or (`profile_values`.`value` like '%11alive%') or (`profile_values`.`value` like '%insanejournal%') or (`profile_values`.`value` like '%babycenter%') or (`profile_values`.`value` like '%yapperz%') or (`profile_values`.`value` like '%wheretogetengaged%') or (`profile_values`.`value` like '%!! the name is %') or (`profile_values`.`value` like '%jigsy.com%') or (`profile_values`.`value` like '%xanga.com%') or (`profile_values`.`value` like '%kaneva.com%') or (`profile_values`.`value` like '%foodbuzz.com%') or (`profile_values`.`value` like '%diigo.com%') or (`profile_values`.`value` like '%advancedmedicalcertification%') or (`profile_values`.`value` like '%sergeantshredder%') or (`profile_values`.`value` like '%wouldyouratherquestions%') or (`profile_values`.`value` like '%onlinecheck.com%') or (`profile_values`.`value` like '%eroupasdebebe%') or (`profile_values`.`value` like '%ilschool.de%') or (`profile_values`.`value` like '%globalair.com') or (`profile_values`.`value` like '%flightaware.com%') or (`profile_values`.`value` like '%mundocompresores.com.ar%') or (`profile_values`.`value` like '%forrestledbetter.com%') or (`profile_values`.`value` like '%iamsport.org%') or (`profile_values`.`value` like '%inmacbook.net%') or (`profile_values`.`value` like '%achatia.org%') or (`profile_values`.`value` like '%facebook.com.bd%') or (`profile_values`.`value` like '%hijackedradio.com%') or (`profile_values`.`value` like '%franmich.com%') or (`profile_values`.`value` like '%docspal.com%') or (`profile_values`.`value` like '%hotfile.com%') or (`profile_values`.`value` like '%free.yudu.com%') or (`profile_values`.`value` like '%awebcafe.com%') or (`profile_values`.`value` like '%go2album.com%') or  (`profile_values`.`value` like '%marketing%services%') or (`profile_values`.`value` like '%adhocspace.com.sg%') or (`profile_values`.`value` like '%storenvy.com%') or (`profile_values`.`value` like '%tagzania.com%') or (`profile_values`.`value` like '%thisismarilyn.com%') or (`profile_values`.`value` like '%ppjaponesia.org%') or (`profile_values`.`value` like '%photopeach.com%') or (`profile_values`.`value` like '%games.337.com%') or (`profile_values`.`value` like '%ecoawnings.com.au%') or (`profile_values`.`value` like '%infocentre.santabarbara.cat%')));

When the table is non-empty, I will sweep up the mess with this command:

UPDATE users SET status=0 WHERE uid IN (SELECT uid FROM (SELECT * FROM spammers) AS alias);

(nested subqueries required because the view reads the table being updated)

Generally there are no additions to the spammer list unless the spammer list has new URLs added to it.

btbonval commented 10 years ago

I estimate we gained about 225,000 spam user accounts total over the 5 month botnet attack. We have about 250,000 migrated accounts, which would further estimate about 25,000 actual users.

Of the migrated accounts, about 225,000 are not banned (compared to an estimated 25,000 legit users). Lots of work to be done still!

SELECT count(users.uid) FROM users, rusers WHERE users.uid = rusers.id AND users.status = 1;
ebarry commented 10 years ago

Bryan this is an incredible piece of long term work that is useful for directly improving anti-spam on publiclab.org and also at a meta level for understanding stats on our community.

btbonval commented 10 years ago

Thanks @ebarry . It's certainly not a usual "fix it and close it" ticket, more like a weather forecast. That's why I named it like a weather forecast :)

btbonval commented 10 years ago

Given an estimated 7% chance of grabbing a legit user from the pool of unbanned, migrated users with a uniform draw, it should be fun to pull users at random.

This cannot be done in Rails efficiently because the User and DrupalUser are not properly associated to check for both unbanned status and migrated user with a single join. I tried.

How to grab a random migrated, unbanned user profile link from the database:

SELECT CONCAT('http://publiclab.org/profile/', rusers.username) FROM rusers, users WHERE rusers.id = users.uid AND users.status = 1 ORDER BY RAND() LIMIT 1;

Odds are high that the profile returned will be spam!

btbonval commented 10 years ago

Polling random users does not disappoint. 15 out of 15 were spam users. I updated the spam finder view with more key words and banned somewheres around 10,000 users. Statistics in previous comments have been updated.

btbonval commented 9 years ago

The spammers view is built from domains and keywords which are almost certainly coming from spammers which end up in their profiles.

It might be worth looking into external lists of blacklisted spam domains/keywords. There were intended for email, but same principle applies.

btbonval commented 9 years ago

The spammers view is taking a very long time to run. Text searching is fairly slow, and with so many qualifiers to search against, it runs even more slowly.

It might be a good idea to start breaking the query apart using something like a priority queue:

  1. Put recent reports into the highest priority query and ban against that.
  2. After the highest priority comes up with zero, start moving some of those query strings out into a lower priority query.
  3. Run each lower priority queue with less frequency than higher queues. If a domain is found in a lower query, bump it to the top again.

At that point, storing it in the database as a view would be onerous due to updating which domain goes where. Instead, it might be better to run them as queries by external software and automate the process to some comfortable degree.

btbonval commented 9 years ago

SQL query to grab when banned user accounts were most recently updated (by access, login, or creation timestamp), and then count the total such users for each year. Triple nested query (two subqueries) ... not sure how to make that nicer though.

SELECT YEAR(FROM_UNIXTIME(`when`)) AS year, count(uid) FROM (SELECT MAX(`when`) AS `when`, uid FROM (SELECT created AS `when`, uid FROM users WHERE status = 0 UNION SELECT access AS `when`, uid FROM users WHERE status = 0 UNION SELECT login AS `when`, uid FROM users WHERE status = 0) AS useraccess GROUP BY uid) AS lastuseraccess GROUP BY year ORDER BY year;

Presently that looks like this:

+------+------------+
| year | count(uid) |
+------+------------+
| 1970 |          1 |
| 2011 |         45 |
| 2012 |       4003 |
| 2013 |      37698 |
| 2014 |        142 |
+------+------------+
5 rows in set (0.98 sec)

Assuming 2014 is over with (true in a few weeks), then 41746 banned users have not access their accounts in any way in at least one year. That is the summation of 2011, 2012, and 2013 in the table above. That is out of a total of 41,889 banned users. That is out of an estimated 225,000 total spam users.

I think these user accounts are good candidates to be removed from the database. An additional test can be added to see if any valid research notes have been posted, and skip those users. It looks like node.status would be how to tell if a research note is valid.

Current stats for blocked and unblocked notes:

SELECT status, count(*) FROM node GROUP BY status;
+--------+----------+
| status | count(*) |
+--------+----------+
|      0 |     1896 |
|      1 |     3753 |
+--------+----------+
2 rows in set (0.01 sec)

Assuming status means the same thing in node as it does user, then about 1/3 of the notes have been blocked (33.56%)

ryzokuken commented 7 years ago

@yukiisbored What's the status on this?

yukiisbored commented 7 years ago

@ryzokuken ah, I think I added that by mistake.

yukiisbored commented 7 years ago

We can ban known spammer/scammer/proxy IPs, I think there's an REST API for those.

ryzokuken commented 7 years ago

@jywarren Are we planning on working on this?

jywarren commented 6 years ago

I think that with the new OAuth work https://github.com/publiclab/plots2/milestone/18 , there are lots of new spam accounts being made, but none of them are successfully getting spam posts through...

ebarry commented 6 years ago

wow blast from the past!!!! Hi everyone!

--

+1 336-269-1539 / @lizbarry http://twitter.com/lizbarry / lizbarry.net

On Tue, Jul 24, 2018 at 4:10 PM, Jeffrey Warren notifications@github.com wrote:

I think that with the new OAuth work https://github.com/publiclab/ plots2/milestone/18 , there are lots of new spam accounts being made, but none of them are successfully getting spam posts through...

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/publiclab/plots2/issues/4#issuecomment-407535855, or mute the thread https://github.com/notifications/unsubscribe-auth/AAJ2n54qZ9-DBpsYCop2Y9e84UmnXYKsks5uJ39AgaJpZM4BI_HU .

btbonval commented 6 years ago

Hi everyone. I had not seen any of the questions or updates to this ticket until recently. My email seems to be 50/50 spam or archived folder. I see a bunch of questions from over a year ago. For those:

The original intent of this ticket was to have shared notes on the process and queries used to find and eliminate spam user accounts. There was never an end goal for this ticket.

I think it should be fine to close this out. The process and queries will still be available in the ticket after it is closed.

jywarren commented 6 years ago

Hi, @btbonval -- good to hear from you! Miss you!

Thanks, and we have a good overall spam issue now at #974 for what it's worth. I guess I'll close this since there's not a specific to-do or end goal, and we can trace back to it from #974. Cool!

ebarry commented 6 years ago

I will never forget Issue #4 We had a good run over these years <3

Spam Tracker 3000 will live on in our collective meme-ories, and hopefully meme-orialized on publiclab.org/spam

--

+1 336-269-1539 / @lizbarry http://twitter.com/lizbarry / lizbarry.net

On Fri, Aug 3, 2018 at 4:01 PM, Jeffrey Warren notifications@github.com wrote:

Hi, @btbonval https://github.com/btbonval -- good to hear from you! Miss you!

Thanks, and we have a good overall spam issue now at #974 https://github.com/publiclab/plots2/issues/974 for what it's worth. I guess I'll close this since there's not a specific to-do or end goal, and we can trace back to it from #974 https://github.com/publiclab/plots2/issues/974. Cool!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/publiclab/plots2/issues/4#issuecomment-410361251, or mute the thread https://github.com/notifications/unsubscribe-auth/AAJ2n3ryR0bfEXzo7Tj6G_vJNYwCd99jks5uNKwlgaJpZM4BI_HU .