matomo-org / referrer-spam-list

Community-contributed list of referrer spammers. Comment +1 on any issue or pull request and the spammer will be added to the list!
https://matomo.org/blog/2015/05/stopping-referrer-spam/

Peer review process for referral spam hosts #26

Closed kingo55 closed 8 years ago

kingo55 commented 9 years ago

Brilliant idea guys.

What are the requirements for adding a bad referrer to the list? As @mnapoli mentioned in another thread - don't want to make it too broad.

I'm thinking of a process where new referral spammers are added to the list by peer review. Possibly by having other members with significantly large Piwik/Snowplow data sets to vouch for them.

mnapoli commented 9 years ago

I agree with this. We have a fairly good list at this point, we should be careful with new additions. We could decide that every new issue or pull request needs a +1 from another person before being accepted.

Even if that means some additions will have to wait for a few days I think it's fine.

Ping @mattab

mattab commented 9 years ago

Good question. I think we can sometimes merge PRs on first read, when there are only a few domains and the names look spammy, or e.g. when the PR author explains how she found the spammers (e.g. in GA/Piwik reports, 100% bounce rate, display spam, dodgy whois, found on another referrer spam blacklist, etc.)

If we're not sure, it sounds good to ask other users to +1 if they also see this spammer, and merge after a +1 has been commented.

Maybe we leave this issue open for a while and see how this evolves?

calebpaine commented 9 years ago

Also, I'm seeing some pull requests for larger lists; perhaps each PR should be limited to a single domain/URL, so that each one can be individually vetted?

desbma commented 9 years ago

I have noticed spammers usually spam a lot of different domains from the same IPs.

Once an IP has spammed at least one domain in the blacklist, it is easy to find new domains being spammed (by grepping the IP in server logs) and add them to the list, without any risk of false positives.

I have automated the discovery of new domains using this approach; the result is in pull request #87.
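For illustration, the approach can be sketched like this (a simplified hypothetical script, not the actual one behind #87; the IP and Apache combined-log format are taken from the excerpt later in this thread):

```python
import re
from urllib.parse import urlparse

# IPs already caught spamming at least one blacklisted domain (example value).
KNOWN_SPAMMER_IPS = {"178.137.87.228"}

# Apache "combined" log format: ip ident user [time] "request" status size "referer" "agent"
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \d+ "([^"]*)"')

def new_spam_domains(log_lines):
    """Collect referer domains sent by known spammer IPs."""
    domains = set()
    for line in log_lines:
        m = LOG_RE.match(line)
        if not m:
            continue
        ip, referer = m.groups()
        if ip in KNOWN_SPAMMER_IPS and referer not in ("", "-"):
            domains.add(urlparse(referer).hostname)
    return domains

sample = ('178.137.87.228 - - [02/Aug/2015:18:08:15 +0200] '
          '"GET / HTTP/1.1" 200 8534 "http://torrnada.ru/" "Opera/7.54"')
print(new_spam_domains([sample]))  # {'torrnada.ru'}
```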

mnapoli commented 9 years ago

FYI we have been contacted by a webmaster asking for his website to be removed from the list: #90 (see the details in the pull request).

I think this mistake (if it is one) should be one more reason to move to a "peer-review-only" kind of process, i.e. only add sites that have been reported or approved by at least 2 people. We should also document in the README that it's better to add one site per pull request (I'll do it straight away); we should avoid "bulk changes" because they are harder to validate.

Thoughts?

desbma commented 9 years ago

I can only speak for myself, but in recent months I have seen a significant increase in referer spam.

They spam a lot of different domains from dozens of IPs, sometimes without any rate limiting, so I get bursts of dozens of useless requests per second, polluting my analytics and wasting my server resources. And this is on small servers hosting a few low-traffic sites.

As soon as I detect referer spam from an IP, I now automatically block it at the firewall level, yet I still see new domains being spammed from new IPs every day.

Most of these domains are registered for a short period of time, are simple redirects, and the spammers will always register new ones to spam.

I don't use Piwik, but I find this list very useful. However, let's be honest: if you require a separate pull request and a vote for every domain added, this list will not be updated frequently (if at all), and it will become useless within a few months.

mnapoli commented 9 years ago

Up to now, most pull requests (that have been merged) contained only a single domain (because we ourselves add the domains reported in issues). If spam becomes more and more of an issue, there will be more and more people looking for solutions, and thus contributing here.

When we started working on a new solution against referrer spam I suggested the following idea: build a submission system where users can submit new spammers directly from inside their Piwik. These submissions would be sent to a simple app hosted somewhere (e.g. spam.piwik.org). Then it would be easy to see how many users reported each spammer domain, and above a threshold (or manually) we could add the domain to the blacklist.
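Such a submission system could be sketched as follows (purely hypothetical; the endpoint name spam.piwik.org is only the example from the comment, and nothing here is existing Piwik code):

```python
# Hypothetical sketch: count distinct reporters per domain and promote a
# domain to the blacklist once it crosses a threshold.
THRESHOLD = 2  # assumed value; the actual threshold would be a project decision

class SpamReports:
    def __init__(self):
        self.reports = {}  # domain -> set of reporter ids

    def submit(self, domain, user_id):
        """Record that one user reported one domain (deduplicated per user)."""
        self.reports.setdefault(domain, set()).add(user_id)

    def blacklist(self):
        """Domains reported by at least THRESHOLD distinct users."""
        return {d for d, users in self.reports.items() if len(users) >= THRESHOLD}

r = SpamReports()
r.submit("qitt.ru", "user-a")
r.submit("qitt.ru", "user-b")
r.submit("example.org", "user-a")
print(r.blacklist())  # {'qitt.ru'}
```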

desbma commented 9 years ago

I hope this repository will get enough activity to make this list useful, but I fear the spammers will always be faster than you.

Anyway, since it's in the public domain, I will maintain and use my fork, and merge back changes from this list.

desbma commented 9 years ago

@mnapoli FYI qitt.ru is definitely being spammed.

I was able to detect it, and add it to the list again because of the other domains being spammed from the same IP.

See an excerpt of my server logs:

$ zgrep -F 178.137.87.228 /var/log/apache2/*.access.log*
/var/log/apache2/[REMOVED].access.log:178.137.87.228 - - [02/Aug/2015:18:08:15 +0200] "GET / HTTP/1.1" 200 8534 "http://torrnada.ru/" "Opera/7.54 (Windows NT 5.1; U)  [pl]"
/var/log/apache2/[REMOVED].access.log:178.137.87.228 - - [03/Aug/2015:01:15:43 +0200] "GET / HTTP/1.1" 200 8534 "http://msk.afora.ru/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312461)"
/var/log/apache2/[REMOVED].access.log.1:178.137.87.228 - - [30/Jul/2015:14:55:22 +0200] "GET / HTTP/1.1" 200 8534 "http://portal-eu.ru/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461)"
/var/log/apache2/[REMOVED].access.log.1:178.137.87.228 - - [30/Jul/2015:14:55:23 +0200] "GET / HTTP/1.1" 200 8534 "http://portal-eu.ru/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461)"
/var/log/apache2/[REMOVED].access.log.1:178.137.87.228 - - [31/Jul/2015:14:56:15 +0200] "GET / HTTP/1.1" 200 8534 "http://bioca.org/" "Mozilla/3.0 (x86 [en] Windows NT 5.1; Sun)"
/var/log/apache2/[REMOVED].access.log.1:178.137.87.228 - - [31/Jul/2015:22:08:43 +0200] "GET / HTTP/1.1" 200 8534 "https://www.qitt.ru/" "Opera/7.60 (Windows NT 5.2; U)  [en] (IBM EVV/3.0/EAK01AG9/LE)"
/var/log/apache2/[REMOVED].access.log.1:178.137.87.228 - - [31/Jul/2015:22:08:44 +0200] "GET / HTTP/1.1" 200 8534 "https://www.qitt.ru/" "Opera/7.60 (Windows NT 5.2; U)  [en] (IBM EVV/3.0/EAK01AG9/LE)"
/var/log/apache2/[REMOVED].access.log.1:178.137.87.228 - - [31/Jul/2015:22:08:45 +0200] "GET / HTTP/1.1" 200 8534 "https://www.qitt.ru/" "Opera/7.60 (Windows NT 5.2; U)  [en] (IBM EVV/3.0/EAK01AG9/LE)"
/var/log/apache2/[REMOVED].access.log.1:178.137.87.228 - - [01/Aug/2015:05:11:58 +0200] "GET / HTTP/1.1" 200 8534 "http://fitness-video.net/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322; .NET CLR 2.0.40607)"
/var/log/apache2/[REMOVED].access.log.1:178.137.87.228 - - [01/Aug/2015:05:11:58 +0200] "GET / HTTP/1.1" 200 8534 "http://fitness-video.net/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322; .NET CLR 2.0.40607)"
/var/log/apache2/[REMOVED].access.log.1:178.137.87.228 - - [01/Aug/2015:05:11:59 +0200] "GET / HTTP/1.1" 200 8534 "http://fitness-video.net/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322; .NET CLR 2.0.40607)"
/var/log/apache2/[REMOVED].access.log.1:178.137.87.228 - - [01/Aug/2015:07:47:43 +0200] "GET / HTTP/1.1" 200 8534 "http://m1media.net/" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 2000) Opera 6.0 [en]"
/var/log/apache2/[REMOVED].access.log.1:178.137.87.228 - - [01/Aug/2015:18:07:18 +0200] "GET / HTTP/1.1" 200 8534 "http://education-cz.ru/godovoy-podgotovitelnyy-kurs" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Crazy Browser 1.0.5)"
/var/log/apache2/[REMOVED].access.log.1:178.137.87.228 - - [01/Aug/2015:18:07:19 +0200] "GET / HTTP/1.1" 200 8534 "http://education-cz.ru/godovoy-podgotovitelnyy-kurs" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Crazy Browser 1.0.5)"

The webmaster that contacted you is probably contracting a shady SEO company using a botnet to send massive referer spam without his knowledge.

mnapoli commented 9 years ago

The webmaster that contacted you is probably contracting a shady SEO company using a botnet to send massive referer spam without his knowledge.

Could be that indeed. Or it could be that there are multiple websites hosted on the same machine? Or multiple servers behind the same IP?

desbma commented 9 years ago

Or it could be that there are multiple websites hosted on the same machine? Or multiple servers behind the same IP?

Mhh, not sure I understand you. The server's IP and hosting are unrelated to the client that is actually spamming several domains at once.

EDIT: I realize I may not have given you enough context: the log excerpt above is from a server I own, which hosts very small websites, unrelated to the referers you see. This is clearly referer spam: all domains are from Russia, requests are sent at intervals of a few seconds, all from the same IP, with randomized user agents...

mnapoli commented 9 years ago

Sorry, it's late :/ Rephrasing my thoughts better: the spammer tool (whatever its form) could run from a server which has the same IP address as valid websites. For example, it would be very easy to write a referrer spammer script that runs on any shared host. Thus blocking based on the IP address might not always be reliable.

desbma commented 9 years ago

Even if a spammer script is running on a shared host that also hosts some websites, those websites are not supposed to send requests to other websites, are they?

mnapoli commented 9 years ago

They aren't supposed to spam indeed, but my point is that the websites on the shared host are not aware that other users of the server are doing that, and can be blocked as collateral damage (in the case where they send actual referrers to spammed websites). All in all, the IP address isn't 100% reliable. It's the same problem when blocking e.g. gamers online, or when blacklisting IPs from connecting over SSH, etc. People/servers can also be in a sub-network and share the same external IP address (companies, universities, etc.).

desbma commented 9 years ago

What I meant is that even if a good website is behind the same IP as a spammer, and you block that IP on your server to protect yourself from the spam, the good website is unaffected, because it does not send HTTP requests anyway (it only serves them, and not to your server).

By the way "blocking" the IP is the list's user choice, we are only talking about adding domains that are obviously being spammed (qitt.ru & co) to the list here.

mnapoli commented 9 years ago

What I meant is that even if a good website is behind the same IP as a spammer, and you block that IP on your server to protect yourself from the spam, the good website is unaffected, because it does not send HTTP requests anyway (it only serves them, and not to your server).

That's not how it works in Piwik: when receiving data, Piwik will exclude any data where the referrer is blacklisted. So if a good website is in the blacklist, it will be affected because its referrer traffic (traffic going from the good website to other websites tracked with Piwik) will be ignored. It will also affect Piwik users, because valid traffic going through their websites will be ignored by Piwik.
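To make the exclusion behaviour concrete, here is a minimal sketch of that logic (assumed, not Piwik's actual implementation; the subdomain-matching rule is an assumption too):

```python
from urllib.parse import urlparse

# Hypothetical blacklist with a single entry, as in the example below.
BLACKLIST = {"badwebsite.com"}

def is_spam_referrer(referrer_url):
    """True if the referrer host is a blacklisted domain or a subdomain of one."""
    host = urlparse(referrer_url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in BLACKLIST)

def keep_hit(hit):
    """A tracked hit is kept only if its referrer is not blacklisted."""
    return not is_spam_referrer(hit.get("referrer", ""))

print(keep_hit({"referrer": "http://badwebsite.com/page"}))  # False: hit dropped
print(keep_hit({"referrer": "http://goodwebsite.com/"}))     # True: hit kept
```

So if goodwebsite.com were wrongly blacklisted, every real visitor arriving from it would be silently dropped from the stats.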

By the way "blocking" the IP is the list's user choice, we are only talking about adding domains that are obviously being spammed (qitt.ru & co) to the list here.

There is a misunderstanding here, I'm not talking about user blocking an IP. I'm talking about the methodology you suggest to add new spammers to the blacklist. This is how you explained it:

I have noticed spammers usually spam a lot of different domains from the same IPs. Once an IP has spammed at least one domain in the blacklist, it is easy to find new domains being spammed (by grepping the IPs on server logs), and add them to the list, without any risk of false positive.

What I'm saying is that if we add spammers to the blacklist like this, we might blacklist good websites. That would be hurting both good websites and Piwik users.

Example:

We detect badwebsite.com and blacklist it. We see that badwebsite.com comes from IP 1.2.3.4, and we see the referrer goodwebsite.com coming from that IP too. With your idea, we would blacklist goodwebsite.com as well.

Am I understanding it right?

desbma commented 9 years ago

That's not how it works in Piwik: when receiving data, Piwik will exclude any data where the referrer is blacklisted. So if a good website is in the blacklist, it will be affected

We are on the same page on that, we should not add good domains to the blacklist.

because its referrer traffic (traffic going from the good website to other websites tracked with Piwik) will be ignored.

This is where you lost me. Traffic never goes from website to website. An HTTP client sends a request to a website with another website's domain in the referer header. Blocking the client's (the spammer's) IP has absolutely no effect on the flow between the server receiving the spammy requests and the website whose domain appears in the referer header.

Example:

goodwebsite.com has IP 1.2.3.4
badwebsite.com has IP 1.2.3.4 too (shared host or private network, etc.)
goodwebsite.com sends referral traffic to myprettyponey.com
badwebsite.com runs a script that spams myprettyponey.com with false referers (to promote badwebsite.com, or any other website)

We detect badwebsite.com and blacklist it. We see that badwebsite.com comes from IP 1.2.3.4, and we see the referrer goodwebsite.com coming from that IP too. With your idea, we would blacklist goodwebsite.com as well.

Am I understanding it right?

An example is a good idea :) I think you misunderstand how the HTTP referer works, especially this:

goodwebsite.com sends referral traffic to myprettyponey.com

When we say that a website "sends referral traffic" to another website, there is never any direct communication between the two servers.

What actually happens is the following (I reuse your example):

  1. Mr. ISurfTheWebMyWithBrowser visits goodwebsite.com
  2. Mr. ISurfTheWebMyWithBrowser clicks a link to myprettyponey.com
  3. Mr. ISurfTheWebMyWithBrowser's browser generates an HTTP request to myprettyponey.com with a "Referer: goodwebsite.com" header

Now if, at the same time, the spammer with the same IP as goodwebsite.com (1.2.3.4) sends HTTP requests to myprettyponey.com with "Referer: pornvidzlolwut.ru", what will happen is that we will block the IP 1.2.3.4. So HTTP traffic will be blocked between the two servers, but Mr. ISurfTheWebMyWithBrowser can surf as usual, because traffic coming from regular clients is unaffected.
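To show how trivial forging a referer is, here is a sketch of the raw request such a spammer script would build (no real connection is made here; the domains are the example ones above):

```python
# The referer is just a client-supplied header, so a spammer can put
# anything in it. This builds the raw HTTP/1.1 request text only.
def build_spam_request(target_host, fake_referer):
    return (
        f"GET / HTTP/1.1\r\n"
        f"Host: {target_host}\r\n"
        f"Referer: {fake_referer}\r\n"
        f"User-Agent: Mozilla/4.0 (compatible; MSIE 6.0)\r\n"
        f"\r\n"
    )

req = build_spam_request("myprettyponey.com", "http://pornvidzlolwut.ru/")
print("Referer: http://pornvidzlolwut.ru/" in req)  # True
```

The analytics server sees only this request and the client's IP; goodwebsite.com's server is never involved at all.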

mnapoli commented 9 years ago

We are still not talking about the same thing :) We understand each other on how HTTP referrer works, and again I am not talking about blocking an IP address.

I am talking about adding domains to the blacklist based on their IP addresses. In other words:

I'm talking about the methodology you suggest to add new spammers to the blacklist.

You suggest we judge whether a referrer is spam based on the IP address of the client. But the IP address of the client could be shared for many reasons.

Here is another example:

In your logs, you will see:

1.2.3.4 - [02/Aug/2015:18:08:15] "GET / HTTP/1.1" 200 8534 "http://badwebsite.com"
1.2.3.4 - [02/Aug/2015:18:08:15] "GET / HTTP/1.1" 200 8534 "http://google.com"

If we follow your methodology, we would blacklist google.com as well.

desbma commented 9 years ago

Right, in that case there is a conflict, but if a website is hosted on 1.2.3.4, it is unaffected.

If a university or similar can't secure its own network and outgoing traffic, I see no problem with blocking traffic from it. For example, this is how Google and tons of other services work. If you send automated requests to Google from a public IP, after some point Google will send you a captcha; other services will just block you. If you are on a university network, too bad, the public IP will get blocked, but that rarely happens because there is usually a proxy that does rate limiting, has several public IPs, etc.

Anyway, the false positive scenario you describe is possible but very unlikely. We all know the domains mentioned above are spam. You can be super cautious about adding new domains, but it will hurt your own interests in the end.

The increase in spam I see leaves no doubt that this is a large-scale operation. Soon your Piwik users will wonder why their sites are becoming so popular in Russia ;)

mnapoli commented 9 years ago

For the record, I've created a "waiting confirmation" tag and tagged issues and pull requests accordingly.

calebpaine commented 9 years ago

I think blocking IP addresses, or blacklisting other sites that share IPs with a known spammer, is a bad idea. You'll get tons of legitimate domains as false positives, because they just happen to be on the same shared host (such as GoDaddy, for example) as a spammer.

I also don't see this list as a real-time, instant-update service, so automated pull requests or additions to the list are a no-go. Additions need to be vetted by other administrators. I don't mind if it takes a couple of days for a new domain to be formally added; that won't adversely affect the weekly, monthly, and yearly stats.

desbma commented 9 years ago

I think blocking IP addresses, or other sites that share IPs with a known spammer are a bad idea. You'll get tons of legitimate domains that are false positives because they just happen to be on the same shared host (such as Godaddy for example) as a spammer.

This is not what I proposed. What I suggested is that as soon as we identify referer spam from an IP, we consider all domains sent as referers from that IP as spam, and enrich the list with the new domains.

Websites on shared hosts do not send requests and are not affected.

Regarding the false positives concern: I have added 44 domains since I started my fork 11 days ago, and you can check for yourself, they are all spam, 100% guaranteed :smile: I check when I have a doubt, and they are mostly small variations of domains already in the list, or the classic porn or SEO crap, all from Russia.

mnapoli commented 8 years ago

We are currently doing peer review for merging pull requests and it works well, let's close this issue!