mariusv / nginx-badbot-blocker

Block bad, possibly even malicious web crawlers (automated bots) using Nginx
861 stars · 140 forks

Is this repository still updated regularly? #7

Closed sarukuku closed 7 years ago

sarukuku commented 8 years ago

I'm just wondering if the configs are safe for production use as is.

mariusv commented 8 years ago

Heya,

I personally use them in production on quite a few servers. Sadly I don't have enough time to update the IP blacklist as often as I would want, but the rest is updated as soon as a new "badbot" appears. I'm thinking of writing a script that will automatically build and update the IP blacklist soon. Also, if you have any issues you can just ping me and I would be more than happy to help.
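For anyone wondering what the IP blacklist part amounts to: it's essentially a file of Nginx `deny` directives that you include from your config. A minimal sketch (the file path and the CIDR range here are illustrative, not the repo's actual contents; the single IP is the Cyveillance address quoted later in this thread):

```nginx
# blacklist.conf -- illustrative sketch only; real entries live in the repo.
# Include this from a server (or http) block, e.g.:
#   include /etc/nginx/conf.d/blacklist.conf;
deny 38.100.21.65;      # single scanner IP
deny 192.0.2.0/24;      # CIDR ranges work too (192.0.2.0/24 is a placeholder TEST-NET range)
```

Requests from denied addresses get a 403 by default; an auto-update script would just regenerate this file and reload Nginx.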

iMaxopoly commented 8 years ago

Hi, great work with the maintenance on this. I hope users will monitor for any false positives. It can do more harm than good when that happens.

At the time of reading this I see two files, globalblacklist.conf and blacklist.conf

I'm assuming globalblacklist.conf is the more up-to-date version of the same list. Is it not?

davidchalifoux commented 8 years ago

@kryptodev It seems to be so. However, they just added it yesterday, so it could still be a WIP.

mariusv commented 8 years ago

Yes, globalblacklist.conf is a new contribution from @mitchellkrogza. As far as I understood from him, he is using it on his own servers, and I hope that this weekend I will incorporate it into blacklist.conf.

mitchellkrogza commented 8 years ago

Hi guys

Yes, I modified this based on Marius' original blacklist.conf and then added some snippets and a much more extensive list of bad referers from a list I found on Perishable Press. Marius was kind enough to allow the PR into his repo.

I also added a list of Cyveillance (LookingGlass Cyber Solutions) IPs, who apparently scan and sniff around for all sorts of stuff. I've included some information from Wikipedia about them below.

I made it a single globalblacklist.conf as I hate having too many include files lying around Nginx; the fewer there are, the fewer places there are to diagnose.
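In practice that design means one line in your config pulls in everything. A sketch (the path is an assumption; adjust it for your layout):

```nginx
http {
    # One include instead of several scattered blacklist files --
    # when something is wrongly blocked there is only one place to look.
    include /etc/nginx/conf.d/globalblacklist.conf;

    # ... rest of your http configuration ...
}
```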

I spent a few hours compiling it and did not just take each and every thing I found on the web and include it. I even stripped out one or two bad referers that had the word "image" or "pic" in them, as my main site is all about photos, pics and images and I don't want any false positives there.

It certainly needs monitoring and tweaking as time goes on, but I think it is pretty solid for most sites.

By all means let me know feedback and log any issues on the repo if you have any.

Have fun with it.

Kind Regards Mitch

Numerous websites have complained about Cyveillance's traffic for the following reasons:

  1. Their [robots](https://en.wikipedia.org/wiki/Web_crawler) access many pages, and thus use a comparatively large amount of bandwidth. [citation needed]
  2. Their robots send many fake HTTP requests which are a covert channel for deadly (accept, read, write) timeout attacks that can easily disrupt Apache and IIS servers.
  3. They ignore the [robots.txt](https://en.wikipedia.org/wiki/Robots.txt) exclusion standard, which specifies pages that should not be accessed by robots. [citation needed]
  4. They use a falsified [user-agent](https://en.wikipedia.org/wiki/User-agent) string, usually pretending to be some version of Microsoft [Internet Explorer](https://en.wikipedia.org/wiki/Internet_Explorer) on some version of [Windows](https://en.wikipedia.org/wiki/Microsoft_Windows), which is deceptive and can throw off log analysis. (Interestingly, this is one way to identify the crawler, as it often lists 'Windows XP' in the user-agent, while a real Windows XP system actually identifies itself as 'Windows NT 5.1'. This method should not be depended on for positive identification, however, as Cyveillance has been known to change its user-agent strings from time to time; both "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2)" and "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)" have been seen.) Below is a sample from an actual [Apache HTTP Server](https://en.wikipedia.org/wiki/Apache_HTTP_Server) log file showing an IP address that belongs to Cyveillance and the faked User-Agent browser identification string:

```
38.100.21.65 - - [05/Jan/2013:17:31:19 -0500] "GET / HTTP/1.1" 200 6163 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2)"
38.100.21.65 - - [05/Jan/2013:17:31:19 -0500] "GET /styles.css HTTP/1.1" 200 5092 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2)"
```

  5. The company does not always respond to cease and desist letters. [citation needed]
  6. Because they falsify their user-agent string and otherwise obscure their identity (they may also appear in weblogs as PSINet), individuals may not be aware of the existence of Cyveillance and the data it collects and reports to the Secret Service. [[2]](https://en.wikipedia.org/wiki/Cyveillance#cite_note-dhs.gov-2)
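As an illustration of how a blocker can act on the faked user-agents quoted above, here is a hedged Nginx sketch (not the repo's actual rules) using a `map` in the `http` context plus an early return in a `server` block:

```nginx
# Sketch only: flag the exact faked Cyveillance user-agent strings seen above.
map $http_user_agent $bad_cyveillance_ua {
    default 0;
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2)" 1;
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)" 1;
}

server {
    listen 80;

    if ($bad_cyveillance_ua) {
        return 444;   # close the connection without sending a response
    }

    # ... rest of the server configuration ...
}
```

Note the caveat: since these strings imitate a real (if ancient) browser, matching them can also block legitimate old IE clients, which is why blocking by IP range is the more reliable approach for this particular crawler.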


mariusv commented 7 years ago

I will close this as the question has been answered. If you feel like you didn't get an answer, please feel free to re-open it.

Thank you!