Closed · Oreolek closed this 3 months ago
This is the same bot that is blocked in the robots.txt file under FacebookBot (and also facebookscraper).
Can you post the IP addresses that bot is using, so we can establish that they belong to Facebook/Meta and are not from an impersonator spoofing their User-Agent to trick people?
I found many IPs from this block: 173.252.64.0/18. For example, 173.252.83.23 is one of them, but instead of posting individual IPs, I have given you the full subnet.
Another one besides 173.252.64.0/18 would be 66.220.144.0/20, and also 69.171.224.0/19, to get you started. A single /19 prefix can hold 8,192 addresses in theory, and a /18 can hold 16,384.
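As an aside, ranges like these can be acted on in nginx itself. Below is a minimal sketch using the stock ngx_http_geo_module, separate from this blocker; the prefixes are just the ones quoted above, so confirm current ownership with whois before trusting them.

# http context: flag requests originating from the Meta-owned
# IPv4 ranges quoted in this thread; extend the list as needed
geo $meta_network {
    default          0;
    66.220.144.0/20  1;
    69.171.224.0/19  1;
    173.252.64.0/18  1;
}

The $meta_network variable can then drive a return 403; in a server block, or serve as a key for rate limiting. A reverse DNS lookup of a sample address, followed by a forward lookup of the returned hostname, is another way to rule out impersonators.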
I got hits from:
2a03:2880::
66.220.149.0
173.252.83.0
173.252.107.0
69.171.249.0
57.141.0.0
These all resolve to Facebook, so they are legitimate. How much crawling are they doing? Can you post some log examples?
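One way to capture such examples, as a sketch assuming nginx 1.7.0 or newer for the if= parameter of access_log (the log path is illustrative), is to route matching requests to a dedicated file:

# http context: tag requests whose UA contains facebookexternalhit
map $http_user_agent $log_fbext {
    default                 0;
    "~*facebookexternalhit" 1;
}

# server context: write only the tagged requests to a separate log
access_log /var/log/nginx/fbext.log combined if=$log_fbext;

Counting requests per minute is then a matter of tallying the timestamps in that file.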
These are also IPs owned by Facebook. What exactly were they crawling that used up 50 GB?
For now you will have to add that user agent to your own custom include file with a value of 3 and reload nginx; this will block it outright. Unfortunately, this can't be done for the thousands using this blocker, as it would be a seriously breaking change.
"~*(?:\b)facebookexternalhit(?:\b)" 3;
I didn't say they used up 50 GB (@Oreolek said that), but in my case that could well be true.
Facebook is doing 400-800 requests per minute on my sites. Facebook is a very bad bot.
This has been happening for months. We have instances where 60-80% of page views are coming from facebookexternalhit. On https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/ they mention that they use a different user-agent for AI training, but it's very likely they use facebookexternalhit to avoid being blocked. The strange thing is that we observe multiple simultaneous requests even to the same URL, which doesn't make logical sense.
Yeah, it's insane how they misuse the facebookexternalhit bot, expecting us to believe all those requests are users simply posting a link on Facebook, with the external site then "just" receiving a fetch of the title or an image so the content can be displayed on fb. Looking at the volume of requests, that story is false, and Facebook is really misbehaving here.
Unfortunately you have to block this of your own accord by adding it to your own custom include. This can't be blocked mainstream, as it would cause breaking changes for many users.
"~*(?:\b)facebookexternalhit(?:\b)" 3;
A facebookexternalhit bot downloaded 50+ GB off my personal site. The site is about 500 MB in size.
Expected behavior
This bot is explicitly marked as "good" in globalblacklist.conf - it should not be. They ignore robots.txt and have no limits. Block or at least limit them.
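For the "at least limit them" option, here is a minimal rate-limiting sketch that works independently of this blocker, assuming the stock ngx_http_limit_req_module (the zone name, rate, and burst values are illustrative):

# http context: use the client address as the limit key only when
# the UA matches; an empty key exempts all other traffic
map $http_user_agent $fbext_limit_key {
    default                 "";
    "~*facebookexternalhit" $binary_remote_addr;
}
limit_req_zone $fbext_limit_key zone=fbext:10m rate=10r/m;

server {
    location / {
        # requests beyond the burst are rejected (503 by default)
        limit_req zone=fbext burst=5 nodelay;
    }
}

Whether ten requests per minute per address is generous enough depends on how often links to the site are genuinely shared.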