Open anjackson opened 2 years ago
Also perhaps it's possible to spot CDNs from IPs/reverse-DNS/response headers. The https://github.com/nicjansma/cdn-detector.js/ project indicates this can work, but also looks like a pain to keep up to date (I think its Fastly rule may already be broken).
Some (inc. Fastly) declare an `X-CDN:` header, but it's not clear how many. Just spotting e.g. `[-,]+cdn[1234567890]*,` in SURTs might be more accurate, as that's largely how I'm able to identify them from the Retired Queue report!
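As a rough sketch of the SURT-matching idea, something like the following Python could flag likely CDN hosts. The function names `host_to_surt` and `looks_like_cdn` are made up for illustration, and the SURT conversion is a simplified assumption (reversed, comma-separated host labels with a trailing comma) rather than what the crawler itself produces.

```python
import re

# The pattern from the comment above: a label that is, or ends with
# "-cdn"/"cdn" plus optional digits, terminated by the SURT comma.
CDN_LABEL = re.compile(r"[-,]+cdn[0-9]*,")


def host_to_surt(host: str) -> str:
    """Simplified SURT host part (assumption): reversed labels joined by
    commas, with a trailing comma, e.g. 'cdn.example.com' -> 'com,example,cdn,'."""
    return ",".join(reversed(host.lower().split("."))) + ","


def looks_like_cdn(surt: str) -> bool:
    """Return True if the SURT contains a label matching the CDN pattern."""
    return CDN_LABEL.search(surt) is not None


if __name__ == "__main__":
    for host in ["static-cdn2.example.com", "cdn.example.com", "www.example.com"]:
        print(host, looks_like_cdn(host_to_surt(host)))
```

This only catches hosts that name themselves as CDNs, so it would miss labels such as `cdn-images` and any CDN with an unrelated hostname. It probably works best as a first pass alongside a hand-maintained list.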
The 2021 Domain Crawl missed quite a lot of items because it treats CDNs like normal hosts and is far too 'polite', which means we never get caught up. We should add a sheet to make them go faster, but this needs a bit of research to see how fast it is safe for us to go.
Known CDNs include (this is just from scanning the sample of 2000 retired queues from DC 2021 that the Frontier Report shows; there were many more sites that hit the cap):
And from Slack (not sure if they want tagging here) "not a CDN, but I need to special-case domains like doi.org (and variants dx.doi.org etc) for scholarly crawling", so: