ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Add faster/parallel queues for known CDNs #79

Open anjackson opened 2 years ago

anjackson commented 2 years ago

The 2021 Domain Crawl missed quite a lot of items because it treats CDNs like normal hosts and is far too 'polite' with them, which means the crawl never catches up on those queues. We should add a sheet to make them go faster, but this needs a bit of research to see how fast it is safe for us to go.

Known CDNs include the following. This is just from scanning the sample of 2000 retired queues from DC 2021 that the Frontier Report shows; there were many more sites that hit the cap.

com,shopify,cdn,
com,wixstatic,static,
com,squarespace-cdn,images,
com,amazonaws,s3,primarysite-prod-sorted,
com,bigcommerce,cdn11,
com,squarespace,static1,
com,wp,i0,
com,wp,i1,
com,wp,i2,
jp,imgz,c,
me,rocketcdn,
com,rs-cdn,uk, 
uk,co,sykesassets,property-images-cdn, 
cymru,cyfoethnaturiol,cdn, 
net,ekm,cdn,
com,packhelp,cdn,static,
com,lw-cdn, 
io,statically,cdn, 
net,b-cdn,
com,rackcdn,
com,rackcdn,cf3,ssl,24a04536d882ca0087a3-289132c7eabba70668e526ce8cd83a46, [???]
com,myportfolio,pro2-bar-s3-cdn-cf4, [???]
com,smushcdn,664305,
com,productserve,images2, 
com,stackpathcdn,
uk,co,foodism,cdn,
io,accentuate,cdn, 
uk,co,love4lighting,cdn, 
net,lightgalleries,cdn,
com,jimcdn,image, 
com,tildacdn,static, 
uk,co,ednology,marketplace,cdn, 
uk,co,bargainmax,cdn,
com,tripadvisor,dynamic-media-cdn, 
uk,co,express,images,cdn, 
net,sz-cdn,uk, 
com,shgcdn,i, 
com,schooljotter2,cdn,img2, 
com,uenicdn,img77, 
com,ucarecdn, [??? brings in full site?]
com,dvipcdn,f, 
com,kajabi-cdn,kajabi-storefronts-production, 
com,sqspcdn,1,static1, 
net,website-editor,le-cdn, 
com,aiircdn,mmo, 
com,schooljotter2,cdn,img, 
events,asp,cdn, 
com,cdn-website,irp,
com,cdn-website,lirp,
net,nccdn,0501,
net,create-cdn,sites, 
com,simplesite,cdn, 
net,secureservercdn, 
uk,co,atcdn,m, 
com,googleapis,storage,
com,editmysite,cdn2, 
com,multiscreensite,lirp-cdn, 
com,amazonaws,s3-eu-west-1 [???]
https://s3-eu-west-1.amazonaws.com/cdn.webfactore.co.uk/sr_274624.png?1537558108
uk,co,tropicalsky,cdn1, 
uk,co,tropicalsky,cdn2, [???] 
uk,co,memiah,cdn, 
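
For anyone extending the list, here is a minimal Python sketch of how a URL's host maps onto the reversed, comma-terminated form used above, and how a prefix from the list would then be matched. The function names are hypothetical, and this only handles the host part; it is not the full SURT canonicalisation that Heritrix performs.

```python
from urllib.parse import urlsplit


def host_to_surt_key(url: str) -> str:
    """Reverse the host labels and join them with commas, ending in a comma.

    Hypothetical helper: mirrors the notation of the list above only,
    and ignores scheme, port and path.
    """
    host = urlsplit(url).hostname  # urlsplit lower-cases the host for us
    if host is None:
        raise ValueError(f"no host in URL: {url!r}")
    labels = host.rstrip(".").split(".")
    return ",".join(reversed(labels)) + ","


def in_fast_list(url: str, prefixes: list[str]) -> bool:
    """True if the URL's host key starts with any of the listed prefixes."""
    key = host_to_surt_key(url)
    return any(key.startswith(prefix) for prefix in prefixes)


if __name__ == "__main__":
    # A few prefixes copied from the list in this issue.
    prefixes = ["com,shopify,cdn,", "com,amazonaws,s3-eu-west-1,", "org,doi,"]
    url = "https://s3-eu-west-1.amazonaws.com/cdn.webfactore.co.uk/sr_274624.png?1537558108"
    print(host_to_surt_key(url))
    print(in_fast_list(url, prefixes))
```

Because matching is by prefix, a single entry such as org,doi, would also cover variants like dx.doi.org.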

And from Slack (not sure if they want tagging here) "not a CDN, but I need to special-case domains like doi.org (and variants dx.doi.org etc) for scholarly crawling", so:

org,doi,

anjackson commented 2 years ago

Also, perhaps it's possible to spot CDNs from IPs, reverse DNS, or response headers. The https://github.com/nicjansma/cdn-detector.js/ project indicates this can work, but it also looks like a pain to keep up to date (I think its Fastly rule may already be broken).

Some (inc. Fastly) declare an X-CDN: header, but it's not clear how many. Just spotting e.g. [-,]+cdn[1234567890]*, in SURTs might be more accurate, as that's largely how I'm able to identify them from the Retired Queue report!
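
As a rough sketch of the two heuristics above, assuming queue keys in the same comma-separated form as the list in this issue: the pattern is the one quoted above (with [0-9] standing in for [1234567890]), and the function names are hypothetical.

```python
import re
from typing import Mapping

# Pattern from the comment above: a '-' or ',' separator, then "cdn",
# optional digits, then the closing ',' of that host label.
CDN_LABEL = re.compile(r"[-,]+cdn[0-9]*,")


def looks_like_cdn(surt_key: str) -> bool:
    """True if a host label in the key ends in 'cdn' plus optional digits."""
    return CDN_LABEL.search(surt_key) is not None


def declares_cdn(headers: Mapping[str, str]) -> bool:
    """True if the response carries an X-CDN header (name is case-insensitive)."""
    return any(name.lower() == "x-cdn" for name in headers)


if __name__ == "__main__":
    for key in ["com,shopify,cdn,", "com,squarespace-cdn,images,",
                "com,bigcommerce,cdn11,", "me,rocketcdn,", "com,cdn-website,irp,"]:
        print(key, looks_like_cdn(key))
```

Note that the pattern as written does not match entries like me,rocketcdn, or com,cdn-website,irp, from the list, so it would only be a first pass and the manual list would still be needed.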