Open jelmervdl opened 1 year ago
Also, there's not really a good reason to use cc-download.py
as this is basically already sufficient:
HOST=http://data.commoncrawl.org
CRAWL=CC-MAIN-2022-49
curl -L "$HOST/crawl-data/$CRAWL/warc.paths.gz" | gzip -cd | parallel wget -cq "$HOST/{}"
I'm not sure whether Python does a DNS resolve for every urlopen call or not. I noticed that data.commoncrawl.org returns multiple IPs, so we could spread the load over multiple CloudFront servers.
If yes, maybe Python can be a bit more efficient by caching the DNS results.
If no, maybe we can hook into the resolver to return a randomly selected IP from the returned IPs each time.
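A rough sketch of what such a resolver hook could look like, covering both cases at once: it caches the lookup per host and shuffles the returned addresses on every call. This assumes urlopen ends up in socket.create_connection, which resolves through socket.getaddrinfo and tries the addresses in order. The names make_caching_resolver and install_resolver are made up for illustration, and the cache has no TTL, so it is only meant for a short-lived download job.

```python
import random
import socket


def make_caching_resolver(resolve=socket.getaddrinfo):
    """Wrap a getaddrinfo-style function with a cache and random ordering.

    `resolve` is injectable so the wrapper can be exercised without network
    access. The cache never expires (an assumption that only suits
    short-lived processes).
    """
    cache = {}

    def resolver(host, port, *args, **kwargs):
        key = (host, port, args, tuple(sorted(kwargs.items())))
        if key not in cache:
            # Only the first lookup for this key reaches the real resolver.
            cache[key] = list(resolve(host, port, *args, **kwargs))
        # Return a shuffled copy so callers that try addresses in order
        # start with a different IP each time.
        results = list(cache[key])
        random.shuffle(results)
        return results

    return resolver


def install_resolver():
    """Monkeypatch the process-wide resolver (hypothetical helper)."""
    socket.getaddrinfo = make_caching_resolver(socket.getaddrinfo)
```

Calling install_resolver() once before the first urlopen should be enough. Note that it patches the resolver for the whole process, not just for the download code.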