hplt-project / ia-download

Internet archive downloader

Check whether all IPs returned for data.commoncrawl.org are used #1

Open jelmervdl opened 1 year ago

jelmervdl commented 1 year ago

I'm not sure whether Python does a DNS lookup for every urlopen call. I noticed that data.commoncrawl.org resolves to multiple IPs, so we could spread the load over multiple CloudFront servers.

;; ANSWER SECTION:
data.commoncrawl.org.   279 IN  CNAME   ds5q9oxwqwsfj.cloudfront.net.
ds5q9oxwqwsfj.cloudfront.net. 39 IN A   54.230.10.119
ds5q9oxwqwsfj.cloudfront.net. 39 IN A   54.230.10.41
ds5q9oxwqwsfj.cloudfront.net. 39 IN A   54.230.10.84
ds5q9oxwqwsfj.cloudfront.net. 39 IN A   54.230.10.28

If Python does resolve on every call, maybe it could cache the DNS results and save the repeated lookups.

If it doesn't, maybe we can hook into the resolver so that it returns a randomly selected IP from the returned set each time.
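
A minimal sketch of the "randomly selected IP" idea, assuming only the standard library. The function name `pick_random_address` and its `resolver`/`rng` parameters are made up for illustration and are not part of cc-download.py. It asks `socket.getaddrinfo` for all IPv4 addresses of the host and picks one at random; the resolver is injectable so the selection logic can be tried without network access.

```python
import random
import socket


def pick_random_address(host, port=80, resolver=socket.getaddrinfo, rng=random):
    """Resolve host and return one of its IPv4 addresses, chosen at random.

    Hypothetical helper: `resolver` and `rng` are injectable so the
    selection can be exercised without touching the network.
    """
    infos = resolver(host, port, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr),
    # and for IPv4 sockaddr is an (ip, port) tuple.
    addresses = sorted({info[4][0] for info in infos})
    if not addresses:
        raise OSError(f"no addresses returned for {host}")
    return rng.choice(addresses)


if __name__ == "__main__":
    # Performs a real DNS lookup when run as a script.
    print(pick_random_address("data.commoncrawl.org"))
```

Connecting to the chosen IP directly would still require sending the original hostname in the `Host` header, so this only covers the selection step, not the full hook into urlopen.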

jelmervdl commented 1 year ago

Also, there's not really a good reason to use cc-download.py as this is basically already sufficient:

HOST=http://data.commoncrawl.org
CRAWL=CC-MAIN-2022-49
curl -L "$HOST/crawl-data/$CRAWL/warc.paths.gz" | gzip -cd | parallel wget -cq "$HOST/{}"
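
For reference, the URL-building step of that pipeline can be sketched in Python as well. This is an illustration only, not how cc-download.py works: `warc_urls` is a hypothetical name, and it assumes the `warc.paths.gz` listing is a gzip-compressed text file with one relative path per line, which is what the `gzip -cd | parallel wget "$HOST/{}"` part of the command relies on.

```python
import gzip


def warc_urls(paths_gz: bytes, host: str = "http://data.commoncrawl.org"):
    """Turn a gzip-compressed warc.paths listing into full download URLs.

    Hypothetical helper: `paths_gz` is the raw content of warc.paths.gz,
    assumed to hold one relative path per line.
    """
    base = host.rstrip("/")  # avoid a double slash when host ends in "/"
    for line in gzip.decompress(paths_gz).decode("utf-8").splitlines():
        path = line.strip()
        if path:  # skip blank lines
            yield f"{base}/{path.lstrip('/')}"
```

Each yielded URL corresponds to one `wget -cq` invocation in the shell version; downloading and resuming are left out of the sketch.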