rom1504 / img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
MIT License

Add preresolving feature #43

Open rom1504 opened 3 years ago

rom1504 commented 3 years ago

It was recently noticed that LAION-400M only contains URLs from 5M domains. The same is probably true for other datasets.

Pre-resolving the domains would greatly reduce the load on the DNS resolver and increase download speed.
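A minimal sketch of what pre-resolving could look like, assuming a plain thread pool and the standard library resolver. The names `preresolve`, `lookup` and `workers` are hypothetical and not part of img2dataset; the lookup function is injectable so another resolver can be swapped in.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def preresolve(domains, lookup=socket.gethostbyname, workers=64):
    """Resolve each distinct domain once and return {domain: ip}.

    Hypothetical helper: domains that fail to resolve are skipped.
    """
    def resolve_one(domain):
        try:
            return domain, lookup(domain)
        except OSError:  # socket.gaierror is a subclass of OSError
            return domain, None

    # Deduplicate first so each domain is looked up a single time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(resolve_one, set(domains)))
    return {domain: ip for domain, ip in results if ip is not None}
```

The resulting mapping would still have to be wired into the downloader, which this sketch does not attempt.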

rom1504 commented 3 years ago

This would need to handle DNS load balancing properly.
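One possible way to approach this, as a rough sketch only: keep every address returned for a domain instead of a single one, and rotate through them per request. `resolve_all` and `RotatingCache` are hypothetical names, and this does not cover load balancing that relies on short TTLs or on the client's location.

```python
import itertools
import socket


def resolve_all(domain, getaddrinfo=socket.getaddrinfo):
    """Return all distinct IPs for a domain, in the order the resolver gave them."""
    infos = getaddrinfo(domain, 443, type=socket.SOCK_STREAM)
    ips = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in ips:
            ips.append(ip)
    return ips


class RotatingCache:
    """Hypothetical cache that hands out a domain's IPs round-robin."""

    def __init__(self):
        self._cycles = {}

    def add(self, domain, ips):
        if ips:
            self._cycles[domain] = itertools.cycle(ips)

    def next_ip(self, domain):
        return next(self._cycles[domain])
```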

rom1504 commented 3 years ago

Tried https://stackoverflow.com/a/15065711/1658314 without much success.

rom1504 commented 9 months ago

This approach seems to work very well:

  1. Get the unique domains from the URL list.
  2. Configure Knot Resolver to use a very long TTL.
  3. Use dnsperf to query all the domains.
  4. Run img2dataset.

dnsperf -f inet -s 10.80.97.250 -d /tmp/list.txt -l 3600 -S 10 -Q 1000 -q 100 2>&1 | grep -v Timeout | grep -v "maybe timed out"

https://github.com/DNS-OARC/dnsperf/blob/master/README.md
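For step 1, a small sketch of how /tmp/list.txt could be produced, assuming the URLs are available as an iterable of strings. The function names are hypothetical; the output uses dnsperf's datafile format of one "name type" pair per line.

```python
from urllib.parse import urlsplit


def unique_domains(urls):
    """Extract the distinct hostnames from an iterable of URLs."""
    domains = set()
    for url in urls:
        try:
            host = urlsplit(url.strip()).hostname  # already lowercased
        except ValueError:  # malformed URL, e.g. a broken IPv6 literal
            continue
        if host:
            domains.add(host)
    return sorted(domains)


def write_dnsperf_datafile(domains, path, record_type="A"):
    """Write one '<domain> <record type>' query per line."""
    with open(path, "w", encoding="utf-8") as f:
        for domain in domains:
            f.write(f"{domain} {record_type}\n")
```

Usage would be something like `write_dnsperf_datafile(unique_domains(urls), "/tmp/list.txt")` before running the dnsperf command above.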

I think this is pretty promising, and it may be interesting to try putting it directly in img2dataset (at least the dnsperf part).

rom1504 commented 9 months ago

https://knot-resolver.readthedocs.io/en/stable/daemon-scripting.html

rom1504 commented 9 months ago

https://knot-resolver.readthedocs.io/en/stable/modules-stats.html#built-in-statistics