rom1504 / img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
MIT License
3.5k stars, 330 forks

figure out how to build in a dns system #98

Open rom1504 opened 2 years ago

rom1504 commented 2 years ago

optionally

related https://github.com/rom1504/img2dataset/issues/42

kind of needed to avoid people having issues if they don't set up a good dns (not that obvious to do depending on env)

rom1504 commented 2 years ago

this is pretty much the highest priority: if in-built dns support can be done, it will make this tool work much better by default

rom1504 commented 2 years ago

Prefetching could likely be done if round robin is taken into account properly, i.e. saving all the A records and doing round robin over them.

The options for prefetching are either to prefetch everything first or to do it per batch. It is also possible not to prefetch.

Regardless, if done properly this could speed up downloading a lot, especially in places with a poor dns setup.
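A minimal sketch of the round-robin idea, not what img2dataset does: keep every A record returned for a domain and cycle over them on each request. The names `lookup_a_records` and `RoundRobinDnsCache` are hypothetical, and the lookup goes through the system resolver via `socket.getaddrinfo` rather than a dedicated DNS library. Calling `prefetch` once with all domains corresponds to "prefetch everything first"; calling it with each batch of URLs' domains corresponds to "per batch".

```python
import itertools
import socket
from typing import Callable, Dict, Iterable, Iterator, List, Optional


def lookup_a_records(domain: str) -> List[str]:
    """Return the IPv4 addresses of a domain using the system resolver."""
    infos = socket.getaddrinfo(
        domain, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # info[4] is the (ip, port) pair; keep the order and drop duplicates.
    return list(dict.fromkeys(info[4][0] for info in infos))


class RoundRobinDnsCache:
    """Hypothetical helper: stores all A records and serves them in turn."""

    def __init__(self, lookup: Callable[[str], List[str]] = lookup_a_records):
        self._lookup = lookup
        self._cycles: Dict[str, Iterator[str]] = {}

    def prefetch(self, domains: Iterable[str]) -> None:
        """Resolve the given domains once; failed lookups are skipped."""
        for domain in set(domains):
            if domain in self._cycles:
                continue
            try:
                ips = self._lookup(domain)
            except OSError:  # socket.gaierror is a subclass of OSError
                continue
            if ips:
                self._cycles[domain] = itertools.cycle(ips)

    def next_ip(self, domain: str) -> Optional[str]:
        """Return the next address for a domain, or None if it is unknown."""
        cycle = self._cycles.get(domain)
        return next(cycle) if cycle is not None else None
```

A caller would fall back to normal resolution whenever `next_ip` returns `None`.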

rom1504 commented 2 years ago

https://stackoverflow.com/a/60751327/1658314

rom1504 commented 2 years ago

https://github.com/rthalley/dnspython

rom1504 commented 2 years ago

https://github.com/DmitryFillo/berserker_resolver

rom1504 commented 2 years ago

wget http://3080.rom1504.fr/cah/domain_laion400m.parquet

from itertools import chain, islice
from multiprocessing import Pool
import os
import sys

import pandas as pd
from berserker_resolver import Resolver
from tqdm import tqdm

resolver = Resolver()
resolver.tries = 3


def batcher(iterable, batch_size):
    # Yield successive lists of up to batch_size items.
    iterator = iter(iterable)
    for first in iterator:
        yield list(chain([first], islice(iterator, batch_size - 1)))


def resolve_batch(batch):
    # Resolve one batch of domains; failures are ignored on purpose.
    try:
        resolver.resolve(batch)
    except Exception:
        pass


def mute():
    # Silence the output of the worker processes.
    sys.stdout = open(os.devnull, "w")
    sys.stderr = open(os.devnull, "w")


if __name__ == "__main__":
    domains = pd.read_parquet("domain_laion400m.parquet")
    batches = batcher(list(domains["domain"]), 10000)
    with Pool(16, maxtasksperchild=5, initializer=mute) as process_pool:
        for _ in tqdm(process_pool.imap_unordered(resolve_batch, batches)):
            pass

rom1504 commented 2 years ago

https://github.com/blechschmidt/massdns

http://3080.rom1504.fr/cah/domain_laion400m.txt

rom1504 commented 2 years ago

implemented at https://github.com/rom1504/img2dataset/tree/dns

not working well, lots of dns failures (mismatched ips)

rom1504 commented 2 years ago

import ssl

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

better by disabling ssl verification
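As a rough illustration of how such a context could be used, here is a sketch that passes it to `urllib.request.urlopen`. The helper names `make_unverified_context` and `fetch` are hypothetical and not taken from the dns branch. Note that this turns off certificate checks entirely, so it only makes sense where that trade-off is acceptable.

```python
import ssl
import urllib.request


def make_unverified_context() -> ssl.SSLContext:
    """Build an SSL context that skips hostname and certificate checks."""
    ctx = ssl.create_default_context()
    # check_hostname has to be disabled first; setting CERT_NONE while it is
    # still enabled raises a ValueError.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Download a URL without verifying the server certificate."""
    with urllib.request.urlopen(
        url, timeout=timeout, context=make_unverified_context()
    ) as response:
        return response.read()
```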

rom1504 commented 2 years ago

preresolving with massdns is not working well, and it's also not increasing speed that much

maybe connection pool + retrying would be better, see #101
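A sketch of the retrying half of that idea only (connection pooling is not covered here), assuming exponential backoff between attempts. The helper `with_retries` and its parameters are hypothetical and are not the approach tracked in #101.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fetch: Callable[[], T],
    retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying on OSError with exponential backoff.

    urllib.error.URLError, socket.gaierror and socket timeouts are all
    subclasses of OSError, so dns and connection failures are covered.
    """
    attempt = 0
    while True:
        try:
            return fetch()
        except OSError:
            if attempt >= retries:
                raise  # give up after the last allowed retry
            sleep(base_delay * 2 ** attempt)
            attempt += 1
```

It could wrap any download call, for example `with_retries(lambda: urllib.request.urlopen(url, timeout=10).read())`.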

rom1504 commented 2 years ago

probably #42 will be the only thing to do here