Open rom1504 opened 2 years ago
This is pretty much the highest priority. If built-in DNS support can be done, it will make this tool work much better by default.
Prefetching could likely be done if round robin is taken into account properly, i.e. saving all the A records and doing round robin over them. The options for prefetching are either to prefetch everything first or to do it per batch. It is also possible not to prefetch. Regardless, if done properly this could speed up downloading a lot, especially in places with a poor DNS setup.
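To make the round-robin idea concrete, here is a minimal sketch, not what img2dataset does. The class name `RoundRobinDnsCache` and the helper `resolve_a_records` are hypothetical, and the resolver is injectable so it can be swapped for something faster than the system resolver. Calling `prefetch` once on all domains corresponds to "prefetch everything first"; calling it on each batch corresponds to "per batch".

```python
import itertools
import socket


def resolve_a_records(domain):
    """Return the sorted, de-duplicated IPv4 addresses the system resolver gives for a domain."""
    infos = socket.getaddrinfo(domain, None, family=socket.AF_INET, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


class RoundRobinDnsCache:
    """Hypothetical cache: stores every A record per domain and cycles over them."""

    def __init__(self, resolve=resolve_a_records):
        self._resolve = resolve
        self._cycles = {}

    def prefetch(self, domains):
        """Resolve the given domains once; domains that fail to resolve are skipped."""
        for domain in domains:
            if domain in self._cycles:
                continue
            try:
                ips = self._resolve(domain)
            except OSError:  # socket.gaierror is a subclass of OSError
                ips = []
            if ips:
                self._cycles[domain] = itertools.cycle(ips)

    def next_ip(self, domain):
        """Return the next IP for a domain in round-robin order, or None if unknown."""
        cycle = self._cycles.get(domain)
        return next(cycle) if cycle is not None else None
```

With this shape, a downloader would call `next_ip(domain)` before each request and fall back to normal resolution when it returns `None`.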
wget http://3080.rom1504.fr/cah/domain_laion400m.parquet
from berserker_resolver import Resolver
from itertools import islice, chain
from multiprocessing import Pool
from tqdm import tqdm
import pandas as pd
import sys
import os

resolver = Resolver()
resolver.tries = 3

domains = pd.read_parquet("domain_laion400m.parquet")
all_domains = list(domains["domain"])

def batcher(iterable, batch_size):
    # Yield successive lists of at most batch_size items
    iterator = iter(iterable)
    for first in iterator:
        yield list(chain([first], islice(iterator, batch_size - 1)))

batches = batcher(all_domains, 10000)

def f(a):
    # Resolve one batch of domains, ignoring any failure
    try:
        resolver.resolve(a)
    except Exception:
        pass

def mute():
    # Silence stdout and stderr in each worker process
    sys.stdout = open(os.devnull, 'w')
    sys.stderr = open(os.devnull, 'w')

with Pool(16, maxtasksperchild=5, initializer=mute) as process_pool:
    for _ in tqdm(process_pool.imap_unordered(f, batches)):
        pass
Implemented at https://github.com/rom1504/img2dataset/tree/dns. It is not working well: lots of DNS failures (mismatched IPs).
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
It works better with SSL verification disabled.
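As a rough illustration of how a pre-resolved IP and an unverified context might fit together, here is a sketch using only the standard library. It is an assumption about the general approach, not a copy of the dns branch. The helper names `build_preresolved_request` and `unverified_context` are hypothetical, and the sketch assumes IPv4 addresses (an IPv6 address would need brackets in the URL). The server certificate is issued for the hostname, not the IP, which is why verification has to be turned off here.

```python
import ssl
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def build_preresolved_request(url, ip):
    """Build a request that targets `ip` directly but keeps the original Host header."""
    parts = urlsplit(url)
    # Keep an explicit port if the original URL had one
    netloc = ip if parts.port is None else f"{ip}:{parts.port}"
    ip_url = urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
    return urllib.request.Request(ip_url, headers={"Host": parts.hostname})


def unverified_context():
    """SSL context with hostname and certificate checks disabled (insecure by design)."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx
```

The request would then be opened with something like `urllib.request.urlopen(request, timeout=10, context=unverified_context())`. Note that connecting by IP also means no SNI is sent, so servers hosting several sites on one address may still answer with the wrong site.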
Preresolving with massdns is not working well. It's also not increasing speed that much.
Maybe a connection pool plus retrying would be better, see #101.
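The retrying half of that idea could look roughly like the sketch below. It does not cover connection pooling, and it is not taken from #101. The function name `fetch_with_retries` and its parameters are hypothetical; `fetch` is any callable that downloads a URL and raises `OSError` (which includes `urllib.error.URLError` and socket timeouts) on failure.

```python
import time


def fetch_with_retries(fetch, url, retries=3, backoff=0.5, sleep=time.sleep):
    """Call fetch(url), retrying on OSError with exponential backoff between attempts."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError as err:
            last_error = err
            # Wait before the next attempt, but not after the last one
            if attempt < retries - 1:
                sleep(backoff * 2 ** attempt)
    raise last_error
```

A transient DNS failure then costs one short wait instead of a lost image, which may matter more than pre-resolving when the local resolver is flaky.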
probably #42 will be the only thing to do here
optionally
related https://github.com/rom1504/img2dataset/issues/42
This is kind of needed to avoid people running into issues if they don't set up a good DNS resolver (which is not that obvious to do, depending on the environment).