Open rom1504 opened 3 years ago
pip install webdataset pyyaml
from tqdm import tqdm
import torch
import webdataset as wds
dataset = wds.WebDataset([f"http://the-eye.eu/eleuther_staging/cah/releases/laion400m/{i:05d}.tar" for i in range(16)], cache_dir='/tmp/mycache')
dataset = dataset.map(lambda a: (None if "txt" not in a else a["txt"], a["jpg"]))
dataloader = torch.utils.data.DataLoader(dataset, num_workers=16, batch_size=64)
for _ in tqdm(iter(dataloader)):
    pass
This reads about 3000 samples/s before caching and about 35000 samples/s after caching.
The map transformation is necessary because a few rare images don't have captions.
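The missing-caption handling in the map step could also be written as a small named function, which is easier to test on its own than the lambda. This is only a sketch: `caption_and_image` and its `missing_caption` parameter are hypothetical names, and it assumes samples are plain dicts keyed by file extension (`"txt"`, `"jpg"`), as in the snippet above. Unlike the lambda, it substitutes an empty byte string rather than `None`, on the assumption that a placeholder of the same type is easier to batch.

```python
def caption_and_image(sample, missing_caption=b""):
    """Return (caption, image) from a raw sample dict keyed by file extension.

    Hypothetical helper: samples without a "txt" entry get `missing_caption`
    instead of raising a KeyError.
    """
    caption = sample.get("txt", missing_caption)
    return caption, sample["jpg"]


# Quick check on hand-made samples (no download needed).
with_caption = {"txt": b"a cat", "jpg": b"\xff\xd8"}
without_caption = {"jpg": b"\xff\xd8"}
print(caption_and_image(with_caption))
print(caption_and_image(without_caption))
```

It would be passed to the dataset the same way as the lambda, i.e. `dataset.map(caption_and_image)`.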
https://github.com/rom1504/clip-retrieval/blob/main/clip_retrieval/clip_inference.py#L105 is a good way to read the dataset; it should be added here as an example.
Having a package dedicated to reading datasets in various formats could be an idea.
Example for (distributed) training:
Example for (distributed) inference:
Example for statistics computation:
Most of those should be very short .py files in an examples folder that do not add any dependency to the repo. Some of them will be links to other places.
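To give an idea of how short such a file could be, here is a rough, dependency-free sketch for the statistics case. The function name `caption_stats` and the choice of statistics are illustrative assumptions, not something decided in this issue. It takes any iterable of (caption, image) pairs, such as the ones produced by the map step in the snippet above, and treats `None` or empty captions as missing.

```python
from statistics import mean


def caption_stats(pairs):
    """Count samples, missing captions, and mean caption length.

    `pairs` is any iterable of (caption, image) tuples; a caption that is
    None or empty counts as missing and is excluded from the mean.
    """
    total = 0
    missing = 0
    lengths = []
    for caption, _image in pairs:
        total += 1
        if not caption:
            missing += 1
        else:
            lengths.append(len(caption))
    return {
        "samples": total,
        "missing_captions": missing,
        "mean_caption_length": mean(lengths) if lengths else 0.0,
    }


# Hand-made pairs stand in for the real dataset here.
example = [(b"a cat", b"..."), (None, b"..."), (b"dog", b"...")]
print(caption_stats(example))
```

Because it only needs an iterable, the same function could be fed from a WebDataset pipeline or from a plain list in a test.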