rom1504 / img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
MIT License

Provide or link to efficient ways to read the dataset #32

Open: rom1504 opened this issue 2 years ago

rom1504 commented 2 years ago

Example for (distributed) training:

Example for (distributed) inference:

Example for statistics computation:

Most of those should be very short .py files in an examples folder that do not add any dependency to the repo. Some of them will be links to other places.
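As a starting point for the statistics example, here is a minimal sketch that reads one webdataset-style tar shard with only the standard library, so it adds no dependency. It assumes that the files of one sample are stored consecutively and share a name up to the first dot (for example 00000.jpg and 00000.txt). The names `iter_samples`, `caption_stats` and `make_demo_shard` are hypothetical helpers for illustration, not part of img2dataset or webdataset.

```python
import io
import os
import tarfile


def iter_samples(fileobj):
    """Yield one dict per sample from a webdataset-style tar shard.

    Assumption: files belonging to the same sample are consecutive and share
    the same name up to the first dot of the basename.
    """
    current_key, sample = None, {}
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            dirname, base = os.path.split(member.name)
            stem, _, ext = base.partition(".")
            key = os.path.join(dirname, stem)
            if key != current_key:
                # A new key starts a new sample; emit the previous one.
                if sample:
                    yield sample
                current_key, sample = key, {"__key__": key}
            sample[ext] = tar.extractfile(member).read()
    if sample:
        yield sample


def caption_stats(samples):
    """Count samples, missing captions, and the mean caption length in characters."""
    total = with_caption = caption_chars = 0
    for s in samples:
        total += 1
        if "txt" in s:
            with_caption += 1
            caption_chars += len(s["txt"].decode("utf-8", errors="replace"))
    mean_len = caption_chars / with_caption if with_caption else 0.0
    return {
        "samples": total,
        "missing_captions": total - with_caption,
        "mean_caption_length": mean_len,
    }


def make_demo_shard():
    """Build a tiny in-memory shard with fake data (hypothetical, demo only)."""
    files = [
        ("00000.jpg", b"fakejpg"), ("00000.txt", b"a cat"),
        ("00001.jpg", b"fakejpg"),  # this sample has no caption
        ("00002.jpg", b"fakejpg"), ("00002.txt", b"a big dog"),
    ]
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(caption_stats(iter_samples(make_demo_shard())))
```

For a real shard, something like `caption_stats(iter_samples(open("00000.tar", "rb")))` should work the same way, where the file name is only a placeholder.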

rom1504 commented 2 years ago

https://webdataset.github.io/webdataset/writing/#writing-filters-and-offline-augmentation

https://github.com/webdataset/webdataset#webdataset

rom1504 commented 2 years ago
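# Install the dependencies first: pip install webdataset pyyaml
from tqdm import tqdm
import torch
import webdataset as wds

# First 16 shards of LAION-400M; downloaded shards are cached locally in cache_dir.
urls = [f"http://the-eye.eu/eleuther_staging/cah/releases/laion400m/{i:05d}.tar" for i in range(16)]
dataset = wds.WebDataset(urls, cache_dir='/tmp/mycache')
# Keep (caption, raw JPEG bytes); the caption is None for samples without a "txt" entry.
dataset = dataset.map(lambda a: (None if "txt" not in a else a["txt"], a["jpg"]))
dataloader = torch.utils.data.DataLoader(dataset, num_workers=16, batch_size=64)
# Iterate once over the data to measure read throughput.
for _ in tqdm(dataloader):
    pass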

Throughput is about 3000 samples/s before caching and 35000 samples/s after caching. The transformation is necessary because a few rare images don't have captions.
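The caption fallback can also be written as a small named function, which is easier to reuse and test than a lambda. This is only a sketch: `caption_and_image` and its `missing` parameter are hypothetical names, and the sample is assumed to be a dict with a "jpg" entry and an optional "txt" entry, as in the snippet above.

```python
def caption_and_image(sample, missing=None):
    """Return (caption, image_bytes) for one sample dict.

    The caption is `missing` when the sample has no "txt" entry, which
    happens for a few rare images.
    """
    return sample.get("txt", missing), sample["jpg"]


if __name__ == "__main__":
    with_caption = {"jpg": b"fakejpg", "txt": b"a cat"}
    without_caption = {"jpg": b"fakejpg"}
    print(caption_and_image(with_caption))
    print(caption_and_image(without_caption, missing=b""))
```

It could be plugged in with `dataset.map(caption_and_image)`. Passing an empty value for `missing` instead of None may be friendlier to collate functions that do not accept None.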

rom1504 commented 2 years ago

https://github.com/rom1504/clip-retrieval/blob/main/clip_retrieval/clip_inference.py#L105 is a good way to read the dataset; it should be added here as an example.

rom1504 commented 2 years ago

Having a package dedicated to reading datasets in various formats could be an idea.