Provide or link to efficient ways to read the dataset

rom1504 / img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.

MIT License


Provide or link to efficient ways to read the dataset #32

Open rom1504 opened 3 years ago

rom1504 commented 3 years ago

Example for (distributed) training

Hugging Face dataset
TensorFlow dataset
Kaggle
webdataset / PyTorch
JAX example
Keras example

Example for (distributed) inference:

CLIP batch
EfficientNet-B0 inference

Example for statistics computation:

PySpark to compute stats
PySpark to compute incremental input
Dask / pandas

Most of those should be very short .py files in an examples folder that do not add any dependency to the repo. Some of them will be links to other places.
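As an illustration of what a short, dependency-free example could look like, here is a minimal sketch that reads a webdataset-style tar shard with the standard library only. It assumes the usual webdataset layout where the files of one sample are stored next to each other and share a key (the name up to the first dot, e.g. 000000001.jpg and 000000001.txt), and that names are flat (no directories containing dots). The function name `iter_samples` is hypothetical and not part of img2dataset or webdataset.

```python
import io
import tarfile
from typing import Dict, Iterator


def iter_samples(fileobj) -> Iterator[Dict[str, bytes]]:
    """Yield one dict per sample from a webdataset-style tar shard.

    Consecutive files sharing the same key (name up to the first dot)
    are grouped into a single sample, keyed by their extension.
    """
    current_key, sample = None, {}
    # "r|*" reads the archive as a forward-only stream, so the shard
    # never needs to be seekable or fully loaded in memory.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.partition(".")
            # A new key means the previous sample is complete.
            if key != current_key and sample:
                yield sample
                sample = {}
            current_key = key
            sample["__key__"] = key
            sample[ext] = tar.extractfile(member).read()
    # Emit the last sample of the shard.
    if sample:
        yield sample
```

It could be used on a local shard with `with open("00000.tar", "rb") as f: for sample in iter_samples(f): ...`, where the file name is only a placeholder. Samples without a caption simply have no "txt" entry.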

rom1504 commented 3 years ago

https://webdataset.github.io/webdataset/writing/#writing-filters-and-offline-augmentation

https://github.com/webdataset/webdataset#webdataset

rom1504 commented 3 years ago

pip install webdataset pyyaml

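from tqdm import tqdm
import torch
import webdataset as wds

# First 16 shards of the release; downloaded shards are cached in /tmp/mycache
urls = [f"http://the-eye.eu/eleuther_staging/cah/releases/laion400m/{i:05d}.tar" for i in range(16)]
dataset = wds.WebDataset(urls, cache_dir='/tmp/mycache')
# Keep only (caption, image bytes); the caption is None when the sample has no "txt" entry
dataset = dataset.map(lambda a: (None if "txt" not in a else a["txt"], a["jpg"]))
dataloader = torch.utils.data.DataLoader(dataset, num_workers=16, batch_size=64)
# Iterate over the whole dataset only to measure read throughput
for _ in tqdm(iter(dataloader)):
    pass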

3000 samples/s before caching, 35000 samples/s after caching. The transformation (the map step) is necessary because a few rare images don't have captions.
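The missing-caption handling can also be written as a small named function. This is only a sketch: the name `to_pair` is hypothetical, and unlike the snippet above it decodes the caption to a string and substitutes an empty string instead of None, on the assumption that downstream batching code prefers a uniform type.

```python
from typing import Dict, Tuple


def to_pair(sample: Dict[str, bytes]) -> Tuple[str, bytes]:
    """Return (caption, jpg_bytes) for one raw sample.

    Assumption: a sample without a "txt" entry gets an empty caption
    rather than None, so every element of a batch has the same type.
    """
    caption = sample["txt"].decode("utf-8") if "txt" in sample else ""
    return caption, sample["jpg"]
```

It would be plugged in with `dataset = dataset.map(to_pair)` in place of the lambda.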

rom1504 commented 3 years ago

https://github.com/rom1504/clip-retrieval/blob/main/clip_retrieval/clip_inference.py#L105 is a good way to read the dataset; add it here as an example.

rom1504 commented 2 years ago

Having a package dedicated to reading datasets in various formats could be an idea.