rom1504 / img2dataset

Easily turn large sets of image URLs into an image dataset. Can download, resize, and package 100M URLs in 20h on one machine.
MIT License

Dataloader feature: make it possible to run training/inference in streaming #82

Open rom1504 opened 2 years ago

rom1504 commented 2 years ago

This would make it possible to use img2dataset for inference directly, without saving to disk.

rom1504 commented 2 years ago

Using fsspec to provide an in-memory file system, producing small shards, and yielding them could be a good approach.
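The idea above can be sketched with the standard library: build a small tar "shard" entirely in memory, then yield its samples without touching disk. A real implementation would use fsspec's `memory://` filesystem; `io.BytesIO` is a stdlib stand-in here, and all function names are illustrative rather than img2dataset API.

```python
import io
import tarfile

def build_shard(samples):
    """Pack (key, bytes) pairs into an in-memory tar shard."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, data in samples:
            info = tarfile.TarInfo(name=f"{key}.jpg")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf

def iter_shard(buf):
    """Yield (name, bytes) samples from an in-memory shard."""
    with tarfile.open(fileobj=buf, mode="r") as tar:
        for member in tar:
            yield member.name, tar.extractfile(member).read()

# Tiny demonstration with fake image bytes.
shard = build_shard([("000000", b"fake-jpeg-1"), ("000001", b"fake-jpeg-2")])
samples = list(iter_shard(shard))
```

Keeping shards small bounds the memory used per in-flight shard, which is what makes the streaming use case practical.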

rom1504 commented 2 years ago

The final goal of this is to provide an img2dataset-based dataloader.

rom1504 commented 2 years ago

https://github.com/rom1504/img2dataset/issues/93#issuecomment-1008088388 could be a good solution here too. Using https://docs.python.org/3/library/multiprocessing.html, put the driver in a subprocess. Then in the main process, check which shards are done and hand them to the user; old shards can also be deleted automatically under an option.

That means saving things to disk rather than staying purely in memory, but I think it's a decent option because resized images are not that big. Later on, this can be expanded to work with a memory-mounted disk as well.
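A minimal sketch of this subprocess-plus-polling scheme, assuming a hypothetical layout where the driver writes a `_DONE` marker next to each finished shard (the marker convention and all names are illustrative, not img2dataset's actual output format):

```python
import multiprocessing
import pathlib
import tempfile
import time

def driver(out_dir, n_shards):
    """Stand-in for the download driver: produce shards one by one."""
    out = pathlib.Path(out_dir)
    for i in range(n_shards):
        shard = out / f"{i:05d}.tar"
        shard.write_bytes(b"shard-data")  # real code would write a tar of images
        (out / f"{i:05d}_DONE").touch()   # marker: this shard is complete

def iter_done_shards(out_dir, n_shards, delete=True):
    """Generator over completed shards; optionally deletes consumed shards."""
    out = pathlib.Path(out_dir)
    seen = 0
    while seen < n_shards:
        for marker in sorted(out.glob("*_DONE")):
            shard = out / (marker.name[:-len("_DONE")] + ".tar")
            yield shard.read_bytes()
            marker.unlink()
            if delete:
                shard.unlink()  # bound disk usage by dropping old shards
            seen += 1
        time.sleep(0.01)  # poll until the driver finishes more shards

with tempfile.TemporaryDirectory() as d:
    p = multiprocessing.Process(target=driver, args=(d, 3))
    p.start()
    shards = list(iter_done_shards(d, 3))
    p.join()
```

Because the consumer only ever sees shards with a completion marker, partially written shards are never handed to the user.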

rom1504 commented 2 years ago

I quite like this idea because it will work for all distributors

That should make it possible for the main process to become a generator

A generator of samples for local training, or of file names for distributed training/inference.

A PySpark source can be built on top of batches of files. PySpark streaming may also be a good way to build a source from a stream of files.

rom1504 commented 2 years ago

https://pytorch.org/docs/stable/notes/multiprocessing.html — something based on Queue could work for the local distributor; check how to do it.
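A Queue-based local distributor could look roughly like this: worker processes push finished samples onto a shared `multiprocessing.Queue`, and the main process drains it as a generator. This is a sketch under assumed names (`worker`, `stream_samples`), not the actual distributor code.

```python
import multiprocessing

def worker(urls, queue):
    """Stand-in for a download worker: emit one sample per URL."""
    for url in urls:
        queue.put({"url": url, "image": b"fake-bytes"})
    queue.put(None)  # sentinel: this worker is done

def stream_samples(url_chunks):
    """Run one worker per chunk and yield samples as they arrive."""
    queue = multiprocessing.Queue(maxsize=1000)  # bounded: applies backpressure
    procs = [multiprocessing.Process(target=worker, args=(chunk, queue))
             for chunk in url_chunks]
    for p in procs:
        p.start()
    done = 0
    while done < len(procs):
        item = queue.get()
        if item is None:
            done += 1  # one sentinel per finished worker
        else:
            yield item
    for p in procs:
        p.join()

samples = list(stream_samples([["a", "b"], ["c"]]))
```

The bounded queue is the key design point: if training consumes samples slower than workers download them, the workers block instead of filling memory.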

rom1504 commented 2 years ago

https://stackoverflow.com/questions/43078980/python-multiprocessing-with-generator

rom1504 commented 2 years ago

interface:

```python
from img2dataset import download

dataset = download(output_format="stream")
for sample in dataset:
    print(sample)
    break
```

The main idea is to use a queue. It could be a local in-memory queue, a queue reader (reading from the file system, for example), or even Kafka. Implement the local queue first, but keep the possibility open for better queues.
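One way to keep the queue pluggable, sketched with hypothetical names: the generator only depends on a `get()` method that returns `None` at end of stream, so a local `queue.Queue` works today and a Kafka- or filesystem-backed reader with the same contract could be dropped in later.

```python
import queue

def stream_from(q):
    """Turn any queue-like object (get() -> item, None at end) into a generator."""
    while True:
        item = q.get()
        if item is None:  # end-of-stream sentinel
            return
        yield item

# Local in-memory queue as the first backend.
q = queue.Queue()
for s in [{"key": "000"}, {"key": "001"}, None]:
    q.put(s)
samples = list(stream_from(q))
```

Swapping backends then only means providing another object with the same `get()` semantics; the user-facing generator interface is unchanged.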

Open question: how to handle further preprocessing/batching? Check how webdataset does it.

rom1504 commented 2 years ago

Making this work similarly to aistore could be good.

rom1504 commented 2 years ago

https://github.com/webdataset/webdataset/blob/05a1ea1116781ffe3c3bc257061f2f3e51dfeb0b/webdataset/multi.py#L54

rom1504 commented 2 years ago

new idea: use /dev/shm
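The appeal of /dev/shm is that on Linux it is a tmpfs mount, so shard "files" written there live in RAM and never hit a physical drive, while keeping the normal file API. A minimal sketch (falling back to an ordinary temp directory where /dev/shm does not exist, e.g. macOS or Windows):

```python
import os
import tempfile

# Prefer the RAM-backed tmpfs when available.
shm = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()
path = os.path.join(shm, "img2dataset_shard_00000.tar")

with open(path, "wb") as f:
    f.write(b"shard-data")  # real code would write a tar of resized images

with open(path, "rb") as f:
    data = f.read()         # consumer reads the completed shard

os.remove(path)             # delete consumed shards to bound memory use
```

Since tmpfs capacity counts against RAM, deleting shards after consumption matters even more here than with an on-disk working directory.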

rom1504 commented 2 years ago

https://gist.github.com/borzunov/b9f6f0d3cea5930951892b53879dd029

rom1504 commented 2 years ago

https://gist.github.com/borzunov/5f493e3c18bfa90d4de0530eb214a250

rom1504 commented 2 years ago

This is still the highest-priority feature.

If this were possible, it would unlock:

robvanvolt commented 1 year ago

https://gist.github.com/borzunov/5f493e3c18bfa90d4de0530eb214a250

```
AttributeError: Can't pickle local object '_generate_examples_from_tables_wrapper.<locals>.wrapper'
```

robvanvolt commented 1 year ago

> new idea: use /dev/shm

@rom1504 This only works on Linux, doesn't it? But it looks good for this use case... want to try that path anyway and make it Linux-only?

Edit: I wonder whether it would be better to directly write a downloader for "in-memory" streaming of Hugging Face datasets, like https://huggingface.co/datasets/laion/laion400m. Or rather create a new repository "dataset2stream" and only reuse the loader, for example, since they already provide a streaming function for regular datasets (https://huggingface.co/docs/datasets/v1.16.1/stream.html). That's basically the link you provided (https://gist.github.com/borzunov/5f493e3c18bfa90d4de0530eb214a250) with small fixes. But optimizing download speed would again favor img2dataset. Will try to get the code from borzunov running and then will see :)

rom1504 commented 1 year ago

I think it would be useful to put it in this repo once we get something that works, so we benefit from all the options implemented here

rom1504 commented 1 year ago

Definitely curious whether you get anything working; I have not looked at this issue for a while.

robvanvolt commented 1 year ago

So after a little discussion with rom, we should split the big task of "a running streamable version of img2dataset as a dataloader / IterableDataset / ..." into smaller subtasks:

Required features:

Notes: read only completed shards

First task:

  1. Set up a DataLoader that can read and delete files
  2. Implement it in the working directory of img2dataset
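The first subtask could be sketched like this: an iterable "dataset" that scans img2dataset's working directory for completed shards, reads each one, and deletes it after use. In the real subtask this would subclass `torch.utils.data.IterableDataset`; a plain Python iterator keeps the sketch dependency-free, and the class and directory layout are illustrative only.

```python
import os
import pathlib
import tempfile

class ConsumingShardReader:
    """Iterate over shard files in a directory, deleting each after reading."""

    def __init__(self, shard_dir):
        self.shard_dir = pathlib.Path(shard_dir)

    def __iter__(self):
        for shard in sorted(self.shard_dir.glob("*.tar")):
            data = shard.read_bytes()  # real code: open the tar, yield samples
            shard.unlink()             # free space once the shard is consumed
            yield shard.name, data

# Demonstration on a fake working directory with two tiny shards.
with tempfile.TemporaryDirectory() as d:
    for i in range(2):
        (pathlib.Path(d) / f"{i:05d}.tar").write_bytes(b"x")
    out = list(ConsumingShardReader(d))
    remaining = os.listdir(d)
```

Deleting shards as they are consumed is what keeps the working directory from growing without bound during a long streaming run.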