rom1504 opened 2 years ago
Using fsspec to provide an in-memory file system, writing small shards, and yielding them as they complete could be a good approach.
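A minimal sketch of the fsspec idea: write each small shard to the `memory://` filesystem and list completed shards back. The shard naming and plain-text format here are illustrative assumptions; real img2dataset shards are tar/parquet.

```python
# Sketch: small shards on an fsspec in-memory filesystem.
# Shard layout and naming are hypothetical, for illustration only.
import fsspec

fs = fsspec.filesystem("memory")

def write_shard(shard_id, samples):
    # one file per shard; a real writer would emit tar or parquet
    path = f"/shards/{shard_id:05d}.txt"
    with fs.open(path, "w") as f:
        for s in samples:
            f.write(s + "\n")
    return path

def iter_shards():
    # yield completed shard paths in order
    for path in sorted(fs.ls("/shards", detail=False)):
        yield path

write_shard(0, ["a", "b"])
write_shard(1, ["c"])
print(list(iter_shards()))
```

Because the memory filesystem shares the fsspec file interface, the same writer code could later target a real disk or a memory-mounted one.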
the final goal of this is to provide an img2dataset-based dataloader
https://github.com/rom1504/img2dataset/issues/93#issuecomment-1008088388 could be a good solution here too.
Using https://docs.python.org/3/library/multiprocessing.html, put the driver in a subprocess. Then, in the main process, check which shards are done and hand those to the user. Old shards can also be deleted automatically under an option.
That means saving things to disk rather than keeping everything in memory, but I think it's a decent option because resized images are not that big. Later on, this can be expanded to work with a memory-mounted disk as well.
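A sketch of the subprocess-plus-polling idea: a driver process writes shards (renaming each file only once it is complete, so the reader never sees a partial shard), while the main process polls the output directory and becomes a generator of finished shard paths. All names here are hypothetical, not img2dataset's actual API.

```python
import multiprocessing as mp
import os
import tempfile
import time

def driver(out_dir, n_shards):
    # stand-in for the img2dataset driver: write each shard to a temp
    # name, then rename once complete (rename is atomic on POSIX)
    for i in range(n_shards):
        tmp = os.path.join(out_dir, f".{i:05d}.part")
        final = os.path.join(out_dir, f"{i:05d}.shard")
        with open(tmp, "w") as f:
            f.write(f"shard {i}\n")
        os.rename(tmp, final)

def iter_completed_shards(out_dir, n_shards, delete_old=False):
    # main process: poll for finished shards and yield them in order
    p = mp.Process(target=driver, args=(out_dir, n_shards))
    p.start()
    seen = 0
    while seen < n_shards:
        path = os.path.join(out_dir, f"{seen:05d}.shard")
        if os.path.exists(path):
            yield path
            if delete_old:
                # the consumer is done with this shard by the time
                # the generator resumes, so it is safe to remove
                os.remove(path)
            seen += 1
        else:
            time.sleep(0.01)
    p.join()

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        for shard in iter_completed_shards(d, 3):
            print(shard)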
I quite like this idea because it will work for all distributors
That should make it possible for the main process to become a generator
A generator of samples for local training, or of file names for distributed training/inference.
A pyspark source can be built on top of batches of files. Maybe pyspark streaming is also a good way to build a source from a stream of files
https://pytorch.org/docs/stable/notes/multiprocessing.html — something based on Queue could work for the local distributor; check how to do it.
interface:

```python
from img2dataset import download

dataset = download(output_format="stream")
for sample in dataset:
    print(sample)
    break
```
Main idea is to use a queue. It could be a local in-memory queue, a queue reader (reading from the file system, for example), or even Kafka. Implement the local queue first but keep the possibility open for better queues.
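One way to keep the queue backend swappable is a tiny interface that a local queue, a file-system queue reader, or a Kafka consumer could all implement. This is a hypothetical sketch, not img2dataset's API:

```python
from abc import ABC, abstractmethod
from queue import Queue

class SampleQueue(ABC):
    """Minimal backend-agnostic queue interface (hypothetical)."""

    @abstractmethod
    def put(self, sample): ...

    @abstractmethod
    def get(self):
        """Return the next sample (blocking)."""

class LocalQueue(SampleQueue):
    # simplest backend: an in-process queue.Queue; a file-system or
    # Kafka backend would implement the same two methods
    def __init__(self):
        self._q = Queue()

    def put(self, sample):
        self._q.put(sample)

    def get(self):
        return self._q.get()

q = LocalQueue()
q.put({"key": 0})
print(q.get())
```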
Open question: how to handle further preprocessing / batching? Check how webdataset does it.
Making this work similarly to aistore could be good.
new idea: use /dev/shm
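A sketch of the /dev/shm idea: on Linux, /dev/shm is a tmpfs mount, so ordinary file writes to it stay in RAM while keeping normal file semantics. The fallback to a regular temp directory for non-Linux systems is an assumption added here:

```python
import os
import tempfile

# /dev/shm is RAM-backed tmpfs on most Linux systems; writing shards
# there avoids disk I/O without changing any file-handling code
base = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()

shard_dir = tempfile.mkdtemp(prefix="img2dataset_", dir=base)
path = os.path.join(shard_dir, "00000.shard")
with open(path, "wb") as f:
    f.write(b"shard bytes")

print(path)
```

The appeal is that the disk-based shard writer works unchanged; only the output directory moves into memory.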
This is still the highest-priority item.
If this were possible it would unlock:
https://gist.github.com/borzunov/5f493e3c18bfa90d4de0530eb214a250
`AttributeError: Can't pickle local object '_generate_examples_from_tables_wrapper..wrapper'`
> new idea: use /dev/shm
@rom1504 This only works on Linux, doesn't it? But it looks good for this use case... want to try that path anyway and make it Linux-only?
Edit: I wonder whether it would be better to directly write a downloader for "in-memory" streaming for huggingface datasets, like https://huggingface.co/datasets/laion/laion400m. Or rather create a new repository "dataset2stream" and only reuse the loader, since they already provide a streaming function for regular datasets (https://huggingface.co/docs/datasets/v1.16.1/stream.html)? That's basically the link you provided (https://gist.github.com/borzunov/5f493e3c18bfa90d4de0530eb214a250) with small fixes. But optimizing download speeds would again favor img2dataset... Will try to get the code from borzunov running and then will see :)
I think it would be useful to put it in this repo once we get something that works, so we benefit from all the options implemented here
Definitely curious if you get anything working; I have not looked at this issue for a while.
So after a little discussion with rom, we should split the big task of a streamable version of img2dataset (as a dataloader / iterabledataset / ...) into smaller subtasks:
Required features:
Notes: read only complete shards
First task:
Useful to use img2dataset for inference directly without saving to disk
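The "read only complete shards" note above could be implemented with a done-marker convention: a shard only counts as readable once a companion marker file exists. The `.done` marker name is an assumption, borrowed from Hadoop-style `_SUCCESS` outputs:

```python
import os
import tempfile

def complete_shards(out_dir):
    # a shard counts as complete only once its ".done" marker exists,
    # so a reader never picks up a shard that is still being written
    done = {f[:-5] for f in os.listdir(out_dir) if f.endswith(".done")}
    return sorted(
        os.path.join(out_dir, f)
        for f in os.listdir(out_dir)
        if f.endswith(".shard") and f[:-6] in done
    )

d = tempfile.mkdtemp()
# shard 0 is finished (has its marker); shard 1 is still in flight
open(os.path.join(d, "00000.shard"), "w").close()
open(os.path.join(d, "00000.done"), "w").close()
open(os.path.join(d, "00001.shard"), "w").close()
print(complete_shards(d))
```

Writing the marker last (or atomically renaming the shard itself) gives the reader a simple, race-free completeness check.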