rom1504 opened 3 years ago
related https://github.com/rom1504/img2dataset/issues/56
I'm thinking of implementing the download+resize inside img2dataset since these features are already there. To pass the data to pytorch, a good way would be to add a writer to img2dataset that takes a multiprocessing queue as an attribute (https://docs.python.org/3/library/multiprocessing.html#pipes-and-queues) and then to consume that queue in an iterable dataset (https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset). Since a queue is process- and thread-safe, it would work on the producer side (img2dataset produces from multiple processes) and on the consumer side (the torch dataloader could apply any resizing/batching in different processes).
img2dataset would not need to depend on pytorch, since implementing an iterable dataset only requires having a class with an __iter__ method
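As a rough sketch of the idea (the class and attribute names here are illustrative, not existing img2dataset APIs): a writer pushes samples into a multiprocessing queue from the worker processes, and a plain class with an __iter__ method drains it, which is all the torch IterableDataset protocol needs, so nothing here imports pytorch.

```python
# Sketch only: hypothetical QueueWriter / UrlStreamingDataset names.
import multiprocessing

_SENTINEL = None  # pushed once the producers are finished


class QueueWriter:
    """Producer side: called from img2dataset worker processes."""

    def __init__(self, queue):
        self.queue = queue

    def write(self, sample):
        self.queue.put(sample)

    def close(self):
        self.queue.put(_SENTINEL)


class UrlStreamingDataset:
    """Consumer side: iterates until the sentinel is seen.

    Having __iter__ is enough for torch's IterableDataset protocol,
    so this module does not need to depend on pytorch.
    """

    def __init__(self, queue):
        self.queue = queue

    def __iter__(self):
        while True:
            sample = self.queue.get()
            if sample is _SENTINEL:
                return
            yield sample


if __name__ == "__main__":
    q = multiprocessing.Queue()
    writer = QueueWriter(q)
    for i in range(3):
        writer.write({"image": b"...", "caption": f"cap {i}"})
    writer.close()
    samples = list(UrlStreamingDataset(q))
    print(len(samples))  # 3
```

On the torch side, `DataLoader(UrlStreamingDataset(q), batch_size=...)` would then handle batching in the consumer processes.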
the filtering / index-retrieval part would however make more sense to live here, so clip-retrieval could depend on img2dataset and use its UrlStreamingDataset to provide a FilteredUrlStreamingDataset
let's hope this can be made to work at the same speed as img2dataset (1300 samples/s)
https://github.com/rom1504/img2dataset/issues/82
Could be interesting to investigate this path
The img2dataset service can also expose a shard endpoint that takes some url and caption files as input and turns them into shard files. The autofaiss service can also expose a train endpoint and a merge endpoint. The clip inference service can also expose a combine endpoint to turn N embedding files into one.
Then all that is needed will be an orchestrator with a metadata database, that makes sure all the shards are properly done.
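The orchestrator's metadata database could be something as small as a sqlite table tracking shard state. A minimal sketch (the schema and class name are made up for illustration):

```python
# Illustrative only: a tiny shard-state store the orchestrator could use
# to know which shards are done and which still need to be (re)submitted.
import sqlite3


class ShardTracker:
    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS shards ("
            "shard_id INTEGER PRIMARY KEY, status TEXT NOT NULL)"
        )

    def register(self, shard_ids):
        # record new shards as pending; re-registering is a no-op
        self.db.executemany(
            "INSERT OR IGNORE INTO shards VALUES (?, 'pending')",
            [(s,) for s in shard_ids],
        )

    def mark_done(self, shard_id):
        self.db.execute(
            "UPDATE shards SET status = 'done' WHERE shard_id = ?",
            (shard_id,),
        )

    def pending(self):
        rows = self.db.execute(
            "SELECT shard_id FROM shards WHERE status = 'pending' "
            "ORDER BY shard_id"
        )
        return [r[0] for r in rows]


tracker = ShardTracker()
tracker.register(range(3))
tracker.mark_done(1)
print(tracker.pending())  # [0, 2]
```

The orchestrator loop would then just call the services' shard/train/merge endpoints for whatever `pending()` returns until it is empty.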
Benefits:
To check:
how to plug that kind of thing into a spark job (run the services as a background task in each executor?) if needed https://medium.com/geekculture/how-to-execute-a-rest-api-call-on-apache-spark-the-right-way-in-python-4367f2740e78 https://stackoverflow.com/questions/59216604/how-to-call-a-web-service-called-from-a-spark-job -> probably doesn't make a lot of sense
how to start/kill such services when you run some training code
new idea: rethink all these tools as dataflow/stream transformers, taking as input a collection and producing an output collection, with optional caching and back pressure
reader:
writer:
transformer:
These bricks could then be naturally composed to form downloaders, inferences and indexers
defining good interfaces for each subtool, then making each tool a separate package, well tested and with good examples
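A minimal sketch of how such bricks might compose (the function names and the dict-based sample format are assumptions, not existing interfaces): a reader yields items, a transformer maps a stream to a stream, a writer consumes the stream. Since everything is a lazy generator, back pressure comes for free, and caching could wrap any stage.

```python
# Sketch only: hypothetical reader/transformer/writer interfaces.
from typing import Iterable, Iterator


def reader(urls: Iterable[str]) -> Iterator[dict]:
    # a real reader would download and parse; here we just wrap the input
    for url in urls:
        yield {"url": url}


def resize_transformer(stream: Iterator[dict]) -> Iterator[dict]:
    # a real transformer would resize images; here we just tag the item
    for item in stream:
        item["resized"] = True
        yield item


def writer(stream: Iterator[dict]) -> list:
    # a real writer would produce shard files; here we collect to a list
    return list(stream)


def compose(source, *stages):
    stream = source
    for stage in stages:
        stream = stage(stream)
    return stream


out = writer(compose(reader(["a", "b"]), resize_transformer))
print(out)  # [{'url': 'a', 'resized': True}, {'url': 'b', 'resized': True}]
```

A downloader would then be reader + resize transformer + shard writer; an inference pipeline would swap in an embedding transformer; an indexer would swap in an indexing writer.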
Check if https://docarray.jina.ai/fundamentals/documentarray/ could be helpful to build this
This new structure should make it possible to make all these tools both more powerful and more reusable
let's first try and check how to read a large file in parallel with fsspec
reading a large file with fsspec works by seeking to an offset and reading up to a length; it's much faster that way
next step will be implementing a clean embedding-reader package
independently I think that https://towardsdatascience.com/data-pipelines-with-apache-beam-86cd8eb55fd8 looks good
this is an online version of https://github.com/rom1504/clip-retrieval/issues/31: combine the whole pipeline not as a big batch job, but instead as a data loader that streams samples directly into training
It makes sense in particular when the model's training speed is low. For example, dalle is such a model; for clip it could make less sense.
it could be a lot more convenient than downloading TB of webdataset if it works: