rom1504 opened 3 years ago
related https://github.com/rom1504/img2dataset/issues/56
I'm thinking of implementing the download+resize inside img2dataset since these features are already there. To pass the data to pytorch, a good way would be to add a writer to img2dataset that takes a multiprocessing queue as an attribute (https://docs.python.org/3/library/multiprocessing.html#pipes-and-queues) and then to consume that queue in an iterable dataset (https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset). Since a queue is process- and thread-safe, it would work on the producer side (img2dataset produces from multiple processes) and on the consumer side (the torch dataloader could apply any resizing/batching in different processes).
img2dataset would not need to depend on pytorch, since implementing an iterable dataset only requires having a class with an __iter__ method
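As a rough sketch of the idea (the class and attribute names here are illustrative, not existing img2dataset APIs): a writer pushes samples into a multiprocessing queue from the worker processes, and a plain class with an __iter__ method drains it, which is all the torch IterableDataset protocol needs, so nothing here imports pytorch.

```python
# Sketch only: hypothetical QueueWriter / UrlStreamingDataset names.
import multiprocessing

_SENTINEL = None  # pushed once the producers are finished


class QueueWriter:
    """Producer side: called from img2dataset worker processes."""

    def __init__(self, queue):
        self.queue = queue

    def write(self, sample):
        self.queue.put(sample)

    def close(self):
        self.queue.put(_SENTINEL)


class UrlStreamingDataset:
    """Consumer side: iterates until the sentinel is seen.

    Having __iter__ is enough for torch's IterableDataset protocol,
    so this module does not need to depend on pytorch.
    """

    def __init__(self, queue):
        self.queue = queue

    def __iter__(self):
        while True:
            sample = self.queue.get()
            if sample is _SENTINEL:
                return
            yield sample


if __name__ == "__main__":
    q = multiprocessing.Queue()
    writer = QueueWriter(q)
    for i in range(3):
        writer.write({"image": b"...", "caption": f"cap {i}"})
    writer.close()
    samples = list(UrlStreamingDataset(q))
    print(len(samples))  # 3
```

On the torch side, `DataLoader(UrlStreamingDataset(q), batch_size=...)` would then handle batching in the consumer processes.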
the filtering / index-retrieval part would however make more sense to live here, so clip-retrieval could depend on img2dataset and use its UrlStreamingDataset to provide a FilteredUrlStreamingDataset
let's hope this can be made to work at the same speed as img2dataset (1300 samples/s)
https://github.com/rom1504/img2dataset/issues/82
Could be interesting to investigate this path
The img2dataset service can also expose a shard endpoint that takes some url and caption files as input and turns them into shard files. The autofaiss service can also expose a train endpoint and a merge endpoint. The clip inference service can also expose a combine endpoint to turn N embedding files into one.
Then all that is needed will be an orchestrator with a metadata database, that makes sure all the shards are properly done.
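The orchestrator's metadata database could be something as small as a sqlite table tracking shard state. A minimal sketch (the schema and class name are made up for illustration):

```python
# Illustrative only: a tiny shard-state store the orchestrator could use
# to know which shards are done and which still need to be (re)submitted.
import sqlite3


class ShardTracker:
    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS shards ("
            "shard_id INTEGER PRIMARY KEY, status TEXT NOT NULL)"
        )

    def register(self, shard_ids):
        # record new shards as pending; re-registering is a no-op
        self.db.executemany(
            "INSERT OR IGNORE INTO shards VALUES (?, 'pending')",
            [(s,) for s in shard_ids],
        )

    def mark_done(self, shard_id):
        self.db.execute(
            "UPDATE shards SET status = 'done' WHERE shard_id = ?",
            (shard_id,),
        )

    def pending(self):
        rows = self.db.execute(
            "SELECT shard_id FROM shards WHERE status = 'pending' "
            "ORDER BY shard_id"
        )
        return [r[0] for r in rows]


tracker = ShardTracker()
tracker.register(range(3))
tracker.mark_done(1)
print(tracker.pending())  # [0, 2]
```

The orchestrator loop would then just call the services' shard/train/merge endpoints for whatever `pending()` returns until it is empty.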
Benefits:
To check:
how to plug that kind of thing into a spark job (run the services as a background task in each executor?) if needed https://medium.com/geekculture/how-to-execute-a-rest-api-call-on-apache-spark-the-right-way-in-python-4367f2740e78 https://stackoverflow.com/questions/59216604/how-to-call-a-web-service-called-from-a-spark-job -> probably doesn't make a lot of sense
how to start/kill such services when you run some training code
new idea: rethink all these tools as dataflow/stream transformers, taking as input a collection and producing an output collection, with optional caching and back pressure
reader:
writer:
transformer:
These bricks could then be naturally composed to form downloaders, inferences and indexers
defining good interfaces for each subtool, then making each tool a separate package, well tested and with good examples
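A minimal sketch of how such bricks might compose (the function names and the dict-based sample format are assumptions, not existing interfaces): a reader yields items, a transformer maps a stream to a stream, a writer consumes the stream. Since everything is a lazy generator, back pressure comes for free, and caching could wrap any stage.

```python
# Sketch only: hypothetical reader/transformer/writer interfaces.
from typing import Iterable, Iterator


def reader(urls: Iterable[str]) -> Iterator[dict]:
    # a real reader would download and parse; here we just wrap the input
    for url in urls:
        yield {"url": url}


def resize_transformer(stream: Iterator[dict]) -> Iterator[dict]:
    # a real transformer would resize images; here we just tag the item
    for item in stream:
        item["resized"] = True
        yield item


def writer(stream: Iterator[dict]) -> list:
    # a real writer would produce shard files; here we collect to a list
    return list(stream)


def compose(source, *stages):
    stream = source
    for stage in stages:
        stream = stage(stream)
    return stream


out = writer(compose(reader(["a", "b"]), resize_transformer))
print(out)  # [{'url': 'a', 'resized': True}, {'url': 'b', 'resized': True}]
```

A downloader would then be reader + resize transformer + shard writer; an inference pipeline would swap in an embedding transformer; an indexer would swap in an indexing writer.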
Check if https://docarray.jina.ai/fundamentals/documentarray/ could be helpful to build this
This new structure should make it possible to make all these tools both more powerful and more reusable
let's first try and check how to read a large file in parallel with fsspec
reading a large file with fsspec works by seeking to an offset and reading up to a length; it's much faster that way
next step will be implementing a clean embedding-reader package
independently I think that https://towardsdatascience.com/data-pipelines-with-apache-beam-86cd8eb55fd8 looks good
this is an online version of https://github.com/rom1504/clip-retrieval/issues/31: combine the whole pipeline not as a big batch job, but instead as a data loader that streams samples directly into training
It makes sense in particular when the model's training speed is low. For example, dalle is such a model; for clip it could make less sense.
it could be a lot more convenient than downloading TB of webdataset if it works: