rom1504 / clip-retrieval

Easily compute clip embeddings and build a clip retrieval system with them
https://rom1504.github.io/clip-retrieval/
MIT License

Implement a pytorch dataloader that filters and downloads at run time #39

Open rom1504 opened 3 years ago

rom1504 commented 3 years ago

this is an online version of https://github.com/rom1504/clip-retrieval/issues/31: combine the whole pipeline not as one big batch job, but instead as a data loader that filters and downloads samples at run time.

It makes sense in particular when the model's training speed is low; dalle, for example, is such a model. For clip it could make less sense.

it could be a lot more convenient than downloading TBs of webdataset, if it works:

  1. download a 16GB knn index and 50GB of metadata
  2. write your best keywords and how much of each you'd like (with clip thresholds)
  3. start the training on up to 400M samples
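
A minimal sketch of what step 2 could look like, assuming a faiss knn index and the openai clip package (the index path, query text, k and threshold values are illustrative, not part of any existing clip-retrieval API):

```python
import faiss
import torch
import clip

model, _ = clip.load("ViT-B/32", device="cpu")
index = faiss.read_index("knn.index")  # the ~16GB knn index from step 1

with torch.no_grad():
    tokens = clip.tokenize(["a photo of a cat"])
    query = model.encode_text(tokens).float()
    query /= query.norm(dim=-1, keepdim=True)  # clip indices use cosine similarity

similarities, ids = index.search(query.numpy(), 10000)
keep = ids[0][similarities[0] > 0.25]  # apply the clip threshold from step 2
```

The kept ids would then be looked up in the 50GB of metadata to get the urls to stream during training.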
rom1504 commented 3 years ago

related https://github.com/rom1504/img2dataset/issues/56

I'm thinking of implementing the download+resize inside img2dataset, since these features are already there. To pass the data to pytorch, a good way would be to add a writer to img2dataset that takes a multiprocessing queue (https://docs.python.org/3/library/multiprocessing.html#pipes-and-queues) as an attribute, and then to use that queue in an iterable dataset (https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset). Since a queue is both process and thread safe, it would work on the producer side (img2dataset produces from multiple processes) and on the consumer side (the torch dataloader could apply any resizing/batching in different processes).

img2dataset would not need to depend on pytorch, since implementing an iterable dataset only requires having a class with an __iter__ method
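
A minimal sketch of that queue hand-off, with hypothetical names (none of this is existing img2dataset API):

```python
import multiprocessing

_DONE = "DONE"  # sentinel pushed once per producer process when it finishes

class QueueStreamingDataset:
    """Consumer side: iterates over samples pushed onto a multiprocessing
    queue by the producer processes. Only __iter__ is needed, so the
    producer package stays pytorch-free."""

    def __init__(self, queue, num_producers):
        self.queue = queue
        self.num_producers = num_producers

    def __iter__(self):
        finished = 0
        while finished < self.num_producers:
            sample = self.queue.get()  # blocks until a producer pushes a sample
            if sample == _DONE:
                finished += 1
            else:
                yield sample

if __name__ == "__main__":
    queue = multiprocessing.Queue(maxsize=1000)  # bounded: gives back pressure
    dataset = QueueStreamingDataset(queue, num_producers=4)
```

Note that to plug this into a torch DataLoader, the torch-side wrapper would subclass torch.utils.data.IterableDataset (DataLoader dispatches on that type); the producer side itself never imports pytorch.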

rom1504 commented 3 years ago

the filtering / retrieving-from-an-index part would however make more sense to live here: clip-retrieval could depend on img2dataset and use its UrlStreamingDataset to provide a FilteredUrlStreamingDataset
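
A hypothetical sketch of that layering (UrlStreamingDataset is the img2dataset class proposed above, not an existing one):

```python
class FilteredUrlStreamingDataset:
    """clip-retrieval side: wraps an img2dataset streaming dataset and keeps
    only the samples accepted by a predicate, e.g. a knn-index lookup plus a
    clip similarity threshold."""

    def __init__(self, url_streaming_dataset, predicate):
        self.dataset = url_streaming_dataset
        self.predicate = predicate

    def __iter__(self):
        for sample in self.dataset:
            if self.predicate(sample):
                yield sample
```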

let's hope this can be made to work at the same speed as img2dataset (1300 samples/s)

rom1504 commented 2 years ago

https://github.com/rom1504/img2dataset/issues/82

Could be interesting to investigate this path

  1. img2dataset is a (multiple instances per machine) rest service that takes as input a path to a url shard and returns a path to an image shard when it's done
  2. clip inference is a (multiple instances per machine) rest service that takes as input a path to an image shard and returns a path to an embedding shard when it's done
  3. autofaiss is a (multiple instances per machine) rest service that takes as input a path to an embedding file and returns a path to an index when it's done

The img2dataset service can also expose a shard endpoint that takes as input some url and caption files and turns them into shard files. The autofaiss service can also expose a train endpoint and a merge endpoint. The clip inference service can also expose a combine endpoint to turn N embedding files into one.

Then all that is needed is an orchestrator with a metadata database that makes sure all the shards are properly done.
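
A minimal flask sketch of the shard-in/shard-out contract for the first service (the endpoint name and the download_and_resize helper are hypothetical, not existing img2dataset APIs):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def download_and_resize(url_shard_path):
    """Placeholder for the actual img2dataset download+resize of one shard."""
    return url_shard_path.replace(".parquet", ".tar")

@app.route("/shard", methods=["POST"])
def shard():
    # takes as input a path towards a url shard, returns a path towards an
    # image shard when done; the orchestrator records it in its metadata db
    url_shard_path = request.json["url_shard_path"]
    image_shard_path = download_and_resize(url_shard_path)
    return jsonify({"image_shard_path": image_shard_path})
```

The clip inference and autofaiss services would expose the same kind of path-in/path-out endpoints.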

Benefits:

To check:

rom1504 commented 2 years ago

new idea: rethink all these tools as dataflow/stream transformers, each taking as input a collection and producing an output collection, with optional caching and back pressure

reader:

writer:

transformer:

These bricks could then be naturally composed to form downloaders, inference pipelines and indexers.
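
A minimal sketch of these bricks as composed python generators (all names illustrative); laziness gives the back pressure for free, since each stage only pulls items as fast as the next one consumes them:

```python
def url_reader(url_list):
    # reader: turns an input collection into a stream of items
    for url in url_list:
        yield {"url": url}

def download_transformer(stream):
    # transformer: consumes a stream and yields enriched items
    for item in stream:
        item["image"] = b"<bytes>"  # placeholder for the actual http download
        yield item

def shard_writer(stream, shard_size=10_000):
    # writer: consumes the stream and persists it shard by shard
    shard = []
    for item in stream:
        shard.append(item)
        if len(shard) == shard_size:
            ...  # placeholder: flush the shard to disk
            shard = []

# composed pipeline: reader -> transformer -> writer
shard_writer(download_transformer(url_reader(["http://example.com/a.jpg"])))
```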

Define good interfaces for each subtool, then make each tool a separate package, well tested and with good examples.

Check if https://docarray.jina.ai/fundamentals/documentarray/ could be helpful to build this

This new structure should make it possible to make all these tools both more powerful and more reusable

rom1504 commented 2 years ago

related https://github.com/webdataset/webdataset/blob/main/notebooks/openimages.ipynb

rom1504 commented 2 years ago

let's first try and check how to read a large file in parallel with fsspec

rom1504 commented 2 years ago

reading a large file with fsspec works by seeking to an offset and reading up to a given length; it's much faster
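
A minimal sketch of that access pattern (the read_range helper name and the s3 path are placeholders); on remote filesystems the seek translates to a range request instead of a full download, so several workers can each fetch their own slice in parallel:

```python
import fsspec

def read_range(path, offset, length):
    """Read `length` bytes starting at `offset` from a possibly remote file."""
    with fsspec.open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

# e.g. fetch one float32 row of dim 512 from a raw binary embedding file
chunk = read_range("s3://bucket/embeddings/part0.bin", offset=0, length=4 * 512)
```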

rom1504 commented 2 years ago

next step will be implementing a clean embedding-reader package
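
A hypothetical sketch of what such a package's core could look like, built on the range reads above; it assumes a headerless raw binary file of fixed-size float16 rows (the real format and API were still to be decided at this point):

```python
import fsspec
import numpy as np

class EmbeddingReader:
    """Reads batches of embeddings from a large remote file without
    downloading it whole, using seek + bounded reads."""

    def __init__(self, path, dim, dtype=np.float16):
        self.path = path
        self.dim = dim
        self.dtype = np.dtype(dtype)
        self.row_size = dim * self.dtype.itemsize

    def read_batch(self, start_row, batch_size):
        with fsspec.open(self.path, "rb") as f:
            f.seek(start_row * self.row_size)
            raw = f.read(batch_size * self.row_size)
        return np.frombuffer(raw, dtype=self.dtype).reshape(-1, self.dim)
```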

rom1504 commented 2 years ago

independently, I think the approach in https://towardsdatascience.com/data-pipelines-with-apache-beam-86cd8eb55fd8 looks good
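
For reference, a sketch of the shape such a pipeline takes in apache beam (download_image and compute_embedding are placeholder functions, not existing img2dataset/clip-retrieval APIs):

```python
import apache_beam as beam

def download_image(url):
    return {"url": url, "image": b"<bytes>"}  # placeholder http download

def compute_embedding(sample):
    return {"url": sample["url"], "embedding": [0.0]}  # placeholder clip call

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadUrls" >> beam.Create(["http://example.com/a.jpg"])
        | "Download" >> beam.Map(download_image)
        | "Embed" >> beam.Map(compute_embedding)
        | "Write" >> beam.io.WriteToText("embeddings")
    )
```

beam would then take care of the scheduling, caching and back pressure concerns raised above.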