rom1504 / clip-retrieval

Easily compute clip embeddings and build a clip retrieval system with them
https://rom1504.github.io/clip-retrieval/
MIT License

Add warning about input_shard_per_output_shard performance #187

Open rom1504 opened 2 years ago

rom1504 commented 2 years ago

For good performance, the current implementation requires

input_shard_per_output_shard >= num_prepro_workers

For example:

num_prepro_workers = 8
sample_per_output_shard = 1000000
sample_per_input_shard = 10000
input_shard_per_output_shard = sample_per_output_shard / sample_per_input_shard
input_shard_per_output_shard = 100

In that case, the speed will be optimal only while 100 shards are still left to be processed. That means that for datasets smaller than 100 shards it will be slow, and if the dataset is exactly 100 shards, it will be fast initially and then get slower and slower.
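A warning for this condition could look roughly like the sketch below. It is only an illustration of the inequality above, not the actual clip-retrieval code: the function names `input_shards_per_output_shard` and `check_shard_ratio` are hypothetical, and rounding the ratio up is an assumption for the case where the sizes do not divide evenly.

```python
import warnings


def input_shards_per_output_shard(sample_per_output_shard: int, sample_per_input_shard: int) -> int:
    """Number of input shards needed to fill one output shard.

    Assumption: the ratio is rounded up when the sizes do not divide evenly.
    """
    return -(-sample_per_output_shard // sample_per_input_shard)


def check_shard_ratio(num_prepro_workers: int, sample_per_output_shard: int, sample_per_input_shard: int) -> bool:
    """Return True if input_shard_per_output_shard >= num_prepro_workers, else warn and return False.

    Hypothetical helper, written only to illustrate the condition in this issue.
    """
    ratio = input_shards_per_output_shard(sample_per_output_shard, sample_per_input_shard)
    if ratio < num_prepro_workers:
        warnings.warn(
            f"input_shard_per_output_shard ({ratio}) is smaller than "
            f"num_prepro_workers ({num_prepro_workers}); performance may be poor."
        )
        return False
    return True


# Values from the example above: the ratio is 100, which is >= 8 workers.
ok = check_shard_ratio(
    num_prepro_workers=8,
    sample_per_output_shard=1_000_000,
    sample_per_input_shard=10_000,
)
```

Note that this only checks the static condition. It does not detect the other case described above, where fewer than input_shard_per_output_shard shards remain in the dataset.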

Action item:

rom1504 commented 2 years ago

An option to consider may be to introduce the concept of tasks that contain multiple output shards and hence can keep reading the same input shards. That would solve this problem.
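One possible reading of this idea, as a rough sketch: group consecutive output shards into tasks, so that each task covers one contiguous range of input shards. Everything here is an assumption made for illustration (the `plan_tasks` name, the `output_shards_per_task` parameter, and the dictionary layout); none of it exists in clip-retrieval.

```python
def plan_tasks(num_output_shards: int, output_shards_per_task: int, input_shard_per_output_shard: int) -> list:
    """Group consecutive output shards into tasks (hypothetical layout).

    Each task records the output shards it produces and the contiguous
    range of input shards it would read to produce them.
    """
    tasks = []
    for start in range(0, num_output_shards, output_shards_per_task):
        # The last task may hold fewer output shards than the others.
        end = min(start + output_shards_per_task, num_output_shards)
        tasks.append(
            {
                "output_shards": list(range(start, end)),
                "input_shards": range(
                    start * input_shard_per_output_shard,
                    end * input_shard_per_output_shard,
                ),
            }
        )
    return tasks


# With the ratio of 100 from the example, 4 output shards grouped 2 per task
# give 2 tasks, each reading 200 consecutive input shards.
tasks = plan_tasks(num_output_shards=4, output_shards_per_task=2, input_shard_per_output_shard=100)
```

With larger tasks, each task has more input shards to spread across its workers, which is the property the requirement above depends on.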