Investigate what format to use to store embeddings+id

rom1504 / clip-retrieval

Easily compute clip embeddings and build a clip retrieval system with them

MIT License

Current format :

Numpy+parquet:

Benefit:

Drawback:

Ordered collections are not distribution friendly. Even though it is possible to keep ordering when doing distributed processing, it makes things significantly more complex.
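To make the ordering concern concrete, here is a minimal sketch, not the actual clip-retrieval code. It assumes embeddings live in a `.npy` file and ids in a separate file, linked only by row position. The file names and the helpers `write_shard` and `lookup` are hypothetical, and the id file is written with numpy as a stand-in for the parquet metadata file so the example depends on numpy alone.

```python
import os
import tempfile

import numpy as np


def write_shard(folder, shard_id, ids, embeddings):
    """Write one shard: embeddings and ids go to separate files, linked by row order only."""
    np.save(os.path.join(folder, f"emb_{shard_id}.npy"), embeddings)
    # Stand-in for the parquet metadata file (assumption made to avoid extra dependencies).
    np.save(os.path.join(folder, f"ids_{shard_id}.npy"), np.asarray(ids))


def lookup(folder, shard_id, row):
    """Return (id, embedding) for a row. Only correct if both files share the same order."""
    ids = np.load(os.path.join(folder, f"ids_{shard_id}.npy"))
    emb = np.load(os.path.join(folder, f"emb_{shard_id}.npy"))
    return ids[row].item(), emb[row]


def demo():
    rng = np.random.default_rng(0)
    embeddings = rng.standard_normal((4, 8)).astype(np.float32)
    ids = [10, 11, 12, 13]
    with tempfile.TemporaryDirectory() as folder:
        # Case 1: both files written in the same order, so row 2 maps to id 12.
        write_shard(folder, 0, ids, embeddings)
        found_id, found_emb = lookup(folder, 0, 2)
        aligned = found_id == 12 and np.array_equal(found_emb, embeddings[2])

        # Case 2: simulate a distributed step that returns the ids in another order.
        # Nothing in the files detects this; row 0 now silently maps to the wrong id.
        write_shard(folder, 1, [13, 10, 12, 11], embeddings)
        wrong_id, _ = lookup(folder, 1, 0)
        misaligned = wrong_id != ids[0]
    return aligned, misaligned
```

Calling `demo()` shows that the pairing holds only while every processing step preserves row order, which is the extra complexity a distributed pipeline has to carry.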

Parquet with embeddings:

Benefit:

Drawback:

Embeddings in parquet are represented as a variable-length array, which is not efficient
Slow to read
It doesn't make sense to use a columnar format to read data sequentially with all columns

What alternatives exist to store embeddings+id?
