Current formats :
Numpy matrix + parquet for ids (ordered collections)
Parquet with embeddings + id
Numpy+parquet :
Benefit:
Fast to read the numpy file alone
Fast to read the parquet file alone
Drawback:
Ordered collections are not distribution friendly. Even though it is possible to keep the ordering when doing distributed processing, it makes things significantly more complex
Parquet with embeddings :
Benefit:
No ordering needed
Easy to filter / use normal processing tools (parquet, beam, pandas)
Drawback:
Embeddings in parquet are represented as a variable-length array, which is not efficient
Slow to read
A columnar format brings no benefit when the data is read sequentially with all columns
What alternatives exist for storing embeddings + ids?