rom1504 / embedding-reader

Efficiently read embedding in streaming from any filesystem
MIT License

[question] Multiple vs single parquet/np files for embeddings #38

Closed apsdehal closed 7 months ago

apsdehal commented 1 year ago

In practice, when we are processing embeddings at billion scale, does it matter if we have multiple parquet/np files vs single?

This SO link has some good comments, but I wanted to get your opinion.

rom1504 commented 7 months ago

It depends on your file system. If it supports range requests, the two layouts are equivalent, since you can fetch multiple pieces of a single file in parallel. For Parquet files this is less true, because you can only split at row group boundaries.
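To illustrate the range-request point, here is a minimal sketch using only the Python standard library. It is not how embedding-reader is implemented. It assumes a flat float32 file on a local disk, where `seek` plus `read` stands in for a range request. The names `write_embeddings`, `read_range`, `read_parallel`, `DIM`, and `COUNT` are hypothetical and exist only for this example.

```python
import os
import tempfile
from array import array
from concurrent.futures import ThreadPoolExecutor

DIM = 4        # embedding dimension (illustrative assumption)
COUNT = 1000   # number of embeddings (illustrative assumption)
ITEM_SIZE = array("f").itemsize   # bytes per float32 value
ROW_BYTES = DIM * ITEM_SIZE       # bytes per embedding row


def write_embeddings(path):
    """Write COUNT x DIM float32 values into one flat binary file."""
    values = array("f", (float(i) for i in range(COUNT * DIM)))
    with open(path, "wb") as f:
        values.tofile(f)


def read_range(path, start_row, end_row):
    """Read rows [start_row, end_row) with one seek + read (a local 'range request')."""
    with open(path, "rb") as f:
        f.seek(start_row * ROW_BYTES)
        data = f.read((end_row - start_row) * ROW_BYTES)
    chunk = array("f")
    chunk.frombytes(data)
    return chunk


def read_parallel(path, pieces=4):
    """Split the single file into row ranges and read them concurrently."""
    step = -(-COUNT // pieces)  # ceiling division
    ranges = [(s, min(s + step, COUNT)) for s in range(0, COUNT, step)]
    with ThreadPoolExecutor(max_workers=pieces) as pool:
        chunks = list(pool.map(lambda r: read_range(path, *r), ranges))
    out = array("f")
    for chunk in chunks:  # pool.map preserves the order of the ranges
        out.extend(chunk)
    return out


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "embeddings.bin")
    write_embeddings(path)
    embeddings = read_parallel(path, pieces=4)
    print(len(embeddings) // DIM, "embeddings read from a single file")
```

Any row offset is a valid split point in a flat file like this, so one large file can be read in as many parallel pieces as you like. With Parquet, the split points are limited to row group boundaries, so the row group size chosen at write time caps how finely a single file can be divided.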