Closed by weiwangorg 3 years ago
Hi! The Arrow file size is due to the embeddings. Indeed, if they're stored as float32, then the total size of the embeddings is
20 000 000 vectors * 768 dimensions * 4 bytes per dimension ~= 60GB
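The estimate above can be checked with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, and it counts the raw float32 values only, ignoring any other columns or metadata in the file.

```python
# Back-of-the-envelope size of the raw embedding data.
num_vectors = 20_000_000   # sentences, as stated in the thread
dims = 768                 # embedding dimension used in the thread
bytes_per_value = 4        # float32

total_bytes = num_vectors * dims * bytes_per_value

# Report in both decimal gigabytes and binary gibibytes, since tools
# differ in which unit they display.
print(f"{total_bytes / 1e9:.1f} GB")
print(f"{total_bytes / 2**30:.1f} GiB")
```

Both figures sit close to the reported 59GB, so nearly all of the file is the embeddings themselves.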
If you want to reduce the size you can consider using quantization for example, or maybe using dimension reduction techniques.
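As a rough illustration of the quantization suggestion, here is a minimal NumPy sketch of two common options: casting to float16, and int8 scalar quantization with one scale per vector. The random array stands in for real embeddings, and all names are hypothetical. Both options are lossy, so neither keeps the original float32 values exactly.

```python
import numpy as np

# Stand-in data: 1000 random vectors in place of real sentence embeddings.
rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 768)).astype(np.float32)

# Option 1: half precision. Halves the size, keeps roughly 3 significant digits.
emb_fp16 = emb.astype(np.float16)

# Option 2: int8 scalar quantization with one float32 scale per vector.
# Assumes no vector is all zeros (otherwise the scale would be 0).
scale = np.abs(emb).max(axis=1, keepdims=True) / 127.0
emb_int8 = np.round(emb / scale).astype(np.int8)

# Approximate reconstruction for later computations.
emb_restored = emb_int8.astype(np.float32) * scale

print("float32 bytes:", emb.nbytes)
print("float16 bytes:", emb_fp16.nbytes)
print("int8 bytes   :", emb_int8.nbytes + scale.nbytes)
print("max abs error (int8):", float(np.abs(emb - emb_restored).max()))
```

Whether the reconstruction error is acceptable depends on the downstream calculation, so it is worth measuring on a sample of real embeddings before converting the full dataset.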
Thanks for your reply @lhoestq. I want to save the original embeddings of these sentences for subsequent calculations. Does Arrow have a way to save in a compressed format to reduce the size of the file?
Arrow doesn't have compression, since it is designed to have no serialization overhead.
I see. Thank you.
I computed the sentence embedding of each sentence of bookcorpus data using bert base and saved them to disk. I used 20M sentences and the obtained arrow file is about 59GB while the original text file is only about 1.3GB. Are there any ways to reduce the size of the arrow file?