huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Arrow file is too large when saving vector data #1662

Closed by weiwangorg 3 years ago

weiwangorg commented 3 years ago

I computed sentence embeddings for each sentence of the BookCorpus data using BERT base and saved them to disk. I used 20M sentences, and the resulting Arrow file is about 59 GB, while the original text file is only about 1.3 GB. Are there any ways to reduce the size of the Arrow file?
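
For context, here is a minimal sketch of that kind of pipeline (the checkpoint `bert-base-uncased`, the mean pooling, and the column names are assumptions for illustration, not the actual script):

```python
import torch
from transformers import AutoTokenizer, AutoModel
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

def embed(batch):
    inputs = tokenizer(batch["text"], padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the last hidden state into one 768-dim float32 vector per sentence
    # (padding tokens are included in the mean here for brevity)
    batch["embedding"] = outputs.last_hidden_state.mean(dim=1).numpy()
    return batch

dataset = load_dataset("bookcorpus", split="train")
dataset = dataset.map(embed, batched=True, batch_size=256)
dataset.save_to_disk("bookcorpus_embeddings")  # writes the Arrow file(s) to disk
```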

lhoestq commented 3 years ago

Hi! The Arrow file size is due to the embeddings. Indeed, if they're stored as float32, then the total size of the embeddings is

20,000,000 vectors × 768 dimensions × 4 bytes per dimension ≈ 60 GB
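
The same back-of-the-envelope arithmetic in plain Python, using only the numbers above:

```python
n_vectors, dims, bytes_per_float32 = 20_000_000, 768, 4
total_bytes = n_vectors * dims * bytes_per_float32
print(f"{total_bytes / 1024**3:.1f} GiB")  # ~57.2 GiB, i.e. roughly 60 GB
```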

If you want to reduce the size, you could consider quantization, for example, or dimensionality reduction techniques.
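
As a rough illustration of those two options (not a feature of the datasets library itself): casting to float16 halves the size, and dimensionality reduction shrinks it further at the cost of some fidelity. The 128-dimension target and the use of scikit-learn's PCA below are arbitrary choices for the sketch:

```python
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.rand(10_000, 768).astype(np.float32)  # stand-in for the real vectors

half_precision = embeddings.astype(np.float16)              # 2 bytes per dimension instead of 4
reduced = PCA(n_components=128).fit_transform(embeddings)   # 128 dimensions instead of 768

print(half_precision.nbytes / embeddings.nbytes)                     # 0.5
print(reduced.astype(np.float32).nbytes / embeddings.nbytes)         # ~0.17
```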

weiwangorg commented 3 years ago

Thanks for your reply @lhoestq. I want to keep the original embeddings of these sentences for subsequent calculations. So does Arrow have a way to save them in a compressed format to reduce the size of the file?

lhoestq commented 3 years ago

Arrow doesn't have compression, since it is designed to have no serialization overhead.

weiwangorg commented 3 years ago

I see. Thank you.