huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0
134.85k stars 26.97k forks

Arrow file is too large when saving vector data #9339

Closed weiwangorg closed 3 years ago

weiwangorg commented 3 years ago

I computed the sentence embedding of each sentence of the BookCorpus data using BERT-base and saved them to disk. I used 20M sentences, and the resulting Arrow file is about 59 GB, while the original text file is only about 1.3 GB. Are there any ways to reduce the size of the Arrow file?

patil-suraj commented 3 years ago

I think it's expected, because with bert-base each token is embedded as a 768-dimensional vector. So if an example has n tokens, the embedding has size n×768, and these are all 32-bit floating-point numbers.
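A back-of-the-envelope check makes the reported size plausible. Assuming one 768-dimensional float32 vector per sentence (the minimal case, before any per-token embeddings), 20M sentences already come out near the 59 GB figure:

```python
# Rough size estimate: one 768-dim float32 embedding per sentence.
# These numbers come from the issue; the single-vector-per-sentence
# assumption is mine for illustration.
num_sentences = 20_000_000
dim = 768
bytes_per_float32 = 4

total_bytes = num_sentences * dim * bytes_per_float32
print(total_bytes / 1e9)  # ~61.4 GB, in the ballpark of the reported 59 GB
```

If per-token embeddings were stored instead, the file would be larger still, scaling with the average sentence length.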

weiwangorg commented 3 years ago

Yes. I use 🤗 Datasets, and I think this is really a question for the datasets repo: how to save vector data in a compressed format to reduce the file size. So I am closing this issue.