Closed by weiwangorg 3 years ago
I think it's expected: with bert-base, each token is embedded as a 768-dimensional vector, so if an example has n tokens, its embedding has size n × 768, and all of these values are 32-bit floating-point numbers.
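Concretely, that arithmetic works out as below (the example length of 128 tokens is hypothetical):

```python
# Token-level bert-base embeddings stored as float32.
hidden_size = 768      # bert-base hidden size
bytes_per_value = 4    # 32-bit float
n_tokens = 128         # hypothetical example length

print(n_tokens * hidden_size * bytes_per_value)  # 393216 bytes, ~384 KiB for one example
```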
Yes. I use datasets, and I think this is really a question about datasets: how to save vector data in a compressed format to reduce the file size. So I am closing this issue.
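A minimal sketch of one way to shrink such a file, assuming the embeddings are available as a NumPy array (the array contents and file name here are hypothetical): cast to float16 to halve the storage, then save with generic compression on top.

```python
import numpy as np

# Stand-in for the real embeddings: a float32 array of shape (n_sentences, 768).
embeddings = np.random.rand(1000, 768).astype(np.float32)

# Cast to float16: halves the size, usually with negligible impact
# on downstream similarity / retrieval tasks.
embeddings_fp16 = embeddings.astype(np.float16)

# Save with zlib compression added on top of the dtype reduction.
np.savez_compressed("embeddings.npz", embeddings=embeddings_fp16)

# Reload later.
loaded = np.load("embeddings.npz")["embeddings"]
```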
I computed a sentence embedding for each sentence of the bookcorpus data using bert-base and saved them to disk. I used 20M sentences, and the resulting Arrow file is about 59GB, while the original text file is only about 1.3GB. Are there any ways to reduce the size of the Arrow file?
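For reference, a back-of-the-envelope estimate, assuming one 768-dimensional float32 vector is stored per sentence (rather than per token), already lands in that range:

```python
# Rough storage estimate for 20M sentence embeddings as float32.
num_sentences = 20_000_000
embedding_dim = 768     # bert-base hidden size
bytes_per_value = 4     # float32

total_bytes = num_sentences * embedding_dim * bytes_per_value
print(total_bytes / 1e9)  # ~61.4 GB, in the same ballpark as the ~59GB Arrow file
```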