kssteven418 / I-BERT

[ICML'21 Oral] I-BERT: Integer-only BERT Quantization
https://arxiv.org/abs/2101.01321
MIT License
226 stars · 32 forks

Storing both float32 and int parameters #22

Open huu4ontocord opened 2 years ago

huu4ontocord commented 2 years ago

Hi

It looks like, at least in the HF code, you are storing both the float32 AND the int weights, which increases the memory footprint. Wouldn't you want to load only one or the other? Or at least provide an option to quantize and move to CUDA, clearing the float32 (or int) version in the process, so the memory footprint goes down? Alternatively, you could overload `to()` (or `cuda()`, or whichever method is used to move the model to the device) so that only the right set of parameters is transferred.
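As a rough illustration of what I mean (this is a toy sketch, not the actual I-BERT/HF layer; the names `weight_fp32`, `weight_int8`, and `free_fp32` are mine), the float32 copy could be dropped once the int8 copy exists, before moving to the device:

```python
import torch
import torch.nn as nn

class QuantLinear(nn.Module):
    """Toy stand-in for a quantized layer that keeps both a float32
    weight and its int8 counterpart (names are illustrative only)."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight_fp32 = nn.Parameter(torch.randn(out_features, in_features))
        # Simple symmetric quantization to int8, stored as a buffer.
        scale = self.weight_fp32.detach().abs().max() / 127.0
        q = torch.clamp((self.weight_fp32.detach() / scale).round(), -128, 127)
        self.register_buffer("weight_int8", q.to(torch.int8))
        self.register_buffer("scale", scale)

    def free_fp32(self):
        # Release the float32 weight now that the int8 copy exists.
        del self._parameters["weight_fp32"]

def quantize_and_move(model, device="cpu"):
    """Drop the float copies in every quantized submodule, then move
    only what remains (the int buffers) to the target device."""
    for m in model.modules():
        if hasattr(m, "free_fp32"):
            m.free_fp32()
    return model.to(device)
```

The same effect could probably also be had by overriding `_apply` (which backs `.to()` and `.cuda()`) to skip the float32 parameters, but explicitly freeing them after quantization seems simpler.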

Thanks