mobiusml / hqq

Official implementation of Half-Quadratic Quantization (HQQ)
https://mobiusml.github.io/hqq_blog/
Apache License 2.0

Weight Sharding #100

Closed · winglian closed this 2 weeks ago

winglian commented 1 month ago

I'm trying to quantize the 405B model, but I'm unable to upload it to HF: the quantized weights are ~200 GB, and HF LFS has a 50 GB per-file limit. Is there a correct way to shard the model file so it can be loaded again with AutoHQQ?

mobicham commented 1 month ago

There's an ongoing pull request for sharded safetensors serialization: https://github.com/huggingface/transformers/pull/32379. Once it's merged, it will be possible to save HQQ-quantized models directly via model.save_pretrained as sharded safetensors.
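Once that lands, sharding should be controllable through the standard max_shard_size argument of save_pretrained, so each shard stays under the LFS limit. A sketch under that assumption (the output directory, shard size, and repo id are illustrative, and `model` is an already HQQ-quantized model):

```python
# Hypothetical usage once the linked PR is merged: save an HQQ-quantized
# model as sharded safetensors small enough for HF LFS's 50 GB file cap.
model.save_pretrained(
    "llama-405b-hqq",          # illustrative output directory
    safe_serialization=True,   # write .safetensors shards
    max_shard_size="10GB",     # keep each shard well under 50 GB
)

# push_to_hub then uploads the shards together with the index file:
# model.push_to_hub("your-org/llama-405b-hqq")
```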

mobicham commented 2 weeks ago

Closing this since we are very close to full transformers serialization support here: https://github.com/huggingface/transformers/pull/33141
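Once that PR is in, loading a sharded HQQ checkpoint should go through the usual transformers entry point, which resolves the sharded safetensors index automatically. A sketch assuming that support (the repo id is illustrative):

```python
# Hypothetical once full transformers serialization support lands.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "your-org/llama-405b-hqq",  # illustrative repo id
    device_map="auto",
)
```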