neuralmagic / sparseml

Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models
Apache License 2.0

[MOE Quantization] Update transformers version to 4.40.0 #2268

Closed dbogunowicz closed 4 months ago

dbogunowicz commented 4 months ago

We need to update the transformers version to support the Qwen2-MoE model, see: https://github.com/huggingface/transformers/releases/tag/v4.40.0 (it also fits our goal of consistently matching the latest release).
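For context, a quick sanity check along these lines (the version pin and import below are my own illustration, not part of this PR) confirms that the installed release actually ships the new architecture:

```python
# Illustrative sanity check, not part of the PR: verify that the installed
# transformers release is new enough to provide the Qwen2-MoE architecture.
import transformers
from packaging import version

assert version.parse(transformers.__version__) >= version.parse("4.40.0")

# Qwen2MoeForCausalLM was introduced in transformers 4.40.0
from transformers import Qwen2MoeForCausalLM  # noqa: F401
```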

Important changes

By default, untie the vocab embedding weights

transformers 4.40.x now has difficulty saving post-one-shot models. This applies to models that are 1) quantized using the new (non-vLLM) QuantizationModifiers and 2) in the "fakequant" state. HF developers changed the internal logic for resolving "tied weights" (such as the embedding and lm_head modules) on save_pretrained, and those decisions are buried quite deep in the transformers codebase. My solution is to untie the weights for one-shot models on initialization. Benefits: it requires minimal changes on our side and does not influence the size of saved models on disk (safetensors unties the weights as well). Downsides: the one-shot process might be slightly less performant, since the memory required to store the embedding layer on CUDA is doubled.
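As a rough sketch of the idea (the helper name and model ID below are illustrative, not SparseML's actual implementation), untying amounts to giving the output head its own copy of the embedding weights and disabling re-tying in the config:

```python
# Minimal sketch, not SparseML's actual code: untie the lm_head from the
# input embedding so save_pretrained no longer treats them as tied weights.
import torch
from transformers import AutoModelForCausalLM


def untie_word_embeddings(model):
    input_embeddings = model.get_input_embeddings()
    output_embeddings = model.get_output_embeddings()
    if output_embeddings is not None and output_embeddings.weight is input_embeddings.weight:
        # give the output head its own copy of the shared weight
        output_embeddings.weight = torch.nn.Parameter(
            input_embeddings.weight.detach().clone()
        )
    # prevent transformers from re-tying the modules on save/resize
    model.config.tie_word_embeddings = False
    return model


# model ID is just an example of a Qwen2-MoE checkpoint
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-MoE-A2.7B")
model = untie_word_embeddings(model)
```

The clone is what doubles the embedding memory on CUDA mentioned above; the on-disk size is unaffected because the two tensors are serialized separately either way.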

dbogunowicz commented 4 months ago

@mgoin the failure in export-tests looks transient, just FYI