We need to update the transformers version to support the QWEN2-MOE model, see: https://github.com/huggingface/transformers/releases/tag/v4.40.0 (it also fits our goal of consistently matching the latest release).
Important changes
By default, untie the vocab embedding weights
`transformers` 4.40.X now has difficulty saving post-one-shot models. This applies to models that are 1) quantized using the new (non-vLLM) QuantizationModifiers and 2) in the "fakequant" state. HF developers changed the internal logic for resolving "tied weights" (such as the embedding and `lm_head` modules) on `save_pretrained`, and those decisions are buried quite deep in the transformers codebase. My solution is to untie the weights of one-shot models on initialization. Benefits: it requires minimal changes on our side and does not influence the size of saved models on disk (safetensors unties weights as well). Downsides: the one-shot process might be slightly less performant because the memory required to store the embedding layer in CUDA is doubled.
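A minimal sketch of what untying on initialization could look like for a standard `transformers` causal LM; the helper name `untie_word_embeddings` and the example checkpoint are illustrative and not the exact code in this PR:

```python
import torch
from transformers import AutoModelForCausalLM


def untie_word_embeddings(model):
    """Give the output embedding its own copy of the shared weight tensor."""
    input_embed = model.get_input_embeddings()
    output_embed = model.get_output_embeddings()

    # Only act if the two modules actually share storage (tied weights).
    if output_embed is not None and output_embed.weight is input_embed.weight:
        # Cloning breaks the shared storage; this roughly doubles the memory
        # held by the embedding weights on the device.
        output_embed.weight = torch.nn.Parameter(input_embed.weight.detach().clone())

    # Prevent transformers from re-tying the modules on save/load.
    model.config.tie_word_embeddings = False
    return model


# Example usage with a small checkpoint that ties its word embeddings.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model = untie_word_embeddings(model)
```

With the weights untied up front, `save_pretrained` in 4.40.X no longer has to resolve them as tied parameters, and safetensors serializes the two tensors separately as it already would.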