huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Add quantization_config in AutoModelForCausalLM.from_config() #26901

Open ishaansharma opened 1 year ago

ishaansharma commented 1 year ago

Feature request

Add a quantization_config argument to AutoModelForCausalLM.from_config(). I am trying to pretrain a model from scratch and use bitsandbytes so that it can be trained on less computationally expensive machines. Below is my quantization config:

import torch
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization and bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

When I attempted to build the model from a config obtained via AutoConfig.from_pretrained and pass my quantization config, it failed and raised the TypeError shown below.

from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_config(config, quantization_config=bnb_config, device_map={"": 0})

The Error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[23], line 7
      3 # Download configuration from huggingface.co and cache.
      5 configy = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
----> 7 modely = AutoModelForCausalLM.from_config(configy,quantization_config=bnb_config, device_map={"":0})

File ~/miniconda3/envs/ai/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:441, in _BaseAutoModelClass.from_config(cls, config, **kwargs)
    439 elif type(config) in cls._model_mapping.keys():
    440     model_class = _get_model_class(config, cls._model_mapping)
--> 441     return model_class._from_config(config, **kwargs)
    443 raise ValueError(
    444     f"Unrecognized configuration class {config.__class__} for this kind of AutoModel: {cls.__name__}.\n"
    445     f"Model type should be one of {', '.join(c.__name__ for c in cls._model_mapping.keys())}."
    446 )

File ~/miniconda3/envs/ai/lib/python3.10/site-packages/transformers/modeling_utils.py:1192, in PreTrainedModel._from_config(cls, config, **kwargs)
   1190         model = cls(config, **kwargs)
   1191 else:
-> 1192     model = cls(config, **kwargs)
   1194 # restore default dtype if it was modified
   1195 if dtype_orig is not None:

TypeError: MistralForCausalLM.__init__() got an unexpected keyword argument 'quantization_config'

Motivation

I tried a workaround: build the model from the loaded config, save it to disk, and then load the saved model with the quantization config.

I believe this could be streamlined so that quantization can be enabled while loading the model from the config itself.

Your contribution

from transformers import AutoConfig, AutoModelForCausalLM

# Build the model (with random weights) from the config and save it to disk
config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_config(config)
model.save_pretrained(MODEL_NAME_PATH)

# Reload the saved model, this time with the quantization config defined above
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME_PATH, quantization_config=bnb_config, device_map={"": 0})
ArthurZucker commented 1 year ago

WDYT @younesbelkada

younesbelkada commented 1 year ago

Hi @ishaansharma, thanks a lot for the proposal! I personally would not advocate for that route: the quantization schemes we support right now consist of post-training quantization, meaning the use case is always:

1. load pre-trained weights from the Hub or locally
2. quantize the pre-trained weights

The API you propose is cool, but I am afraid it will not be used in practice, as from_config loads random weights into the model. Let me know if I misunderstood anything!
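For context, the supported post-training flow looks roughly like the following sketch, reusing the 4-bit config from the issue description and assuming a single GPU at index 0:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Same 4-bit NF4 config as in the issue description
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Pre-trained weights are fetched from the Hub (or a local path) and
# quantized to 4-bit as they are loaded onto GPU 0.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map={"": 0},
)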

ishaansharma commented 1 year ago

> Hi @ishaansharma, thanks a lot for the proposal! I personally would not advocate for that route: the quantization schemes we support right now consist of post-training quantization, meaning the use case is always:
>
> 1. load pre-trained weights from the Hub or locally
> 2. quantize the pre-trained weights
>
> The API you propose is cool, but I am afraid it will not be used in practice, as from_config loads random weights into the model. Let me know if I misunderstood anything!

1. I wanted this feature because it would be very useful for pre-training any large language model with a huge number of parameters from scratch, which usually cannot be done on small machines with limited compute.

2. To pre-train a model from scratch and build a language model for a totally new language, I don't think the random weights loaded from the config will cause any harm, as the weights will eventually be updated during training.

@younesbelkada, I just want pre-training a model for any language from scratch, using any LLM architecture, to be possible on any machine.

Let me know if this approach helps.

Warm regards.

younesbelkada commented 1 year ago

Thanks for getting back to me, @ishaansharma!

> I wanted this feature because it would be very useful for pre-training any large language model with a huge number of parameters from scratch, which usually cannot be done on small machines with limited compute.

Since you cannot perform full fine-tuning when the model is quantized, I think this is technically not possible :/ The same comment also applies to your point here:

> To pre-train a model from scratch and build a language model for a totally new language, I don't think the random weights loaded from the config will cause any harm, as the weights will eventually be updated during training.

BramVanroy commented 8 months ago

I have a similar use case, but I want to load huge models efficiently, so I've been following this guide, which first loads the empty model from a config and then loads the state into the empty model. But I do not understand how we can add other parameters (like load_in_8bit) to this process: from_config does not support such kwargs, and neither does load_checkpoint_and_dispatch. So is that simply not possible in this kind of workflow? How else would one efficiently and quickly load a model in 8-bit? @younesbelkada
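For reference, a minimal sketch of that workflow under stated assumptions (Mistral-7B as the architecture, CHECKPOINT_PATH as a placeholder for a local sharded checkpoint); note that neither call exposes a quantization kwarg, which is the gap being described:

from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

CHECKPOINT_PATH = "/path/to/sharded/checkpoint"  # placeholder

config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")

# Build the model skeleton on the meta device, without allocating real weights
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Load the checkpoint shards and dispatch the layers across the available devices
model = load_checkpoint_and_dispatch(
    model,
    checkpoint=CHECKPOINT_PATH,
    device_map="auto",
    no_split_module_classes=model._no_split_modules,
)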

janEbert commented 3 weeks ago

Hey, I stumbled upon the same issue; I would've liked to be able to supply a device_map to AutoModel.from_config. :)

LysandreJik commented 3 weeks ago

cc @SunMarc

SunMarc commented 3 weeks ago

Hey @janEbert, what would be the use case for loading the model with from_config and a device_map? A workaround is to save the model loaded with from_config and then use from_pretrained to load it again.

If you want to quantize the model loaded with from_config, please read the points that younes shared above. Thanks!
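A short sketch of that workaround (the model name and save directory are placeholders), relying on from_pretrained, which does accept device_map, unlike from_config:

from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_config(config)     # random weights, materialized on CPU
model.save_pretrained("randomly-initialized-model")  # write the checkpoint to disk

# from_pretrained accepts device_map (and quantization_config), unlike from_config
model = AutoModelForCausalLM.from_pretrained("randomly-initialized-model", device_map="auto")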

janEbert commented 3 weeks ago

The use case is to have the model properly distributed automatically. The workaround does work but is extremely hacky and ugly, if I'm completely honest. 😅 Cheers for the suggestion, though!

SunMarc commented 3 weeks ago

> The use case is to have the model properly distributed automatically

We recommend using device_map for inference, but it might not be very useful on a model with random weights.

Nevertheless, the algorithm behind device_map requires us to have the loaded weights somewhere. When using from_config, we initialize the weights from the model definition and not from a file stored on the Hub. If you can load the entire model on the CPU, you can use the dispatch_model function to distribute the model across your GPUs.

from transformers import AutoConfig, AutoModelForCausalLM
from accelerate import dispatch_model, infer_auto_device_map

config = AutoConfig.from_pretrained("model")
model = AutoModelForCausalLM.from_config(config)

# Infer a device_map from the CPU-resident model, keeping each block on a single device
device_map = infer_auto_device_map(model, no_split_module_classes=model._no_split_modules)

# Move the model's modules to the devices chosen in device_map
dispatch_model(model, device_map)

LMK if this works for you! You can find more information on how device_map works here.

janEbert commented 3 weeks ago

Thanks a lot for the infer_auto_device_map and dispatch_model pointers! As you can tell, I would like to avoid loading the model on the CPU first, so that I'm not limited by RAM with respect to the model size.

Sorry for not giving enough information in the first place. My use case is that I want to convert a model from custom code to HF "stdlib" code. The converted model is instantiated via from_config from a converted config, and then I load the converted state dict into it. However, since device_map is not supported with from_config, I am limited by CPU RAM. Even your really nice suggestions don't help in that case; not even the first one, since I'd still have to be able to instantiate the model on a single node's CPU first. :/
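Roughly, the conversion workflow described above looks like this sketch, where convert_config and convert_state_dict are hypothetical stand-ins for the user's own conversion helpers:

from transformers import AutoModelForCausalLM

# convert_config / convert_state_dict are hypothetical helpers that map the
# custom-code config and weights to their Hugging Face equivalents.
hf_config = convert_config(custom_config)

# from_config materializes the full model in CPU RAM, which is the limiting factor here
model = AutoModelForCausalLM.from_config(hf_config)
model.load_state_dict(convert_state_dict(custom_state_dict))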

SunMarc commented 3 weeks ago

> Even your really nice suggestions don't help in that case; not even the https://github.com/huggingface/transformers/issues/26901#issuecomment-2422621147, since I'd still have to be able to instantiate the model on a single node's CPU first. :/

How big is the model? The checkpoint should be sharded, so loading should only take about max_shard_size of memory at a time. I think that in save_pretrained we set the default shard size to 5GB. Also, if the model is in safetensors format, we should be able to load the weights directly onto the GPU without going through the CPU.
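Continuing the sketch above, the sharded save-and-reload path might look as follows (assuming the converted model can be materialized on CPU once; the directory name is a placeholder):

from transformers import AutoModelForCausalLM

# Save the converted model as sharded safetensors files (5GB shards by default)
model.save_pretrained("converted-model", max_shard_size="5GB", safe_serialization=True)

# Reload shard by shard, placing the weights directly on the available GPUs
model = AutoModelForCausalLM.from_pretrained("converted-model", device_map="auto")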

janEbert commented 3 weeks ago

Would this work in a multi-node setting as well? The model is too big to fit on one node; sorry that wasn't clear.

SunMarc commented 2 weeks ago

So this doesn't work in a multi-node setting. However, we are working on making transformers models compatible with the PP/TP methods from PyTorch, which do work across multiple nodes!

ArthurZucker commented 2 weeks ago

#34184 for the linked PR! 🤗