I am trying to port a `transformers`-based `AutoModelForCausalLM` setup to `optimum.nvidia` and I hit an `OutOfMemory` error. I assume I need to add a `quantization_config`, as I do with plain `transformers`:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    llm_int8_threshold=200.0,
    load_in_8bit=True,
)
```
and pass it in:
```python
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)
```
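For reference, here is a minimal sketch of what I expect the `optimum.nvidia` side to look like. This is an assumption on my part: `optimum-nvidia` advertises a drop-in `AutoModelForCausalLM`, and its README shows a `use_fp8` flag as the quantization knob, but I don't know whether it accepts a bitsandbytes-style `quantization_config` like the one above:

```python
# Hypothetical port (assumption): optimum-nvidia advertises a drop-in
# AutoModelForCausalLM; use_fp8 is the quantization flag shown in its README.
from optimum.nvidia import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    use_fp8=True,  # fp8 quantization; unclear if BitsAndBytesConfig is supported here
)
```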
The script I run:
The error: