Closed: leszkolukasz closed this issue 4 months ago.
I'll try to check what is going on. In the meantime, I recommend using the bitsandbytes integration in transformers directly, with AutoModelForCausalLM.from_pretrained(self.model_name, load_in_8bit=True, device_map="auto").
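Roughly, a self-contained version of that suggestion looks like this (the model name below is just a placeholder, not something from this thread):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-7b1"  # placeholder checkpoint; substitute your own
tokenizer = AutoTokenizer.from_pretrained(model_name)
# load_in_8bit routes the linear layers through bitsandbytes' Int8 kernels,
# and device_map="auto" lets accelerate place the layers across available devices
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto",
)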
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@SunMarc Just curious, fundamentally why does load_and_quantize_model() use more memory than AutoModelForCausalLM.from_pretrained(self.model_name, load_in_8bit=True, device_map="auto")?
In the following two examples, the first one succeeds and the second one fails with OOM on the GPU.
Example 1:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=quantization_config,
)
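(For reference, a quick generation call like the one below can be used to sanity-check the model once it loads; the prompt and generation settings are arbitrary.)

prompt = "def fibonacci(n):"  # arbitrary prompt, just to confirm the model responds
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))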
Example 2:

import torch
from accelerate import init_empty_weights
from accelerate.utils import BnbQuantizationConfig, load_and_quantize_model
from transformers import AutoConfig, AutoModelForCausalLM

checkpoint = MODEL_PATH
config = AutoConfig.from_pretrained(checkpoint)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

bnb_quantization_config = BnbQuantizationConfig(
    load_in_4bit=True,
    torch_dtype=torch.bfloat16,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)
quantized_model = load_and_quantize_model(
    model,
    weights_location=MODEL_PATH,
    bnb_quantization_config=bnb_quantization_config,
    device_map="auto",
)
quantized_model
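(To make the memory comparison concrete, the peak GPU usage of each path can be measured with a sketch like the one below; run each example in a separate process and reset the stats first so the numbers stay comparable. On multi-GPU setups these calls report per-device numbers, so loop over the devices.)

import torch

torch.cuda.reset_peak_memory_stats()
# ... run either Example 1 or Example 2 here ...
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
print(f"peak reserved:  {torch.cuda.max_memory_reserved() / 1e9:.2f} GB")
# for transformers models, this reports the size of the loaded weights themselves
print(f"weight footprint: {model.get_memory_footprint() / 1e9:.2f} GB")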
Any insights are appreciated.
Hi @pinchedsquare, there shouldn't be any big difference. The goal of load_and_quantize_model is to enable any PyTorch model to be quantized with bnb, not just models from the transformers library. I'll have a look at this strange OOM asap.
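(To illustrate that point, here is a minimal sketch of quantizing a plain torch.nn.Module with load_and_quantize_model; the module and the checkpoint path are made up for the example.)

import torch
import torch.nn as nn
from accelerate import init_empty_weights
from accelerate.utils import BnbQuantizationConfig, load_and_quantize_model

class TinyMLP(nn.Module):  # any torch.nn.Module works, not only transformers models
    def __init__(self, dim=4096):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

with init_empty_weights():
    model = TinyMLP()

bnb_config = BnbQuantizationConfig(load_in_8bit=True)
quantized = load_and_quantize_model(
    model,
    bnb_quantization_config=bnb_config,
    weights_location="path/to/tiny_mlp_checkpoint",  # hypothetical saved state dict
    device_map="auto",
)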
Will await your response. Thanks.
I am trying to quantize and save a big model; however, there are a few issues with that. While it does manage to load the model, it fails when trying to save it with the following error:
I remember having a similar issue before on another model, and I think I solved it by changing device_map from a custom dict to "auto". It does not help in this case, though.
When I try to run inference on the model, two issues arise. If I don't set max_memory to very low values, I get a CUDA out of memory error. If I heavily limit max_memory, I get:
I have almost identical code that I used to quantize DeepSeek Coder 33B, and it worked there. However, when I tried to run the quantized version of DeepSeek Coder 33B on a V100 (now I am running on an A100), I had problems with CUDA out of memory. Note that I had no memory problems when loading the unquantized version of DeepSeek Coder 33B (loaded with load_checkpoint_and_dispatch) on the V100, and inference worked as well, though it was very slow.
I am not sure if I just don't have enough memory to run this, or if it is caused by something else.
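(For context, the general shape of a load_checkpoint_and_dispatch call with max_memory limits looks like the sketch below; the per-device limits and the no_split_module_classes value are placeholders, not the exact settings used here.)

from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(MODEL_PATH)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

model = load_checkpoint_and_dispatch(
    model,
    checkpoint=MODEL_PATH,
    device_map="auto",
    # cap how much each device may receive; anything left over is offloaded to CPU
    max_memory={0: "30GiB", "cpu": "96GiB"},
    no_split_module_classes=["LlamaDecoderLayer"],  # adjust to the model's block class
)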