hibukipanim opened this issue 5 days ago
@hibukipanim Thank you for opening an issue! Could you please share `model.hf_device_map` of the model after you load it? Thanks!
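For example, right after loading (a minimal sketch; substitute your actual model ID and loading arguments):

```python
from transformers import AutoModelForCausalLM

# Load exactly as in your script; device_map="auto" populates hf_device_map.
model = AutoModelForCausalLM.from_pretrained(
    "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",  # your model
    device_map="auto",
    torch_dtype="auto",
)
# Maps each submodule to a device, e.g. 0 (GPU), "cpu", or "disk".
print(model.hf_device_map)
```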
Hi, thanks!
It's one A100 80GB card. Driver version: 535.104.12, CUDA version: 12.2.
`model.hf_device_map` is: `0` for all layers up to and including `model.layers.41`, plus `model.embed_tokens`; `'cpu'` from `model.layers.42` through `model.layers.79`, plus `model.norm`, `model.rotary_emb`, and `lm_head`.
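Roughly, written out as a dict (abridged):

```python
{
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    # ... layers 1-40 also on GPU 0 ...
    "model.layers.41": 0,
    "model.layers.42": "cpu",
    # ... layers 43-78 also on CPU ...
    "model.layers.79": "cpu",
    "model.norm": "cpu",
    "model.rotary_emb": "cpu",
    "lm_head": "cpu",
}
```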
Regarding the stack trace: unfortunately I can't paste the full trace, as the A100 is in an air-gapped environment. I did my best to manually type in the indicative bits from the stack trace in the issue 🙏
@hibukipanim Could you share the accelerate version you are using? There was a recent release, and the new version (`1.1.1`) triggers the above error. While we work on a patch, using the older version (`1.0.1`) will solve the problem. The command to install it is

`python3 -m pip install accelerate==1.0.1`

Thank you, and let us know!
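You can confirm which version is active in your environment with:

```python
import accelerate

print(accelerate.__version__)  # should print 1.0.1 after the downgrade
```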
Hi there,
I traced this further and found that it's actually a very subtle issue caused by https://github.com/huggingface/accelerate/pull/3204. I've included a detailed description of the bug and its solution in the PRs below. Adopting either of those changes will fix the OOM issue; please let me know otherwise.
Thank you again for reporting this bug! Your efforts helped us identify and address it quickly! 🙂
Hi,
Can confirm I had `accelerate==1.1.1`, and after downgrading to `accelerate==1.0.1` it worked with `AutoModelForCausalLM`.
Thank you! 🙏
Wait, but I noticed something strange ... the saved model is not quantized when I'm using `AutoModelForCausalLM`. It's double the size compared to one saved with `SparseAutoModelForCausalLM`, and about the same size as the original, despite following `examples/quantization_w8a8_fp8/llama3_example.py`.
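For context, this is how I'm comparing the checkpoints (a rough sanity check; `SAVE_DIR` here stands in for my output directory):

```python
import json
import os

SAVE_DIR = "Llama-3.1-Nemotron-70B-Instruct-HF-FP8-Dynamic"  # placeholder

# A compressed checkpoint should carry a quantization_config in config.json.
with open(os.path.join(SAVE_DIR, "config.json")) as f:
    config = json.load(f)
print(config.get("quantization_config", "<missing -> weights saved unquantized>"))

# And the shards on disk should be roughly half the bf16 size for fp8.
shard_bytes = sum(
    os.path.getsize(os.path.join(SAVE_DIR, name))
    for name in os.listdir(SAVE_DIR)
    if name.endswith(".safetensors")
)
print(f"checkpoint size: {shard_bytes / 1e9:.1f} GB")
```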
@hibukipanim Ok let me check on that
@hibukipanim Can you share your commit version and the example script? The script won't save the quantized version if `oneshot(model=model, recipe=recipe)` is not run, as in the sketch below.
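For reference, the flow in that example looks roughly like this (a sketch of the 0.3.0-era API; `MODEL_ID` and `SAVE_DIR` are placeholders):

```python
from transformers import AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

MODEL_ID = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"

model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# fp8-dynamic: quantize Linear layers, leave lm_head untouched.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# This call applies the quantization in place; skipping it means
# save_pretrained() writes the original, unquantized weights.
oneshot(model=model, recipe=recipe)

model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```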
Describe the bug
Trying to quantize `nvidia/Llama-3.1-Nemotron-70B-Instruct-HF` to fp8-dynamic works with `SparseAutoModelForCausalLM`, but when I replace it with `AutoModelForCausalLM`, as the new docs suggest, I get a CUDA OOM during "Saving checkpoints shards".
Environment
1x A100 80GB, 150GB CPU RAM
llmcompressor 0.3.0, transformers 4.43.3, torch 2.3.1
To Reproduce
Follow the example in `examples/quantization_w8a8_fp8/llama3_example.py`, just with `nvidia/Llama-3.1-Nemotron-70B-Instruct-HF`.
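i.e., the same script with just the model class and ID swapped (sketch of the change):

```python
# Same as examples/quantization_w8a8_fp8/llama3_example.py, except:
from transformers import AutoModelForCausalLM  # instead of SparseAutoModelForCausalLM

MODEL_ID = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
# ...rest of the example unchanged; the OOM hits while saving checkpoint shards.
```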
Errors
Thanks! 🙏