vllm-project / llm-compressor

Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
Apache License 2.0

CUDA OOM while saving compressed Llama-3.1-70b with AutoModelForCausalLM #928

Open hibukipanim opened 5 days ago

hibukipanim commented 5 days ago

Describe the bug

Trying to quantize nvidia/Llama-3.1-Nemotron-70B-Instruct-HF to fp8-dynamic works with SparseAutoModelForCausalLM, but when I replace it with AutoModelForCausalLM as the new docs suggest, I get a CUDA OOM during "Saving checkpoint shards".

Environment

1x A100 80GB, 150 GB CPU RAM

llm-compressor 0.3.0, transformers 4.43.3, torch 2.3.1

To Reproduce

Follow the example in examples/quantization_w8a8_fp8/llama3_example.py, just with nvidia/Llama-3.1-Nemotron-70B-Instruct-HF as the model.
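For reference, a sketch of roughly what that example does, with the model swapped in (exact imports and arguments may differ slightly across llm-compressor versions):

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from llmcompressor.modifiers.quantization import QuantizationModifier
    from llmcompressor.transformers import oneshot

    MODEL_ID = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, device_map="auto", torch_dtype="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    # FP8-dynamic: quantize Linear weights to FP8; activations are quantized dynamically at runtime
    recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

    # apply the recipe (no calibration data needed for dynamic activation quantization)
    oneshot(model=model, recipe=recipe)

    # save the compressed checkpoint -- this is the step that OOMs
    SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
    model.save_pretrained(SAVE_DIR, save_compressed=True)
    tokenizer.save_pretrained(SAVE_DIR)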

Errors

Saving checkpoint shards: 12% ...
...
File "transformers/modeling_utils.py", line 2739, in save_pretrained
  shard_state_dict = get_state_dict_from_offload(module, module_name, shard_state_dict)
...
File "accelerate/utils/modeling", line 321, in set_module_tensor_to_device
  new_value = old_value.to(device)

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU

Thanks! 🙏

kylesayrs commented 4 days ago

@hibukipanim Thank you for opening an issue! Could you please share

  1. Your CUDA device setup
  2. model.hf_device_map of the model after you load it? (a quick snippet for grabbing 1 and 2 is sketched after this comment)
  3. The full stack trace of the error you're encountering?

Thanks!
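For items 1 and 2 above, a minimal snippet along these lines should do, assuming the model was loaded with a device_map:

    import torch

    # 1. CUDA device setup: how many GPUs, their names, and total memory
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i), torch.cuda.get_device_properties(i).total_memory)

    # 2. where accelerate placed each submodule after loading
    print(model.hf_device_map)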

hibukipanim commented 3 days ago

Hi, thanks,

It's a single A100 80GB card. Driver version: 535.104.12, CUDA version: 12.2

model.hf_device_map is:

Regarding the stack trace: unfortunately I can't paste the full trace, since the A100 is in an air-gapped environment. I typed the indicative bits into the issue manually as best I could 🙏

horheynm commented 3 days ago

@hibukipanim

Could you share the accelerate version you are using? There was a recent release, and the new version (1.1.1) triggers the above error.

While we work on the patch, using the older version (1.0.1) will solve the problem.

The command to install it is python3 -m pip install accelerate==1.0.1
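If it helps, a quick way to double-check which version is actually active in the environment:

    import accelerate
    print(accelerate.__version__)  # expect 1.0.1 after the downgrade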

Thank you and let us know!

kylesayrs commented 2 days ago

Hi there,

I traced this issue further and found that it is actually caused by a very subtle issue introduced in https://github.com/huggingface/accelerate/pull/3204. I've included a detailed description of the bug and its solution in the PRs below. Adopting either of these changes will fix the OOM issue; please let me know if it doesn't.

Thank you again for reporting this bug! Your efforts helped us identify and address it quickly! 🙂
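For readers following along, a rough, hypothetical sketch of the failure pattern the traceback above points at (not the actual accelerate/transformers code): while each checkpoint shard's state dict is assembled, offloaded tensors are brought back via old_value.to(device); if that device is the GPU rather than the CPU, the offloaded weights accumulate in GPU memory during saving, which a 70B model cannot absorb on a single 80GB card.

    import torch

    def gather_offloaded(offloaded: dict, onload_device: str) -> dict:
        # hypothetical illustration: every offloaded tensor is copied to onload_device
        # while a shard's state dict is being built
        return {name: t.to(onload_device) for name, t in offloaded.items()}

    # gathering onto "cuda" keeps an extra copy of all offloaded weights resident on the GPU
    # until the shard is written; gathering onto "cpu" keeps GPU memory flat during saving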

hibukipanim commented 1 day ago

Hi,

Can confirm I had accelerate==1.1.1, and after downgrading to accelerate==1.0.1 it worked with AutoModelForCausalLM.

Thank you! 🙏

hibukipanim commented 1 day ago

Wait, but I noticed something strange... the saved model is not quantized when I use AutoModelForCausalLM. It's double the size of the one saved with SparseAutoModelForCausalLM, and about the same size as the original, despite following examples/quantization_w8a8_fp8/llama3_example.py.
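One way to sanity-check whether a saved checkpoint is actually compressed (the directory name below is just a placeholder for the output dir): a compressed FP8 checkpoint should carry a quantization_config in its config.json, and its weight tensors are typically stored as float8 in the safetensors shards.

    import glob, json, os
    from safetensors import safe_open

    SAVE_DIR = "Llama-3.1-Nemotron-70B-Instruct-HF-FP8-Dynamic"  # placeholder output dir

    # a compressed checkpoint records its scheme here; None suggests it was saved unquantized
    with open(os.path.join(SAVE_DIR, "config.json")) as f:
        print(json.load(f).get("quantization_config"))

    # spot-check weight dtypes; FP8 weights typically show up as torch.float8_e4m3fn
    shard = sorted(glob.glob(os.path.join(SAVE_DIR, "*.safetensors")))[0]
    with safe_open(shard, framework="pt") as f:
        for key in list(f.keys())[:5]:
            print(key, f.get_tensor(key).dtype)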

horheynm commented 13 hours ago

@hibukipanim Ok let me check on that

horheynm commented 11 hours ago

@hibukipanim Can you share your commit version and the example script? The script won't save the quantized model if oneshot(model=model, recipe=recipe) is not run.