huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

AutoGPTQ quantization gets stuck without any progress #29494

Closed franchukpetro closed 6 months ago

franchukpetro commented 6 months ago

System Info

Hardware details:
CPU - AMD Ryzen Threadripper PRO 3955WX 16-Cores
GPU - NVIDIA RTX 4090

Software details:
OS - Ubuntu 22.04.3 LTS
CUDA - 12.1 (I've also tried 11.8)
Python - 3.10.13
AutoGPTQ - 0.8.0.dev0+cu1211 (built from source; I've also tried the 0.7.0 and 0.6.0 releases via pip install, with matching CUDA and PyTorch versions)
PyTorch - 2.2.0+cu121 (I've also tried other versions to match the requirements of the older AutoGPTQ releases)
transformers - I no longer have the exact version, since I've deleted the VM, but it was the default version installed together with auto-gptq.

Who can help?

@SunMarc @younesbelkada

Information

Tasks

Reproduction

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_path = 'mistralai/Mistral-7B-v0.1'
quant_path = 'Mistral-7b-v0.1-gptq-int4'

print("Tokenizer initialization started...")

tokenizer = AutoTokenizer.from_pretrained(model_path)

print("Tokenizer initialization ended...")

# 4-bit GPTQ config; the "c4" dataset is used for calibration
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

print("GPTQ config initialization ended...")

# Passing a quantization_config here triggers the actual GPTQ quantization
quantized_model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", quantization_config=gptq_config)

print("Model quantization ended...")

# Move to CPU before saving the quantized checkpoint
quantized_model.to("cpu")
quantized_model.save_pretrained(quant_path)
tokenizer.save_pretrained(quant_path)

Expected behavior

I'm trying to run GPTQ quantization of the Mistral 7B model on an Nvidia 4090 GPU on the Vast.ai platform. However, the quantization process consistently gets stuck after the model weights and the calibration data are loaded - there is no progress bar or any other text output.

I've tried a lot of different combinations of CUDA, PyTorch and AutoGPTQ, but none of them worked. Moreover, I've tried to quantize the Falcon 1B & 7B models, but they didn't succeed either - Falcon 1B started the quantization process, but then crashed with a CUDA out of memory error (which is weird, since the 4090 GPU has 24GB of VRAM, which is more than enough to load Falcon 1B).

younesbelkada commented 6 months ago

Hi @franchukpetro, hmm, that shouldn't happen IMO. Can you try the latest version of the optimum library?

franchukpetro commented 6 months ago

@younesbelkada thanks for the response!

I'm actually installing it from source, as recommended in this tutorial, so I guess it should already be the latest version, shouldn't it?

pip install git+https://github.com/huggingface/optimum.git

franchukpetro commented 6 months ago

Okay, so I created two new VMs and let the quantization process hang for more than an hour; it resulted in a CUDA out of memory error in both cases. For both VMs I used the docker image pytorch/pytorch:2.2.0-cuda12.1-cudnn8-devel, which ships with CUDA v12.1 (verified with nvcc) and PyTorch v2.2.0.
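
For reference, the same versions can also be sanity-checked from inside Python (nvcc reports the toolkit version, while torch.version.cuda reports the CUDA build PyTorch was compiled against):

import torch

# Report what the container actually exposes to Python
print("PyTorch:", torch.__version__)              # expected 2.2.0
print("CUDA (torch build):", torch.version.cuda)  # expected 12.1
print("cuDNN:", torch.backends.cudnn.version())
print("GPU:", torch.cuda.get_device_name(0))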

First VM

I followed the installation steps described in the Transformers quantization tutorial:

pip install auto-gptq
pip install git+https://github.com/huggingface/optimum.git
pip install git+https://github.com/huggingface/transformers.git
pip install --upgrade accelerate

transformers-cli env output:

Second VM

For the second VM I uninstalled torch, torchvision and torchaudio, and manually installed the PyTorch 2.2.0 build specifically for CUDA 12.1:

pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121

Next, I installed AutoGPTQ from source, and installed optimum, transformers and accelerate in the same way as on the first VM.

transformers-cli env output:

Error

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB. GPU 0 has a total capacity of 23.65 GiB of which 194.06 MiB is free. Process 3736189 has 23.45 GiB memory in use. Of the allocated memory 22.77 GiB is allocated by PyTorch, and 256.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
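
The expandable_segments hint in the traceback only mitigates allocator fragmentation rather than the hang itself; a minimal sketch of applying it anyway, just to rule that out (the variable has to be set before PyTorch initializes its CUDA allocator):

import os

# Must be set before the CUDA allocator is initialized,
# so set it before importing torch.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

print(torch.cuda.is_available())  # the quantization script would follow from here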

As I already mentioned, I was also getting CUDA out of memory errors with smaller models, such as Falcon 1B, which can't even theoretically reach the GPU's memory capacity (I did similar experiments with the Falcon models on a 3080 with 16GB VRAM, and everything ran smoothly).

SunMarc commented 6 months ago

Hi @franchukpetro, normally you should see a tqdm progress bar that tracks the quantization progress. If you don't see it, there might be an issue either when we load the model or when we process the dataset. LMK which case it is - it will help me debug this issue!
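
A rough way to separate those two cases is to run each step in isolation; the sketch below is only a debugging aid and assumes the "c4" calibration option corresponds to the English allenai/c4 split on the Hub:

from itertools import islice

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "mistralai/Mistral-7B-v0.1"

# Case 1: does plain (non-quantized) loading succeed on the 4090?
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", torch_dtype=torch.float16
)
print("model loaded")

# Case 2: can a few calibration samples be streamed and tokenized?
tokenizer = AutoTokenizer.from_pretrained(model_path)
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
for sample in islice(stream, 4):
    batch = tokenizer(sample["text"], return_tensors="pt")
    print("tokenized", batch["input_ids"].shape[1], "tokens")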

franchukpetro commented 6 months ago

I don't see the tqdm progress bar, and that's what worries me as well.

Progress bars do appear for model weight loading and for downloading the quantization dataset; the last step visible with a progress bar is the generation of the training splits. After that there is no output for around 1-2 hours, and then it crashes with the CUDA OOM error.

Interestingly, I was watching the GPU load from the vast.ai cloud platform: the GPU is not utilized at all, and the GPU VRAM is almost completely free at first, but after a few minutes of hanging it climbs to roughly 15GB and stays there until the crash.

SunMarc commented 6 months ago

The PR above should do the trick. If you don't want to check out that PR, you can pass gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer, model_seqlen=2048). The issue lies in the model_seqlen used to create the calibration dataset: for Mistral it is a very large value (32768), and this caused problems. LMK if this works!
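
Put together, the workaround is a one-line change to the reproduction script above; a minimal sketch, with 2048 taken from the suggestion in this thread rather than tuned for quality:

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_path = 'mistralai/Mistral-7B-v0.1'
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Cap the calibration sequence length; Mistral's default max_position_embeddings
# is 32768, which makes building the calibration dataset blow up in memory.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer, model_seqlen=2048)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", quantization_config=gptq_config
)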

franchukpetro commented 6 months ago

That worked, quantization has started successfully!

Thank you, you saved me!

franchukpetro commented 6 months ago

@SunMarc I've hit the same CUDA out of memory issue with Falcon 1B/7B and Gemma 2B/7B. Setting the model_seqlen parameter doesn't help, since these models have a different API as far as I understand.

I've tried building optimum from source, with that small fix for GPTQ quantization, but that didn't help either.

Is this a common problem with GPTQ right now, or could it be an issue with the setup on my side?
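
One quick thing to check before digging further is whether the same sequence-length default even applies to those models; a small sketch that only inspects the configs (the checkpoint IDs here are illustrative, and the attribute name can vary across architectures):

from transformers import AutoConfig

# Example checkpoints; substitute the exact ones used above.
for model_id in ["tiiuae/falcon-7b", "google/gemma-2b"]:
    config = AutoConfig.from_pretrained(model_id)
    seqlen = getattr(config, "max_position_embeddings", None)
    print(model_id, "max_position_embeddings:", seqlen)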

SunMarc commented 6 months ago

Hey @franchukpetro, in the Falcon/Gemma case, does the quantization start (tqdm progress bar) or does it just OOM?

franchukpetro commented 6 months ago

@SunMarc As far as I remember, for Falcon the quantization starts but crashes in the middle of the process. For Gemma I don't remember 100%, but I think it crashed right at the start, before the tqdm progress bar appeared.

SunMarc commented 6 months ago

Thanks for the details, I'll investigate a bit when I get the time!