huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

BitsAndBytes 4-bit quantization: error message typo and logical errors in error message handling #30751

Closed jkterry1 closed 2 months ago

jkterry1 commented 3 months ago

System Info

Newest versions of transformers, accelerate, and bitsandbytes in a Docker container (nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04) on an Ubuntu 20.04 host, and also on an Arch Linux laptop

Who can help?

@SunMarc and @younesbelkada

Reproduction

import torch
from transformers import BitsAndBytesConfig, RobertaForSequenceClassification, RobertaTokenizer

eval_model_path = "hubert233/GPTFuzz"
tokenizer = RobertaTokenizer.from_pretrained(eval_model_path)
eval_model = RobertaForSequenceClassification.from_pretrained(
    eval_model_path,
    low_cpu_mem_usage=True,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16
    ),
)
eval_model.eval()

This also impacts other models when using BNB 4-bit quantization, e.g. meta-llama/Llama-2-7b-chat-hf.

Expected behavior

On a system where CUDA isn't working (e.g. a laptop with no NVIDIA GPU, or a container that wasn't properly connected to the host OS driver), running that code snippet gives an import error:

(screenshot of the import error)

The logic at issue starts here:

https://github.com/huggingface/transformers/blob/e0c3cee17085914bbe505c159beeb8ae39bc37dd/src/transformers/quantizers/quantizer_bnb_4bit.py#L60

  1. The if not torch.cuda.is_available(): check is seemingly bypassed on multiple systems with no working torch CUDA functionality, so the broken state falls through to if not (is_accelerate_available() and is_bitsandbytes_available()):
  2. is_bitsandbytes_available() calls out to https://github.com/huggingface/transformers/blob/e0c3cee17085914bbe505c159beeb8ae39bc37dd/src/transformers/utils/import_utils.py#L749 and is_accelerate_available() calls out to https://github.com/huggingface/transformers/blob/e0c3cee17085914bbe505c159beeb8ae39bc37dd/src/transformers/utils/import_utils.py#L819. When the first check is bypassed, the torch.cuda.is_available() call built into is_bitsandbytes_available() fails, producing the wildly confusing error message "Using bitsandbytes 8-bit quantization requires Accelerate: pip install accelerate and the latest version of bitsandbytes: pip install -i https://pypi.org/simple/ bitsandbytes" even though accelerate and bitsandbytes are both installed.
  3. Ignoring the torch issue that got me into this in the first place, I believe that and is the incorrect logical operator in https://github.com/huggingface/transformers/blob/e0c3cee17085914bbe505c159beeb8ae39bc37dd/src/transformers/quantizers/quantizer_bnb_4bit.py#L63, because the combined check emits an error message blaming accelerate even when accelerate is installed and only the bitsandbytes side of the check failed; the two conditions should be checked separately, with separate messages (see the sketch after this list).
  4. The error message says "Using bitsandbytes 8-bit quantization" in the 4-bit version of the file
  5. I suspect that the original issue with the torch cuda availability check being erroneously passed is caused by the logic in this line: https://github.com/huggingface/transformers/blob/e0c3cee17085914bbe505c159beeb8ae39bc37dd/src/transformers/quantizers/quantizer_bnb_4bit.py#L29
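
For illustration, a rough sketch of what splitting that combined check could look like (this is not the actual transformers code; validate_environment_sketch is a hypothetical name, the messages are placeholders, and the two availability helpers are imported from transformers.utils as in the linked files):

import torch
from transformers.utils import is_accelerate_available, is_bitsandbytes_available

def validate_environment_sketch():
    # Report a missing GPU with its own message instead of falling through.
    if not torch.cuda.is_available():
        raise RuntimeError("No GPU found. A GPU is needed for 4-bit bitsandbytes quantization.")
    # Check accelerate and bitsandbytes separately, rather than with one
    # combined `and`, so the error names the package that is actually missing.
    if not is_accelerate_available():
        raise ImportError("Using bitsandbytes 4-bit quantization requires Accelerate: pip install accelerate")
    if not is_bitsandbytes_available():
        raise ImportError("Using bitsandbytes 4-bit quantization requires the latest version of bitsandbytes: pip install -U bitsandbytes")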

Additionally, having is_bitsandbytes_available() call torch.cuda.is_available() seems like a non-obvious and non-modular design choice that is likely to result in similarly misleading and hard-to-debug error messages in the future.
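
A minimal sketch of the more modular alternative I have in mind (hypothetical helper names, not the current import_utils implementation): the availability helper only answers whether the package is importable, and the hardware check is done explicitly at the call site:

import importlib.util
import torch

def is_bitsandbytes_installed() -> bool:
    # Hypothetical helper: only answers "is the package importable?",
    # with no hidden torch.cuda.is_available() call inside.
    return importlib.util.find_spec("bitsandbytes") is not None

def can_use_bnb_4bit() -> bool:
    # The caller checks the hardware explicitly, so a failure here is
    # reported as a missing GPU rather than as a missing package.
    return is_bitsandbytes_installed() and torch.cuda.is_available()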

I think that the underlying logical issues here also likely caused this other GitHub issue: https://github.com/huggingface/transformers/issues/29177#issuecomment-1957180468

cw235 commented 3 months ago

It seems like you are facing an issue with running the code snippet provided in a system where CUDA is not available, resulting in an import error related to torch.cuda. The issue stems from the logic within the Transformers library that checks for CUDA availability before enabling BitsAndBytes functionality.

Here are some steps and potential solutions to address this problem:

  1. Identifying the Issue:
    • The code snippet includes logic that checks for CUDA availability before enabling BitsAndBytes functionality. This check ensures CUDA operations can actually be used, but on systems without NVIDIA GPUs or a proper CUDA configuration it surfaces as an import error.
  2. Solutions:
    • Option 1: Error Handling
      • Modify the code to handle the case where CUDA is not available gracefully, e.g. with conditional logic that skips the CUDA-dependent path.
    • Option 2: Environment Configuration
      • Make sure your Docker container is correctly configured to access the host system's CUDA driver; this is required for running CUDA-dependent operations inside the container.
    • Option 3: Alternative Device Configuration
      • Consider specifying a different device for model computation, such as the CPU, when CUDA is not available; see the sketch at the end of this comment. This approach avoids errors from CUDA dependencies.
  3. Further Investigation:
    • Review the logic within the quantizer module of the Transformers library that handles the CUDA availability checks; understanding it may explain why the conditional checks are not working as expected in your configuration.
  4. Community Support:
    • Reach out to the library maintainers, such as @SunMarc and @younesbelkada, for help debugging the CUDA availability checks and BitsAndBytes behavior.

By exploring these solutions and seeking guidance from the Transformers community, you can work towards resolving the import error caused by the CUDA availability check when running the provided code snippet in systems without NVIDIA GPUs or CUDA support.
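
As a rough illustration of Option 3 (not something the library does automatically), one could gate the quantization config on CUDA availability and fall back to a plain CPU load, reusing the checkpoint from the reproduction above:

import torch
from transformers import BitsAndBytesConfig, RobertaForSequenceClassification, RobertaTokenizer

eval_model_path = "hubert233/GPTFuzz"
tokenizer = RobertaTokenizer.from_pretrained(eval_model_path)

if torch.cuda.is_available():
    # GPU available: load the 4-bit quantized model as in the reproduction.
    eval_model = RobertaForSequenceClassification.from_pretrained(
        eval_model_path,
        device_map="auto",
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16
        ),
    )
else:
    # No usable CUDA device: load in full precision on the CPU instead.
    eval_model = RobertaForSequenceClassification.from_pretrained(eval_model_path)

eval_model.eval()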

SunMarc commented 3 months ago

Hi @jkterry1, thanks for this detailed report! For 3. and 4., let me know if you want to submit a PR to fix the logger message and split it into two checks! Otherwise, I can do it! For 1. and 5., it is indeed strange that the first CUDA check passes but the second check inside bitsandbytes fails. We can potentially remove the CUDA import check in is_bitsandbytes_available, but it would be better for the first check to work correctly.

jkterry1 commented 3 months ago

Thank you so much!

If you'd be willing to do PRs yourself for 3 and 4, I'd be extraordinarily grateful (also note that you'll likely need to make the fix in 3 to the 8-bit version of this file as well).

Regarding options 1 and 4, I personally believe that you should remove the CUDA check from is_bitsandbytes_available, so that when the function returns False it is only for the expected reason, and perform all CUDA availability checks outside of it to prevent future unexpected errors.

Additionally, I think it would likely be prudent to verify that the CUDA check described in 4, which threw a false positive for me and started me down this journey, correctly returns a negative in test environments without GPUs (a minimal test sketch follows below). It threw a false positive both in the Docker image I described and on a random Arch Linux laptop, which suggests either that something very unlikely happened to me or that something is wrong in the logic.
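
One possible shape for such a test (a hypothetical sketch, assuming the quantizer class in quantizer_bnb_4bit.py is named Bnb4BitHfQuantizer and that validate_environment can be called directly): CUDA availability is forced to False so the test behaves the same on GPU and GPU-less CI machines.

from unittest.mock import patch

import pytest
from transformers import BitsAndBytesConfig
from transformers.quantizers.quantizer_bnb_4bit import Bnb4BitHfQuantizer

def test_bnb_4bit_validate_environment_without_cuda():
    quantizer = Bnb4BitHfQuantizer(BitsAndBytesConfig(load_in_4bit=True))
    # Force the GPU-less case regardless of the machine running the test.
    with patch("torch.cuda.is_available", return_value=False):
        # Expect a clear failure (missing GPU), not a misleading package error.
        with pytest.raises((RuntimeError, ImportError)):
            quantizer.validate_environment()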

jkterry1 commented 2 months ago

Bumping so this doesn't go stale

SunMarc commented 2 months ago

Thx for the reminder ! I've created the PR !