huggingface / transformers

πŸ€— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

cuda device is wrongly requested instead of xpu when running pipeline(device_map="auto", max_memory={0: 1.0e+10}) #31941

Closed dvrogozh closed 2 months ago

dvrogozh commented 3 months ago

Found with these code versions: https://github.com/huggingface/transformers/commit/7f79a97399bb52aad8460e1da2f36577d5dccfed, https://github.com/huggingface/accelerate/commit/e1247de01e0733c5d21075cb6f39b2605f4be123, https://github.com/pytorch/pytorch/commit/3477ee38e4dd1429ecfd7e6f20a30cce0f4f78e7. This is an issue with XPU support in stock PyTorch (i.e. without using IPEX).

Assume the system has an XPU GPU and no CUDA device, and that the model does not fit into the device memory. In this case execution fails because some tensors are wrongly sent to a CUDA device. I noticed this while trying to run the example below.

The following example script reproduces the issue with the Llama 3 8B model by imposing a memory constraint on the XPU device:

import transformers
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# max_memory caps XPU device 0 at ~10 GB so the bf16 8B model cannot fully fit,
# forcing accelerate to offload part of it to CPU/disk
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16, "max_memory": {0: 1.0e+10}},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(outputs[0]["generated_text"][-1])
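(Side note: to double-check that the memory cap really forces offloading, the device map computed by accelerate can be inspected after the pipeline is built. A minimal check, assuming the pipeline above loaded successfully; on an XPU-only system the values should only refer to device 0 (the XPU), "cpu" and "disk", never to a CUDA device.)

# Inspect where accelerate placed each submodule: device index 0 (the XPU),
# "cpu" for RAM offload, or "disk" for disk offload.
print(pipeline.model.hf_device_map)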

Log output:

2024-07-12 21:31:29.074576: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-07-12 21:31:29.102067: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-12 21:31:29.102092: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-12 21:31:29.102890: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-07-12 21:31:29.107401: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-12 21:31:29.681850: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4/4 [00:01<00:00,  2.87it/s]
Some parameters are on the meta device device because they were offloaded to the disk.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
/home/gta/git/huggingface/transformers/src/transformers/generation/utils.py:1537: UserWarning: The operator 'aten::isin.Tensor_Tensor_out on the XPU backend is falling back to run on the CPU. (Triggered internally at /home/gta/git/pytorch/pytorch/build/aten/src/ATen/xpu/RegisterXPU.cpp:5706.)
  if eos_token_id is not None and torch.isin(elements=eos_token_id, test_elements=pad_token_id).any():
Traceback (most recent call last):
  File "/home/gta/examples/meta-llama/Meta-Llama-3-8B-Instruct/run.py", line 23, in <module>
    outputs = pipeline(
  File "/home/gta/git/huggingface/transformers/src/transformers/pipelines/text_generation.py", line 257, in __call__
    return super().__call__(Chat(text_inputs), **kwargs)
  File "/home/gta/git/huggingface/transformers/src/transformers/pipelines/base.py", line 1254, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
  File "/home/gta/git/huggingface/transformers/src/transformers/pipelines/base.py", line 1261, in run_single
    model_outputs = self.forward(model_inputs, **forward_params)
  File "/home/gta/git/huggingface/transformers/src/transformers/pipelines/base.py", line 1161, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "/home/gta/git/huggingface/transformers/src/transformers/pipelines/text_generation.py", line 351, in _forward
    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
  File "/home/gta/git/pytorch/pytorch/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/gta/git/huggingface/transformers/src/transformers/generation/utils.py", line 1972, in generate
    result = self._sample(
  File "/home/gta/git/huggingface/transformers/src/transformers/generation/utils.py", line 2943, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/home/gta/git/pytorch/pytorch/torch/nn/modules/module.py", line 1716, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/gta/git/pytorch/pytorch/torch/nn/modules/module.py", line 1727, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/gta/git/huggingface/accelerate/src/accelerate/hooks.py", line 169, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/gta/git/huggingface/transformers/src/transformers/models/llama/modeling_llama.py", line 1069, in forward
    outputs = self.model(
  File "/home/gta/git/pytorch/pytorch/torch/nn/modules/module.py", line 1716, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/gta/git/pytorch/pytorch/torch/nn/modules/module.py", line 1727, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/gta/git/huggingface/transformers/src/transformers/models/llama/modeling_llama.py", line 873, in forward
    layer_outputs = decoder_layer(
  File "/home/gta/git/pytorch/pytorch/torch/nn/modules/module.py", line 1716, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/gta/git/pytorch/pytorch/torch/nn/modules/module.py", line 1727, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/gta/git/huggingface/accelerate/src/accelerate/hooks.py", line 169, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/gta/git/huggingface/transformers/src/transformers/models/llama/modeling_llama.py", line 609, in forward
    hidden_states = self.input_layernorm(hidden_states)
  File "/home/gta/git/pytorch/pytorch/torch/nn/modules/module.py", line 1716, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/gta/git/pytorch/pytorch/torch/nn/modules/module.py", line 1727, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/gta/git/huggingface/accelerate/src/accelerate/hooks.py", line 164, in new_forward
    args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
  File "/home/gta/git/huggingface/accelerate/src/accelerate/hooks.py", line 335, in pre_forward
    value = self.weights_map[name]
  File "/home/gta/git/huggingface/accelerate/src/accelerate/utils/offload.py", line 118, in __getitem__
    return self.dataset[f"{self.prefix}{key}"]
  File "/home/gta/git/huggingface/accelerate/src/accelerate/utils/offload.py", line 171, in __getitem__
    tensor = f.get_tensor(weight_info.get("weight_name", key))
  File "/home/gta/git/pytorch/pytorch/torch/cuda/__init__.py", line 305, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5 @sywangyi @yao-matrix

dvrogozh commented 3 months ago

There was an assumption that the PR https://github.com/pytorch/pytorch/pull/129119 would address this issue. Unfortunately it does not - I still see the same failure. I added printouts to the toDevice() function and can confirm that it is called and that the branch patched by PR 129119 is being hit. However, we still ultimately get cuda:0 here: https://github.com/pytorch/pytorch/blob/95046c86e3547e46ef5733925e02278e46c5c6d4/torch/csrc/utils/python_arg_parser.h#L823. Thus, I think there is one more place somewhere that hardcodes cuda.

dvrogozh commented 3 months ago

The root cause is that the safetensors library hardcodes the CUDA device whenever only a device index is provided, here: https://github.com/huggingface/safetensors/blob/079781fd0dc455ba0fe851e2b4507c33d0c0d407/bindings/python/src/lib.rs#L297
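The failure can be reproduced directly with safetensors, without transformers or accelerate. A minimal sketch, assuming an XPU-only PyTorch build and a hypothetical local checkpoint file model.safetensors:

from safetensors import safe_open

# Passing a bare device index takes the hardcoded CUDA branch in the Rust
# bindings and fails with "Torch not compiled with CUDA enabled":
with safe_open("model.safetensors", framework="pt", device=0) as f:
    tensor = f.get_tensor(f.keys()[0])

# Passing an explicit device string works as expected:
with safe_open("model.safetensors", framework="pt", device="xpu:0") as f:
    tensor = f.get_tensor(f.keys()[0])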

The fix should probably be to construct the device by calling torch.device(N) instead.
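In Python terms the proposal amounts to something like the sketch below (the actual change belongs in the Rust bindings; it assumes torch.device(N) resolves a bare ordinal against the accelerator backend the PyTorch build actually provides):

import torch

def index_to_device(index: int) -> torch.device:
    # Instead of hardcoding f"cuda:{index}", let PyTorch decide which
    # backend (CUDA, XPU, ...) a bare device ordinal refers to.
    return torch.device(index)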

Filed https://github.com/huggingface/safetensors/issues/499.

amyeroberts commented 3 months ago

Hi @dvrogozh, thanks for raising this issue, for the deep dive, and for taking the time to write up such a detailed explanation of the problem and what you've tried - it's incredibly appreciated!

If I've understood correctly, there isn't anything to do on the transformers side and it's pending a resolution in the safetensors library?

cc @muellerzr @sun for reference as this touches accelerate and offloading logic

dvrogozh commented 3 months ago

> If I've understood correctly, there isn't anything to do on the transformers side and it's pending a resolution in the safetensors library?

That is right. I initially filed this issue here since the transformers use case was affected and the root cause was not clear at the time of filing.

amyeroberts commented 3 months ago

@dvrogozh OK, no worries - just so we know whether there's an action point for the team. We can leave this open until the safetensors issue is resolved and we can confirm that the pipeline runs as expected.

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.