huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

DoLa decoding fails on cuda #31996

Closed kosstbarz closed 3 weeks ago

kosstbarz commented 1 month ago

DoLa decoding on the Mixtral model with a multi-GPU setup returns an error:

Traceback (most recent call last):
  File "~/src/evaluation/test.py", line 8, in <module>
    generate_ids = model.generate(inputs.input_ids, max_length=200, dola_layers='low')
  File "~/src/evaluation/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "~/src/evaluation/.venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1921, in generate
    result = self._dola_decoding(
  File "~/src/evaluation/.venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2366, in _dola_decoding
    next_token_logits = _dola_select_contrast(
  File "~/src/evaluation/.venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 4440, in _dola_select_contrast
    kl1 = F.kl_div(log_softmax_mature_layer[None, :, :], avg_dist, reduction="none").mean(-1)
  File "~/src/evaluation/.venv/lib/python3.10/site-packages/torch/nn/functional.py", line 2988, in kl_div
    reduced = torch.kl_div(input, target, reduction_enum, log_target=log_target)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

System Info

Who can help?

No response

Information

Tasks

Reproduction

This script uses the new DoLa decoding feature with a multi-GPU setup. Run the file test.py:

import transformers

# Load the model sharded across all visible GPUs
model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
model = transformers.AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

# Tokenize the prompt (tensors stay on CPU) and generate with DoLa decoding
prompt = "Hey, are you conscious? Can you talk to me?\n\n"
inputs = tokenizer(prompt, return_tensors="pt")
generate_ids = model.generate(inputs.input_ids, max_length=200, dola_layers='low')
result = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

with this command: CUDA_VISIBLE_DEVICES=0,1,2,3 python test.py

Expected behavior

No errors.

amyeroberts commented 1 month ago

cc @gante

ashishpatel26 commented 1 month ago

The issue arises because tensors are being allocated on different devices (CUDA and CPU), which causes the RuntimeError. This often happens in multi-GPU setups when certain layers or operations are not placed on the same device.

Possible solution: ensure all tensors are on the same device (CUDA) before calling generate.

Please try the script below; it may help.

import transformers

# Load the model and tokenizer
model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Define the prompt and tokenize
prompt = "Hey, are you conscious? Can you talk to me?\n\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate the output
generate_ids = model.generate(inputs.input_ids, max_length=200, dola_layers='low')

# Decode and print the result
result = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(result)

By moving the input tensors to CUDA using .to("cuda"), we ensure all operations are performed on the same device, thus resolving the error.

kosstbarz commented 1 month ago

Thank you @ashishpatel26 for your response. I tried .to("cuda") and got Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3! I get the same error with .to("cuda:0") and .to("cuda:3"), and the same result when I use model.generate(inputs.input_ids.to("cuda:3")). Frankly, I expected that a model loaded with device_map="auto" would put the input tensor on the correct device, because the base Mixtral model (without dola_layers) works fine with the input tensor on the CPU.
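
For reference, a rough sketch of the device checks I ran (same model and tokenizer setup as the reproduction script above; hf_device_map is the attribute set when loading with device_map="auto"):

import transformers

model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Shows which GPU each shard landed on; in a sharded Mixtral the embeddings and
# the lm_head typically end up on different devices.
print(model.hf_device_map)

prompt = "Hey, are you conscious? Can you talk to me?\n\n"
# Even placing the inputs on the device of the first parameters (model.device)
# still fails, because the DoLa-specific tensors live on another device.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
generate_ids = model.generate(inputs.input_ids, max_length=200, dola_layers='low')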

amyeroberts commented 3 weeks ago

Gentle ping @gante

gante commented 3 weeks ago

hehe, a .to was missing in the DoLa body, opening a PR to fix it

(TL;DR: the logits were on the same device as the model inputs, but the auxiliary DoLa tensors were on the same device as the lm head. In multi-GPU settings, those are not the same device.)
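
Roughly, the fix amounts to something like the sketch below (illustrative only, not the actual PR diff; the helper name and signature are made up, and the computation mirrors the F.kl_div call in the traceback): move the premature-layer logits onto the device of the mature (final) layer before contrasting the two distributions.

import torch
import torch.nn.functional as F

def js_divergence_per_layer(premature_logits: torch.Tensor, final_logits: torch.Tensor) -> torch.Tensor:
    # The missing .to(): in a sharded model the final-layer logits follow the inputs'
    # device while the per-layer DoLa tensors sit on the lm head's device, so one side
    # has to be moved before any elementwise op or F.kl_div call.
    premature_logits = premature_logits.to(final_logits.device)

    softmax_mature = F.softmax(final_logits, dim=-1)
    softmax_premature = F.softmax(premature_logits, dim=-1)
    avg_dist = 0.5 * (softmax_mature + softmax_premature)

    # Jensen-Shannon-style divergence between the mature and premature distributions,
    # which DoLa uses to pick the premature layer to contrast against.
    kl1 = F.kl_div(F.log_softmax(final_logits, dim=-1), avg_dist, reduction="none").mean(-1)
    kl2 = F.kl_div(F.log_softmax(premature_logits, dim=-1), avg_dist, reduction="none").mean(-1)
    return 0.5 * (kl1 + kl2)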