Hi @kevalshah90
Thanks for the issue! You might be hitting a corner case here. Can you try to force-set the model on a single GPU with `device_map="cuda:0"` in `from_pretrained`?
@younesbelkada I am still getting the same error:

```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
```

I added `device_map`:
```python
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="cuda:0"
)
model
```
For reference, here is my `bnb_config`:
```python
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_storage=torch.bfloat16
)
bnb_config
```
@younesbelkada I tried downgrading `accelerate` from 0.30.0 to 0.29.0 and am still getting the same error.

Any ideas / suggestions on debugging this issue? I am not sure why it would split across GPU and CPU.
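As a quick diagnostic, one can list which devices the parameters actually ended up on and which modules, if any, are off-GPU. This is a minimal sketch, assuming the `model` object loaded in the snippet above:

```python
# Minimal diagnostic sketch (assumes `model` from the snippet above).
from collections import Counter

# Count parameters per device type; anything other than {'cuda': N} indicates a split.
print(Counter(p.device.type for p in model.parameters()))

# If accelerate dispatched the model, this maps each module to its device.
print(getattr(model, "hf_device_map", None))

# List any parameters that are not on the GPU.
for name, p in model.named_parameters():
    if p.device.type != "cuda":
        print("off-GPU:", name, p.device)
```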
@kevalshah90 can you double check the inputs are on GPU? Maybe doing `inputs = tokenizer("Do you have time", return_tensors="pt").to(0)` would help.
@younesbelkada Yes, I tried `.to(0)` from the demo notebook here. I still get the same error. Before passing the input to the model, I am also printing the input and model devices:
```python
peft_model.device
# device(type='cuda', index=0)

# Test
inputs = tokenizer("Do you have time", return_tensors="pt").input_ids.to(0)
print("Inputs:", inputs)
# Inputs: tensor([[ 1, 2378, 368, 506, 727]], device='cuda:0')

with torch.no_grad():
    outputs = peft_model(inputs)

print("Outputs:\n", outputs.logits)
print("Outputs dimensions:", outputs.logits.shape)
```
Error:

```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
```
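Since the failing op is a matrix multiply with `mat2` on the CPU, one possible culprit is that some weight matrix (for example a freshly added LoRA adapter layer) never made it onto the GPU. A minimal check, reusing the `peft_model` name from above, might look like:

```python
# Sketch: list any peft_model parameters still on the CPU,
# flagging LoRA adapter weights separately (names reused from the thread).
for name, param in peft_model.named_parameters():
    if param.device.type != "cuda":
        kind = "LoRA" if "lora" in name else "base"
        print(f"off-GPU ({kind}): {name} -> {param.device}")
```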
The `accelerate` library seems to be splitting the model between the GPU and CPU under the hood.
https://huggingface.co/docs/accelerate/concept_guides/big_model_inference#loading-weights
I am thinking maybe there is something in here:
- [infer_auto_device_map()](https://huggingface.co/docs/accelerate/v0.30.1/en/package_reference/big_modeling#accelerate.infer_auto_device_map) (or device_map="auto" in [load_checkpoint_and_dispatch()](https://huggingface.co/docs/accelerate/v0.30.1/en/package_reference/big_modeling#accelerate.load_checkpoint_and_dispatch)) tries to maximize GPU and CPU RAM it sees available when you execute it. While PyTorch is very good at managing GPU RAM efficiently (and giving it back when not needed), it's not entirely true with Python and CPU RAM. Therefore, an automatically computed device map might be too intense on the CPU. Move a few modules to the disk device if you get crashes due to a lack of RAM.
- [infer_auto_device_map()](https://huggingface.co/docs/accelerate/v0.30.1/en/package_reference/big_modeling#accelerate.infer_auto_device_map) (or device_map="auto" in [load_checkpoint_and_dispatch()](https://huggingface.co/docs/accelerate/v0.30.1/en/package_reference/big_modeling#accelerate.load_checkpoint_and_dispatch)) attributes devices sequentially (to avoid moving things back and forth), so if your first layer is bigger than the size of the GPU you have, it will end up with everything on the CPU/Disk.
- [load_checkpoint_and_dispatch()](https://huggingface.co/docs/accelerate/v0.30.1/en/package_reference/big_modeling#accelerate.load_checkpoint_and_dispatch) and [load_checkpoint_in_model()](https://huggingface.co/docs/accelerate/v0.30.1/en/package_reference/big_modeling#accelerate.load_checkpoint_in_model) do not perform any check on the correctness of your state dict compared to your model at the moment (this will be fixed in a future version), so you may get some weird errors if trying to load a checkpoint with mismatched or missing keys.
- The model parallelism used when your model is split on several GPUs is naive and not optimized, meaning that only one GPU works at a given time and the other sits idle.
- When weights are offloaded on the CPU/hard drive, there is no pre-fetching (yet, we will work on this for future versions), which means the weights are put on the GPU when they are needed and not before.
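Given those caveats, one way to rule out any automatic CPU offload is to pin every module explicitly to GPU 0 instead of relying on an inferred map. A minimal sketch under that assumption, reusing the model name and `bnb_config` values from earlier in the thread:

```python
# Sketch: pin the whole model to cuda:0 so accelerate cannot offload anything.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_storage=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map={"": 0},   # '' covers the whole model -> GPU 0, no CPU/disk offload
)
print(model.hf_device_map)   # expect {'': 0}
```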
@younesbelkada For context, we are on an AWS ml.g4dn.xlarge instance, which has a single NVIDIA T4 GPU with 16 GiB of memory.
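On a 16 GiB card it is worth confirming how much memory the quantized model actually occupies, since a near-full GPU is exactly the situation in which an automatic device map starts spilling modules to CPU. A small check, a sketch assuming nothing beyond plain PyTorch:

```python
# Sketch: report total T4 memory and current usage.
import torch

props = torch.cuda.get_device_properties(0)
print(props.name, round(props.total_memory / 1024**3, 1), "GiB total")
print("allocated:", round(torch.cuda.memory_allocated(0) / 1024**3, 2), "GiB")
print("reserved: ", round(torch.cuda.memory_reserved(0) / 1024**3, 2), "GiB")
```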
@younesbelkada Here's how it seems to have been fixed; I added the few lines of code below:
```python
# Set the device (replace 'cuda:0' with the appropriate GPU if you have multiple GPUs)
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Set the device for PyTorch
torch.cuda.set_device(device)

peft_model = peft_model.to(device)

peft_model.hf_device_map
# {'': device(type='cuda', index=0)}
```
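For completeness, re-running the earlier test after this move could look like the sketch below; it reuses the names from above, and the exact logits shape depends on the tokenized prompt length and Mistral's vocabulary size:

```python
# Sketch: repeat the earlier forward pass now that peft_model is fully on cuda:0.
inputs = tokenizer("Do you have time", return_tensors="pt").input_ids.to(device)

with torch.no_grad():
    outputs = peft_model(inputs)

print(outputs.logits.shape)   # e.g. torch.Size([1, 5, 32000]) for this prompt and vocab
```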
Thanks for explaining the fix! This might be related too: https://github.com/huggingface/peft/issues/1840
System Info
Single T4 GPU. I am implementing QLoRA for fine-tuning mistral-7b on a T4 GPU. I loaded the model with the quantized configuration; however, when I attempt to test the model I keep getting a runtime error related to device. I ensured that the model and inputs are on cuda.
Who can help?
@sayakpaul @younesbelkada @BenjaminBossan
Information
Tasks
Reproduction
Error:
Expected behavior
Output logits.