Open HungYiChen opened 3 months ago
And I can run the same model on my RTX 2060 + 16 GB RAM PC with this code:
model = AutoModelForCausalLM.from_pretrained(
    "MediaTek-Research/Breeze-7B-Instruct-v1_0",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    offload_buffers=True,
)
...
input_tensors["input_ids"].to("cuda")
...
I tried changing FP32 to FP16 to run the above code with torch_directml, but there is still not enough memory.
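For reference, a rough back-of-the-envelope check shows why switching to FP16 alone is not enough. The parameter count below is an assumption (~7 billion, taken from the model's name; the exact figure on the model card is slightly higher), and the calculation covers only the weights, not activations or the KV cache:

```python
# Rough weight-memory footprint for a ~7B-parameter model.
# Assumption: parameter count approximated as 7e9 (Breeze-7B's exact
# count, per its model card, is somewhat higher).
N_PARAMS = 7_000_000_000

def weights_gib(n_params: int, bytes_per_param: int) -> float:
    """GiB needed just for the weights (no activations, no KV cache)."""
    return n_params * bytes_per_param / 1024**3

for name, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    print(f"{name}: {weights_gib(N_PARAMS, nbytes):.1f} GiB")
```

Even in FP16 the weights alone are around 13 GiB, so they cannot fit in a typical consumer GPU's VRAM. The RTX 2060 run above only succeeds because `device_map="auto"` with `offload_buffers=True` lets accelerate keep most layers in the 16 GB of system RAM; casting to FP16 on DirectML without any offloading still tries to place everything on the GPU.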
@HungYiChen Thanks for reporting this issue. It appears to be related to not being able to allocate tensors larger than 4 GB, which we found was hidden by another error. We're currently looking into a fix for this.
I followed "Enable PyTorch with DirectML on Windows" and can use my AMD GPU to run simple calculations:
tensor([2., 4., 6.], device='privateuseone:0')
But it fails when loading the Breeze-7B LLM. Did I go wrong somewhere?
PS C:\Users\hung> & C:/Users/hung/miniconda3/envs/pydml/python.exe c:/Users/hung/Desktop/Breeze-7B.py
Loading checkpoint shards: 100%|███████████████████████| 4/4 [00:22<00:00, 5.72s/it]
Traceback (most recent call last):
  File "c:\Users\hung\Desktop\Breeze-7B.py", line 11, in <module>
    model.to(dml)
  File "C:\Users\hung\miniconda3\envs\pydml\lib\site-packages\transformers\modeling_utils.py", line 2576, in to
    return super().to(*args, **kwargs)
  File "C:\Users\hung\miniconda3\envs\pydml\lib\site-packages\torch\nn\modules\module.py", line 1145, in to
    return self._apply(convert)
  File "C:\Users\hung\miniconda3\envs\pydml\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  File "C:\Users\hung\miniconda3\envs\pydml\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  File "C:\Users\hung\miniconda3\envs\pydml\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "C:\Users\hung\miniconda3\envs\pydml\lib\site-packages\torch\nn\modules\module.py", line 820, in _apply
    param_applied = fn(param)
  File "C:\Users\hung\miniconda3\envs\pydml\lib\site-packages\torch\nn\modules\module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: Could not allocate tensor with 234881024 bytes. There is not enough GPU video memory available!
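A side note on the error message, sketched as plain arithmetic (no DirectML needed): the allocation that fails is only about 224 MiB, so it is not itself oversized. Rather, `model.to(dml)` moves every FP32 parameter onto the device one by one, so on the order of 26 GiB of weights have already been requested by the time this tensor is refused. The 4096 hidden width below is an assumption based on Mistral-style 7B architectures, not taken from the model card:

```python
# Size of the single tensor that failed to allocate in the traceback above.
failed_bytes = 234_881_024
print(f"failing tensor: {failed_bytes / 1024**2:.0f} MiB")  # → 224 MiB

# At 4 bytes per FP32 value, this is consistent with one 4096 x 14336
# weight matrix (assumption: a Mistral-style 4096-wide hidden state).
elements = failed_bytes // 4
print(f"elements: {elements}, rows of width 4096: {elements // 4096}")
```

Loading with `torch_dtype=torch.float16` halves each of these tensors, but the total is still roughly 13 GiB, which is why some form of CPU offloading (as in the RTX 2060 snippet above) is needed on GPUs with less VRAM than that.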