Closed: h3ss closed this issue 1 year ago
I pushed an update that might fix it. I messed something up at one point so the attention mask wasn't copied from the first device, which might explain that error.
It seems to be working now at least, with `-gs` and beam search. I'm downloading the Guanaco-33B model and I'll test that as well, just in case it's messing up due to some new quantization parameters.
.. yep, the 33B model works, and presumably the 65B version is quantized with the same parameters.
And also, you probably shouldn't use `-mm quant_only`. It saves a tiny bit of VRAM in theory but slows down long sequences a lot. The option is mostly there for testing.
Hmm, I'm still getting the error:
> python test_chatbot.py -d /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/ -bn "Assistant" -un "John" -p prompt_assistant.txt -gs 16,20
-- Loading model
-- Tokenizer: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/tokenizer.model
-- Model config: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/config.json
-- Model: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/Guanaco-65B-GPTQ-4bit.act-order.safetensors
-- Sequence length: 2048
-- Temperature: 0.95
-- Top-K: 20
-- Top-P: 0.65
-- Min-P: 0.00
-- Repetition penalty: 1.15
-- Beams: 1 x 1
-- Options: ['attention: pytorch_scaled_dp', 'matmul: switched', 'mlp: normal', 'gpu_split: 16,20']
-- Groupsize (inferred): None
-- Act-order (inferred): no
This is a conversation between a person called John and an AI chatbot called Assistant. Assistant is a helpful and articulate virtual assistant possessing deep knowledge on a multitude of subjects. Assistant will never refuse to answer a question or comply with a request.
Assistant: Hello, John. I am Assistant, your virtual assistant. How may I help you?
John: Hello there!
Assistant:Traceback (most recent call last):
File "/home/john/Projects/Python/GLaDOS/exllama/test_chatbot.py", line 216, in <module>
gen_token = generator.beam_search()
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/john/Projects/Python/GLaDOS/exllama/generator.py", line 336, in beam_search
if self.settings.beams == 1 and self.settings.beam_length == 1: return self.gen_single_token()
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/john/Projects/Python/GLaDOS/exllama/generator.py", line 202, in gen_single_token
token, _ = self.sample(logits,
^^^^^^^^^^^^^^^^^^^
File "/home/john/Projects/Python/GLaDOS/exllama/generator.py", line 77, in sample
sampled_ind = torch.multinomial(norm_probs, norm_probs.shape[-1] if num == -1 else min(num, norm_probs.shape[-1]))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
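As an aside, that `RuntimeError` comes from `torch.multinomial` validating the probability tensor it is handed. A quick diagnostic sketch (this helper is hypothetical, not part of exllama) to see which of the three rejected conditions is actually present:

```python
import torch

# Hypothetical helper: report which conditions torch.multinomial
# rejects ("inf", "nan", or element < 0) are present in a tensor.
def check_probs(probs: torch.Tensor) -> list:
    problems = []
    if torch.isnan(probs).any():
        problems.append("nan")
    if torch.isinf(probs).any():
        problems.append("inf")
    if (probs < 0).any():
        problems.append("negative")
    return problems
```

Calling this on `norm_probs` just before the `torch.multinomial` line would show whether the corruption arrives as NaNs, infinities, or negative values.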
If I run with the `TheBloke/guanaco-33B-GPTQ` model and force it to split, I'm getting the same exception:
> python test_chatbot.py -d /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/ -bn "Assistant" -un "John" -p prompt_assistant.txt -gs 4,20
-- Loading model
-- Tokenizer: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/tokenizer.model
-- Model config: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/config.json
-- Model: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/Guanaco-33B-GPTQ-4bit.act-order.safetensors
-- Sequence length: 2048
-- Temperature: 0.95
-- Top-K: 20
-- Top-P: 0.65
-- Min-P: 0.00
-- Repetition penalty: 1.15
-- Beams: 1 x 1
-- Options: ['attention: pytorch_scaled_dp', 'matmul: switched', 'mlp: normal', 'gpu_split: 4,20']
-- Groupsize (inferred): None
-- Act-order (inferred): no
This is a conversation between a person called John and an AI chatbot called Assistant. Assistant is a helpful and articulate virtual assistant possessing deep knowledge on a multitude of subjects. Assistant will never refuse to answer a question or comply with a request.
Assistant: Hello, John. I am Assistant, your virtual assistant. How may I help you?
John: Hello there!
Assistant:Traceback (most recent call last):
File "/home/john/Projects/Python/GLaDOS/exllama/test_chatbot.py", line 216, in <module>
gen_token = generator.beam_search()
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/john/Projects/Python/GLaDOS/exllama/generator.py", line 336, in beam_search
if self.settings.beams == 1 and self.settings.beam_length == 1: return self.gen_single_token()
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/john/Projects/Python/GLaDOS/exllama/generator.py", line 202, in gen_single_token
token, _ = self.sample(logits,
^^^^^^^^^^^^^^^^^^^
File "/home/john/Projects/Python/GLaDOS/exllama/generator.py", line 77, in sample
sampled_ind = torch.multinomial(norm_probs, norm_probs.shape[-1] if num == -1 else min(num, norm_probs.shape[-1]))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
I am running the latest commit:
> git show -s
commit 3dc7cd9d6ab637bbe61004776873959de6f76b1d (HEAD -> master, origin/master, origin/HEAD)
Author: emosuka <flemming@optur.net>
Date: Fri May 26 03:09:24 2023 +0200
Some tidying up, a few fixes and improvements
It's odd. Could you try this?
python test_benchmark_inference.py -d /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/ -v -ppl -gs 4,20
Also, what GPUs are you using?
I get the same error eventually, after some `nan` results for perplexity evaluation:
python test_benchmark_inference.py -d /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/ -v -ppl -gs 4,20
-- Loading model
-- Tokenizer: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/tokenizer.model
-- Model config: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/config.json
-- Model: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/Guanaco-33B-GPTQ-4bit.act-order.safetensors
-- Sequence length: 2048
-- Options: ['attention: pytorch_scaled_dp', 'matmul: switched', 'mlp: normal', 'perplexity', 'validate', 'gpu_split: 4,20']
** Time, Load model: 3.83 seconds
-- Groupsize (inferred): None
-- Act-order (inferred): no
** VRAM, Model: [cuda:0] 4,142.07 MB - [cuda:1] 11,797.71 MB
-- Loading dataset...
-- Testing..........
** Perplexity: nan
-- Testing.
** Perplexity (switched): nan
-- Testing.
** Perplexity (quant_only): nan
Traceback (most recent call last):
File "/home/john/Projects/Python/GLaDOS/exllama/test_benchmark_inference.py", line 301, in <module>
text = generator.generate_simple("To be or not to be, that is the", max_new_tokens = 20)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/john/Projects/Python/GLaDOS/exllama/generator.py", line 179, in generate_simple
token = self.gen_single_token()
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/john/Projects/Python/GLaDOS/exllama/generator.py", line 202, in gen_single_token
token, _ = self.sample(logits,
^^^^^^^^^^^^^^^^^^^
File "/home/john/Projects/Python/GLaDOS/exllama/generator.py", line 77, in sample
sampled_ind = torch.multinomial(norm_probs, norm_probs.shape[-1] if num == -1 else min(num, norm_probs.shape[-1]))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
My GPUs are both RTX 4090:
nvidia-smi
Thu May 25 19:11:14 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03 Driver Version: 530.41.03 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 Off| 00000000:01:00.0 On | Off |
| 30% 31C P5 59W / 450W| 2567MiB / 24564MiB | 37% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 Off| 00000000:16:00.0 Off | Off |
| 30% 29C P8 30W / 450W| 6MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1197 G /usr/lib/Xorg 1074MiB |
| 0 N/A N/A 1347 G /usr/bin/kwin_x11 225MiB |
| 0 N/A N/A 1388 G /usr/bin/plasmashell 83MiB |
| 0 N/A N/A 1992 G /usr/bin/akonadi_archivemail_agent 7MiB |
| 0 N/A N/A 2010 G /usr/bin/akonadi_mailfilter_agent 7MiB |
| 0 N/A N/A 2015 G /usr/bin/akonadi_sendlater_agent 7MiB |
| 0 N/A N/A 2016 G /usr/bin/akonadi_unifiedmailbox_agent 93MiB |
| 0 N/A N/A 2460 G kitty 39MiB |
| 0 N/A N/A 5185 G ...ures=SpareRendererForSitePerProcess 49MiB |
| 0 N/A N/A 5378 G ...ures=SpareRendererForSitePerProcess 36MiB |
| 0 N/A N/A 5881 G /usr/bin/alacritty 15MiB |
| 0 N/A N/A 5964 G ...sion,SpareRendererForSitePerProcess 23MiB |
| 0 N/A N/A 15738 G ...90172922,6867435081580918208,262144 26MiB |
| 0 N/A N/A 61652 G /usr/lib/firefox/firefox 663MiB |
| 0 N/A N/A 265605 G /usr/bin/alacritty 15MiB |
| 1 N/A N/A 1197 G /usr/lib/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
For comparison, if the model is not split, test completes successfully, with normal perplexity scores.
python test_benchmark_inference.py -d /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/ -v -ppl -gs 0,20
-- Loading model
-- Tokenizer: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/tokenizer.model
-- Model config: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/config.json
-- Model: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/Guanaco-33B-GPTQ-4bit.act-order.safetensors
-- Sequence length: 2048
-- Options: ['attention: pytorch_scaled_dp', 'matmul: switched', 'mlp: normal', 'perplexity', 'validate', 'gpu_split: 0,20']
** Time, Load model: 3.35 seconds
-- Groupsize (inferred): None
-- Act-order (inferred): no
** VRAM, Model: [cuda:0] 0.00 MB - [cuda:1] 15,938.78 MB
-- Loading dataset...
-- Testing..........
** Perplexity: 4.9239
-- Testing.
** Perplexity (switched): 4.1771
-- Testing.
** Perplexity (quant_only): 4.1647
** Generation: To be or not to be, that is the question.\nThe answer is: it depends on what you want to do with your life. If
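A side note on why the split runs report `nan` rather than merely bad scores: perplexity is the exponential of the mean token negative log-likelihood, so a single NaN logit anywhere in the evaluation poisons the mean and the final number. A simplified sketch (assumed, not exllama's actual evaluator):

```python
import math
import torch
import torch.nn.functional as F

# Simplified perplexity: exp of the mean per-token NLL.
# One NaN logit anywhere propagates through the mean to the result.
def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    nll = F.cross_entropy(logits, targets)  # mean NLL in nats
    return math.exp(nll.item())
```

With uniform logits over a 10-token vocabulary this gives a perplexity of 10, while corrupting a single logit to NaN makes the whole score NaN.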
Huh... this is very odd indeed. It seems like it can move the state from GPU to GPU (since that last example is running on cuda:1). And either GPU works, I take it? I.e., if you run it with `-gs 20,0`, do you get the same result as `-gs 0,20`?
If that works, then it has to come down to something particular that happens when the state is transferred between decoder blocks. Possibly a synchronization issue with PyTorch. Which version are you using? Also, could you try `nvcc --version` for good measure?
I'll try a RunPod instance in a bit with two 4090s to see if it's maybe a timing issue that's masked by one of my GPUs being slower than the other.
Earlier I had been using PyTorch 2.0.1, but I just switched to the 2.1 nightly and I'm still getting the same error:
pip list | grep orch
pytorch-triton 2.1.0+7d1a95b046
torch 2.1.0.dev20230525+cu118
torchaudio 2.1.0.dev20230525+cu118
torchvision 0.16.0.dev20230525+cu118
python test_benchmark_inference.py -d /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/ -v -ppl -gs 4,20
-- Loading model
-- Tokenizer: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/tokenizer.model
-- Model config: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/config.json
-- Model: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/Guanaco-33B-GPTQ-4bit.act-order.safetensors
-- Sequence length: 2048
-- Options: ['attention: pytorch_scaled_dp', 'matmul: switched', 'mlp: normal', 'perplexity', 'validate', 'gpu_split: 4,20']
** Time, Load model: 3.87 seconds
-- Groupsize (inferred): None
-- Act-order (inferred): no
** VRAM, Model: [cuda:0] 4,142.07 MB - [cuda:1] 11,797.71 MB
-- Loading dataset...
-- Testing..........
** Perplexity: nan
-- Testing.
** Perplexity (switched): nan
-- Testing.
** Perplexity (quant_only): nan
Traceback (most recent call last):
File "/home/john/Projects/Python/GLaDOS/exllama/test_benchmark_inference.py", line 301, in <module>
text = generator.generate_simple("To be or not to be, that is the", max_new_tokens = 20)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/john/Projects/Python/GLaDOS/exllama/generator.py", line 179, in generate_simple
token = self.gen_single_token()
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/john/Projects/Python/GLaDOS/exllama/generator.py", line 202, in gen_single_token
token, _ = self.sample(logits,
^^^^^^^^^^^^^^^^^^^
File "/home/john/Projects/Python/GLaDOS/exllama/generator.py", line 77, in sample
sampled_ind = torch.multinomial(norm_probs, norm_probs.shape[-1] if num == -1 else min(num, norm_probs.shape[-1]))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
If I do `-gs 20,0` (or just don't specify `-gs` at all), it also works:
python test_benchmark_inference.py -d /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/ -v -ppl -gs 20,0
-- Loading model
-- Tokenizer: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/tokenizer.model
-- Model config: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/config.json
-- Model: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/Guanaco-33B-GPTQ-4bit.act-order.safetensors
-- Sequence length: 2048
-- Options: ['attention: pytorch_scaled_dp', 'matmul: switched', 'mlp: normal', 'perplexity', 'validate', 'gpu_split: 20,0']
** Time, Load model: 2.48 seconds
-- Groupsize (inferred): None
-- Act-order (inferred): no
** VRAM, Model: [cuda:0] 15,936.28 MB - [cuda:1] 0.00 MB
-- Loading dataset...
-- Testing..........
** Perplexity: 4.9241
-- Testing.
** Perplexity (switched): 4.1782
-- Testing.
** Perplexity (quant_only): 4.1643
** Generation: To be or not to be, that is the question.\nThe answer is: it depends on what you want to do with your life. If
Thank you very much for going the extra mile to repro on RunPod! In case it helps, here's a bunch more info on my system:
inxi -GCb (glados) master
System:
Host: nous Kernel: 6.3.4-arch1-1 arch: x86_64 bits: 64 Console: pty pts/3 Distro: Arch Linux
Machine:
Type: Desktop System: Micro-Star product: MS-7E12 v: 1.0 serial: <superuser required>
Mobo: Micro-Star model: MAG X670E TOMAHAWK WIFI (MS-7E12) v: 1.0 serial: <superuser required>
UEFI: American Megatrends LLC. v: 1.37 date: 05/15/2023
Battery:
ID-1: hidpp_battery_0 charge: 100% condition: N/A
CPU:
Info: 16-core model: AMD Ryzen 9 7950X bits: 64 type: MT MCP cache: L2: 16 MiB
Speed (MHz): avg: 3126 min/max: 3000/5880 cores: 1: 4494 2: 3000 3: 2795 4: 3000 5: 2855
6: 2879 7: 3000 8: 3000 9: 3000 10: 3000 11: 3000 12: 3000 13: 3000 14: 3000 15: 3000 16: 3000
17: 3599 18: 3000 19: 2851 20: 2852 21: 3000 22: 3000 23: 3000 24: 4500 25: 3000 26: 4347
27: 2881 28: 3000 29: 3000 30: 2999 31: 3000 32: 3000
Graphics:
Device-1: NVIDIA AD102 [GeForce RTX 4090] driver: nvidia v: 530.41.03
Device-2: NVIDIA AD102 [GeForce RTX 4090] driver: nvidia v: 530.41.03
Device-3: AMD Raphael driver: amdgpu v: kernel
Device-4: Logitech Webcam C930e driver: snd-usb-audio,uvcvideo type: USB
Display: x11 server: X.org v: 1.21.1.8 with: Xwayland v: 23.1.1 driver: X: loaded: nvidia
gpu: nvidia,nvidia-nvswitch tty: 213x42 resolution: 1: 3840x2160 2: 2560x2880 3: 1920x1080
API: OpenGL Message: GL data unavailable in console. Try -G --display
Network:
Device-1: Realtek RTL8125 2.5GbE driver: r8169
Device-2: MEDIATEK MT7922 802.11ax PCI Express Wireless Network Adapter driver: mt7921e
Drives:
Local Storage: total: 5.46 TiB used: 2.17 TiB (39.7%)
Info:
Processes: 613 Uptime: 8h 14m Memory: available: 187.86 GiB used: 8.94 GiB (4.8%) Init: systemd
Shell: fish inxi: 3.3.27
Oh, and here's the `nvcc --version` output:
nvcc --version (glados) master
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
RunPod doesn't seem to provide the latest drivers for their 4090 servers, which means no compute_89, so I can't precisely replicate your setup. They won't let you update drivers, as far as I know (?)... It does compile for compute_86 though (with cu117), and this is running fine. It's a bit slow for the smaller models, probably because of the CPU bottleneck (1500 MHz server cores), but the performance is not that bad on 65B:
# python test_benchmark_inference.py -d test2 -p -gs 16,16
-- Loading model
-- Tokenizer: test2/tokenizer.model
-- Model config: test2/config.json
-- Model: test2/Guanaco-65B-GPTQ-4bit.act-order.safetensors
-- Sequence length: 2048
-- Options: ['attention: pytorch_scaled_dp', 'matmul: switched', 'mlp: normal', 'perf', 'gpu_split: 16,16']
** Time, Load model: 8.59 seconds
-- Groupsize (inferred): None
-- Act-order (inferred): no
** VRAM, Model: [cuda:0] 16,222.84 MB - [cuda:1] 15,180.43 MB
-- Inference, first pass.
** Time, Inference: 3.23 seconds
** Speed: 594.40 tokens/second
-- Generating 128 tokens, 1920 token prompt...
** Speed: 19.26 tokens/second
-- Generating 128 tokens, 4 token prompt...
** Speed: 19.55 tokens/second
** VRAM, Inference: [cuda:0] 3,723.17 MB - [cuda:1] 3,464.65 MB
** VRAM, Total: [cuda:0] 19,946.01 MB - [cuda:1] 18,645.08 MB
# python test_chatbot.py -d test2 -gs 16,16 -nnl
-- Loading model
-- Tokenizer: test2/tokenizer.model
-- Model config: test2/config.json
-- Model: test2/Guanaco-65B-GPTQ-4bit.act-order.safetensors
-- Sequence length: 2048
-- Temperature: 0.95
-- Top-K: 20
-- Top-P: 0.65
-- Min-P: 0.00
-- Repetition penalty: 1.15
-- Beams: 1 x 1
-- Options: ['attention: pytorch_scaled_dp', 'matmul: switched', 'mlp: normal', 'no_newline', 'gpu_split: 16,16']
-- Groupsize (inferred): None
-- Act-order (inferred): no
Chatbort: Hello, User
User: Hello Chatbort
Chatbort: How can I help you?
User: Tell me a joke that isn't about a tomato.
Chatbort: What is the most confused day of the year? Answer: April 1st because it is April Fools Day and it falls on different days every year! This one isn't about a tomato. Is there anything else I can assist with?
It works with at least a couple of different versions of Torch and CUDA, though I still can't replicate your setup because of the driver version. 525 only supports up to 12.0 as far as I can tell.
But I'm wondering if it could have something to do with the AMD stuff in your device list. Maybe it's confusing Torch? You could try setting `CUDA_VISIBLE_DEVICES` in your env.
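One caveat with `CUDA_VISIBLE_DEVICES`: it only takes effect if it's set before the process makes its first CUDA call, so export it in the shell or set it at the very top of the script. A minimal sketch:

```python
import os

# Restrict which NVIDIA GPUs CUDA enumerates. This must happen before
# the first CUDA call in the process (e.g. before any torch.cuda use);
# exporting it in the shell is the safest option.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
```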
I think later I'll add a debug mode that dumps some of the intermediate state; I'd need to figure out exactly where the hidden state is getting corrupted.
I added the debug mode. If you can try it with `-dbg` along with `-gs` and show me the output, it might help at least figure out where it's losing the plot.
Thanks! Here you go:
> CUDA_VISIBLE_DEVICES=0,1 python test_benchmark_inference.py -d /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/ -v -ppl -dbg -gs 4,20 &> ~/Desktop/exllama_dbg.txt
Okay, so the hidden states are just disappearing when jumping from GPU to GPU? That's super weird. Could you try printing out the hidden state before and after the move? In model.py on line 1103, replace this:
    next_device = self.config.device_map.layers[i]
    if device != next_device:
        if self.config.debug: print(f" !! Moving hidden states from {device} to {next_device}")
        hidden_states = hidden_states.to(next_device)
        device = next_device
with this:
    next_device = self.config.device_map.layers[i]
    if device != next_device:
        if self.config.debug: print(f" !! Moving hidden states from {device} to {next_device}")
        print("-------")
        print(hidden_states)
        print("-------")
        hidden_states = hidden_states.to(next_device)
        print(hidden_states)
        print("-------")
        device = next_device
If this is where the contents disappear I really don't know what to think...
So it would seem... I just made that patch to model.py, and here is the result:
And the relevant section:
!! Moving hidden states from cuda:0 to cuda:1
-------
tensor([[[ 0.0593261719, -0.7446289062, 0.7128906250, ...,
0.6777343750, 0.3395996094, 0.5122070312],
[ 0.0980224609, -0.7451171875, 0.6752929688, ...,
0.6308593750, 0.3728027344, 0.4899902344],
[ 0.1042480469, -0.7041015625, 0.6972656250, ...,
0.6777343750, 0.3779296875, 0.5922851562],
...,
[-0.5024414062, 0.1296386719, -0.2453613281, ...,
0.4074707031, 0.2282714844, -0.1484375000],
[-0.1105957031, -0.2471923828, -0.4714355469, ...,
-0.6914062500, 0.2412109375, -0.0307006836],
[ 0.3864746094, -0.3850097656, -0.7495117188, ...,
-0.5483398438, -0.1068115234, 0.2222900391]]], device='cuda:0',
dtype=torch.float16)
-------
tensor([[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]], device='cuda:1',
dtype=torch.float16)
-------
I'm going to go see if IOMMU is enabled and disable it if so...
Moving a tensor across CUDA devices gets zero tensor, CUDA 11.0 #87363
Yes, that sounds like the same issue. Since it only seems to affect transfers between GPUs, you could probably work around it by copying via system RAM like this, instead of having to disable IOMMU:

    hidden_states = hidden_states.to("cpu")
    hidden_states = hidden_states.to(next_device)
There would be a (very small) performance cost, but I could add it as a fallback at least, if it works.
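Wrapped as a helper, that CPU-bounce fallback might look like the following minimal sketch (the function name and placement are assumed, not exllama's actual code):

```python
import torch

def move_via_cpu(t: torch.Tensor, device) -> torch.Tensor:
    """Device-to-device move that bounces through system RAM,
    sidestepping the (broken) peer-to-peer copy path entirely."""
    if t.device == torch.device(device):
        return t
    return t.to("cpu").to(device)
```

The extra hop adds one host round trip per layer boundary, which is tiny relative to the per-token compute for models of this size.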
So... I tried disabling IOMMU, and that didn't seem to have any effect (it was set to "auto" in the BIOS, and I did not have it enabled in the kernel command line, so ¯\_(ツ)_/¯). But passing the hidden_states to the CPU first did seem to fix the specific issue of the hidden state getting zeroed out:
!! Moving hidden states from cuda:0 to cuda:1
------- BEFORE -------
tensor([[[-0.9682617188, 0.4150390625, 0.7978515625, ...,
-0.2631835938, 1.1533203125, -0.0405273438],
[-0.4916992188, 0.7578125000, 0.1789550781, ...,
-0.7290039062, 1.1640625000, 0.0568237305],
[-0.2653808594, 0.4392089844, 0.1284179688, ...,
-0.0774536133, 0.0227050781, -0.1219482422],
...,
[-0.2225341797, 0.6162109375, -0.1560058594, ...,
0.1889648438, 0.6196289062, -0.4868164062],
[-0.1208496094, 0.2434082031, -0.4838867188, ...,
-0.1448974609, 0.0299072266, -0.2609863281],
[ 0.2905273438, 0.2878417969, -0.3974609375, ...,
-0.2851562500, 0.0317382812, 0.8500976562]]], device='cuda:0',
dtype=torch.float16)
------- AFTER --------
tensor([[[-0.9682617188, 0.4150390625, 0.7978515625, ...,
-0.2631835938, 1.1533203125, -0.0405273438],
[-0.4916992188, 0.7578125000, 0.1789550781, ...,
-0.7290039062, 1.1640625000, 0.0568237305],
[-0.2653808594, 0.4392089844, 0.1284179688, ...,
-0.0774536133, 0.0227050781, -0.1219482422],
...,
[-0.2225341797, 0.6162109375, -0.1560058594, ...,
0.1889648438, 0.6196289062, -0.4868164062],
[-0.1208496094, 0.2434082031, -0.4838867188, ...,
-0.1448974609, 0.0299072266, -0.2609863281],
[ 0.2905273438, 0.2878417969, -0.3974609375, ...,
-0.2851562500, 0.0317382812, 0.8500976562]]], device='cuda:1',
dtype=torch.float16)
----------------------
After doing a bit of research, it looks like the 40xx series explicitly does not support "P2P". I tried using `torch.cuda.can_device_access_peer` to see if I could detect the situation, but alas, it returns True, lol. So I think it would need to be implemented as a command-line argument until NVIDIA sorts the situation out.
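Since `can_device_access_peer` can't be trusted here, one alternative would be to probe empirically at load time: perform a small copy and compare the result against a reference built on the destination. A sketch (helper name assumed, not existing exllama code):

```python
import torch

def direct_copy_ok(src: str, dst: str) -> bool:
    """Empirically verify that tensor contents survive a .to() between
    two devices; returns False if the copy arrives zeroed or garbled."""
    ref = torch.arange(16, dtype=torch.float16)
    moved = ref.to(src).to(dst)
    return torch.equal(moved.cpu(), ref)
```

On an affected system, `direct_copy_ok("cuda:0", "cuda:1")` would presumably return False, which could gate the CPU-bounce fallback automatically instead of requiring a flag.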
However, there does still seem to be an issue with GPU splitting. The LLM gives seemingly incoherent results, still with huge or `nan` perplexity scores:
> python test_benchmark_inference.py -d /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/ -v -ppl -dbg -gs 4,20 > ~/Desktop/exllama_dbg_3.txt
...
** Time, Inference: 1.59 seconds
-- Loading dataset...
-- Testing..........
** Perplexity: nan
-- Testing.
** Perplexity (switched): 487.1518
-- Testing.
** Perplexity (quant_only): 229.1357
** Generation: To be or not to be, that is the\n %% ### ### ### ### ###-1 %% ###- . %%-\n <! %%-
Attaching a full debug output: exllama_dbg_3.txt
Note: I get the expected perplexity results and coherent output if `-gs` doesn't split between GPUs.
There is one other place where it moves data from GPU to GPU, but it's a little more subtle: the position embeddings, which would end up being all zeros on one GPU if the issue is that it just can't move data across that way. That would explain the output being garbage; in fact, it fits nicely with a perplexity in the hundreds rather than `nan`.
It is weird that it works between my 4090 and 3070-Ti, and I also tested it on two 4090s on RunPod, so there must be something else in your setup causing it, maybe not IOMMU itself but something related to it. Some kernel parameter or something?
Anyway, I pushed a new update with an extra option to force all the transfers (hopefully) to go via system RAM. I can't actually measure any difference in performance, so maybe I'll just make it the default, but for now you can try running with `-gpfix`.
Fantastic! That did the trick :) Thank you!
When attempting to split the model on multiple GPUs, I get the following error:
This only happens if the model is split between GPUs using the `-gs` option.