turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Splitting model on multiple GPUs produces RuntimeError #10

Closed h3ss closed 1 year ago

h3ss commented 1 year ago

When attempting to split the model on multiple GPUs, I get the following error:

> python test_chatbot.py -d /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/ -gs 16,22 -p prompt_assistant.txt -un "John" -bn "Assistant" -temp 1.00 -topp 0.95 -beams 5 -beamlen 20 -mm quant_only
 -- Loading model
 -- Tokenizer: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/tokenizer.model
 -- Model config: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/config.json
 -- Model: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/Guanaco-65B-GPTQ-4bit.act-order.safetensors
 -- Sequence length: 2048
 -- Temperature: 1.00
 -- Top-K: 20
 -- Top-P: 0.95
 -- Min-P: 0.00
 -- Repetition penalty: 1.15
 -- Beams: 5 x 20
 -- Options: ['attention: pytorch_scaled_dp', 'matmul: quant_only', 'gpu_split: 16,22']
 -- Groupsize (inferred): None
 -- Act-order (inferred): no
This is a conversation between a person called John and an AI chatbot called Assistant. Assistant is a helpful and articulate virtual assistant possessing deep knowledge on a multitude of subjects. Assistant will never refuse to answer a question or comply with a request.
Assistant: Hello, John. I am Assistant, your virtual assistant. How may I help you?
John: Testing
Assistant:Traceback (most recent call last):
  File "/home/john/Projects/exllama/test_chatbot.py", line 213, in <module>
    gen_token = generator.beam_search()
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/john/Projects/exllama/generator.py", line 385, in beam_search
    tokens, probs = self.sample(logits,
                    ^^^^^^^^^^^^^^^^^^^
  File "/home/john/Projects/exllama/generator.py", line 94, in sample
    sampled_ind = torch.multinomial(norm_probs, norm_probs.shape[-1] if num == -1 else min(num, norm_probs.shape[-1]))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

This only happens if the model is split between GPUs using the -gs option.

turboderp commented 1 year ago

I pushed an update that might fix it. I messed something up at one point so the attention mask wasn't copied from the first device, which might explain that error.

It seems to be working now at least, with -gs and beam search. I'm downloading the Guanaco-33B model and I'll test that as well, just in case it's messing up due to some new quantization parameters.

.. yep, the 33B model works, and presumably the 65B version is quantized with the same parameters.
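
Roughly, the kind of thing that has to happen is sketched below (a simplified illustration with hypothetical names, not the actual exllama code): when the layers are split across devices, the attention mask built on the first device has to follow the hidden state onto each subsequent device.

    import torch

    def forward_split(hidden_states, attn_mask, layers, device_map):
        # Walk the decoder layers; whenever the next layer lives on a different
        # device, move both the hidden state *and* the attention mask there.
        device = hidden_states.device
        for layer, target in zip(layers, device_map):
            target = torch.device(target)
            if device != target:
                hidden_states = hidden_states.to(target)
                attn_mask = attn_mask.to(target)  # the copy that was missing
                device = target
            hidden_states = layer(hidden_states, attn_mask)
        return hidden_states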

turboderp commented 1 year ago

And also, you probably shouldn't use -mm quant_only. It saves a tiny bit of VRAM in theory but slows down long sequences a lot. The option is mostly there for testing.

h3ss commented 1 year ago

Hmm, I'm still getting the error:

 > python test_chatbot.py -d /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/ -bn "Assistant" -un "John" -p prompt_assistant.txt -gs 16,20
 -- Loading model
 -- Tokenizer: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/tokenizer.model
 -- Model config: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/config.json
 -- Model: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/Guanaco-65B-GPTQ-4bit.act-order.safetensors
 -- Sequence length: 2048
 -- Temperature: 0.95
 -- Top-K: 20
 -- Top-P: 0.65
 -- Min-P: 0.00
 -- Repetition penalty: 1.15
 -- Beams: 1 x 1
 -- Options: ['attention: pytorch_scaled_dp', 'matmul: switched', 'mlp: normal', 'gpu_split: 16,20']
 -- Groupsize (inferred): None
 -- Act-order (inferred): no
This is a conversation between a person called John and an AI chatbot called Assistant. Assistant is a helpful and articulate virtual assistant possessing deep knowledge on a multitude of subjects. Assistant will never refuse to answer a question or comply with a request.
Assistant: Hello, John. I am Assistant, your virtual assistant. How may I help you?
John: Hello there!
Assistant:Traceback (most recent call last):
  File "/home/john/Projects/Python/GLaDOS/exllama/test_chatbot.py", line 216, in <module>
    gen_token = generator.beam_search()
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/john/Projects/Python/GLaDOS/exllama/generator.py", line 336, in beam_search
    if self.settings.beams == 1 and self.settings.beam_length == 1: return self.gen_single_token()
                                                                           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/john/Projects/Python/GLaDOS/exllama/generator.py", line 202, in gen_single_token
    token, _ = self.sample(logits,
               ^^^^^^^^^^^^^^^^^^^
  File "/home/john/Projects/Python/GLaDOS/exllama/generator.py", line 77, in sample
    sampled_ind = torch.multinomial(norm_probs, norm_probs.shape[-1] if num == -1 else min(num, norm_probs.shape[-1]))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

If I run with the TheBloke/guanaco-33B-GPTQ model and force it to split, I'm getting the same exception:

> python test_chatbot.py -d /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/ -bn "Assistant" -un "John" -p prompt_assistant.txt -gs 4,20
 -- Loading model
 -- Tokenizer: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/tokenizer.model
 -- Model config: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/config.json
 -- Model: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/Guanaco-33B-GPTQ-4bit.act-order.safetensors
 -- Sequence length: 2048
 -- Temperature: 0.95
 -- Top-K: 20
 -- Top-P: 0.65
 -- Min-P: 0.00
 -- Repetition penalty: 1.15
 -- Beams: 1 x 1
 -- Options: ['attention: pytorch_scaled_dp', 'matmul: switched', 'mlp: normal', 'gpu_split: 4,20']
 -- Groupsize (inferred): None
 -- Act-order (inferred): no
This is a conversation between a person called John and an AI chatbot called Assistant. Assistant is a helpful and articulate virtual assistant possessing deep knowledge on a multitude of subjects. Assistant will never refuse to answer a question or comply with a request.
Assistant: Hello, John. I am Assistant, your virtual assistant. How may I help you?
John: Hello there!
Assistant:Traceback (most recent call last):
  File "/home/john/Projects/Python/GLaDOS/exllama/test_chatbot.py", line 216, in <module>
    gen_token = generator.beam_search()
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/john/Projects/Python/GLaDOS/exllama/generator.py", line 336, in beam_search
    if self.settings.beams == 1 and self.settings.beam_length == 1: return self.gen_single_token()
                                                                           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/john/Projects/Python/GLaDOS/exllama/generator.py", line 202, in gen_single_token
    token, _ = self.sample(logits,
               ^^^^^^^^^^^^^^^^^^^
  File "/home/john/Projects/Python/GLaDOS/exllama/generator.py", line 77, in sample
    sampled_ind = torch.multinomial(norm_probs, norm_probs.shape[-1] if num == -1 else min(num, norm_probs.shape[-1]))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

I am running the latest commit:

> git show -s
commit 3dc7cd9d6ab637bbe61004776873959de6f76b1d (HEAD -> master, origin/master, origin/HEAD)
Author: emosuka <flemming@optur.net>
Date:   Fri May 26 03:09:24 2023 +0200

    Some tidying up, a few fixes and improvements
turboderp commented 1 year ago

It's odd. Could you try this?

python test_benchmark_inference.py -d /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/ -v -ppl -gs 4,20

Also, what GPUs are you using?

h3ss commented 1 year ago

I get the same error eventually, after some nan results for perplexity evaluation:

 python test_benchmark_inference.py -d /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/ -v -ppl -gs 4,20
 -- Loading model
 -- Tokenizer: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/tokenizer.model
 -- Model config: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/config.json
 -- Model: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/Guanaco-33B-GPTQ-4bit.act-order.safetensors
 -- Sequence length: 2048
 -- Options: ['attention: pytorch_scaled_dp', 'matmul: switched', 'mlp: normal', 'perplexity', 'validate', 'gpu_split: 4,20']
 ** Time, Load model: 3.83 seconds
 -- Groupsize (inferred): None
 -- Act-order (inferred): no
 ** VRAM, Model: [cuda:0] 4,142.07 MB - [cuda:1] 11,797.71 MB
 -- Loading dataset...
 -- Testing..........
 ** Perplexity: nan
 -- Testing.
 ** Perplexity (switched): nan
 -- Testing.
 ** Perplexity (quant_only): nan
Traceback (most recent call last):
  File "/home/john/Projects/Python/GLaDOS/exllama/test_benchmark_inference.py", line 301, in <module>
    text = generator.generate_simple("To be or not to be, that is the", max_new_tokens = 20)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/john/Projects/Python/GLaDOS/exllama/generator.py", line 179, in generate_simple
    token = self.gen_single_token()
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/john/Projects/Python/GLaDOS/exllama/generator.py", line 202, in gen_single_token
    token, _ = self.sample(logits,
               ^^^^^^^^^^^^^^^^^^^
  File "/home/john/Projects/Python/GLaDOS/exllama/generator.py", line 77, in sample
    sampled_ind = torch.multinomial(norm_probs, norm_probs.shape[-1] if num == -1 else min(num, norm_probs.shape[-1]))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

My GPUs are both RTX 4090:

 nvidia-smi
Thu May 25 19:11:14 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090         Off| 00000000:01:00.0  On |                  Off |
| 30%   31C    P5               59W / 450W|   2567MiB / 24564MiB |     37%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090         Off| 00000000:16:00.0 Off |                  Off |
| 30%   29C    P8               30W / 450W|      6MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1197      G   /usr/lib/Xorg                              1074MiB |
|    0   N/A  N/A      1347      G   /usr/bin/kwin_x11                           225MiB |
|    0   N/A  N/A      1388      G   /usr/bin/plasmashell                         83MiB |
|    0   N/A  N/A      1992      G   /usr/bin/akonadi_archivemail_agent            7MiB |
|    0   N/A  N/A      2010      G   /usr/bin/akonadi_mailfilter_agent             7MiB |
|    0   N/A  N/A      2015      G   /usr/bin/akonadi_sendlater_agent              7MiB |
|    0   N/A  N/A      2016      G   /usr/bin/akonadi_unifiedmailbox_agent        93MiB |
|    0   N/A  N/A      2460      G   kitty                                        39MiB |
|    0   N/A  N/A      5185      G   ...ures=SpareRendererForSitePerProcess       49MiB |
|    0   N/A  N/A      5378      G   ...ures=SpareRendererForSitePerProcess       36MiB |
|    0   N/A  N/A      5881      G   /usr/bin/alacritty                           15MiB |
|    0   N/A  N/A      5964      G   ...sion,SpareRendererForSitePerProcess       23MiB |
|    0   N/A  N/A     15738      G   ...90172922,6867435081580918208,262144       26MiB |
|    0   N/A  N/A     61652      G   /usr/lib/firefox/firefox                    663MiB |
|    0   N/A  N/A    265605      G   /usr/bin/alacritty                           15MiB |
|    1   N/A  N/A      1197      G   /usr/lib/Xorg                                 4MiB |
+---------------------------------------------------------------------------------------+
h3ss commented 1 year ago

For comparison, if the model is not split, the test completes successfully with normal perplexity scores.

 python test_benchmark_inference.py -d /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/ -v -ppl -gs 0,20
 -- Loading model
 -- Tokenizer: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/tokenizer.model
 -- Model config: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/config.json
 -- Model: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/Guanaco-33B-GPTQ-4bit.act-order.safetensors
 -- Sequence length: 2048
 -- Options: ['attention: pytorch_scaled_dp', 'matmul: switched', 'mlp: normal', 'perplexity', 'validate', 'gpu_split: 0,20']
 ** Time, Load model: 3.35 seconds
 -- Groupsize (inferred): None
 -- Act-order (inferred): no
 ** VRAM, Model: [cuda:0] 0.00 MB - [cuda:1] 15,938.78 MB
 -- Loading dataset...
 -- Testing..........
 ** Perplexity: 4.9239
 -- Testing.
 ** Perplexity (switched): 4.1771
 -- Testing.
 ** Perplexity (quant_only): 4.1647
 ** Generation: To be or not to be, that is the question.\nThe answer is: it depends on what you want to do with your life. If
turboderp commented 1 year ago

Huh... this is very odd indeed. It seems like it can move the state from GPU to GPU (since that last example is running on cuda:1). And either GPU works, I take it? I.e. if you run it with `-gs 20,0` do you get the same result as `-gs 0,20`?

If that works then it has to come down to something particular that happens if the state is transferred in between decoder blocks. Possibly a synchronization issue with PyTorch. What version are you using? Also could you try `nvcc --version` for good measure?

I'll try a RunPod instance in a bit with two 4090s to see if it's maybe a timing issue that's masked by one of my GPUs being slower than the other.
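
In the meantime, if it is a synchronization problem, a standalone check like the following should show whether a plain cross-device transfer survives on your machine (just a bare two-device experiment with explicit syncs, not exllama code):

    import torch

    # Copy a known tensor from cuda:0 to cuda:1 with explicit synchronization
    # on both sides; if the copy still comes back as zeros, the problem is in
    # the device-to-device transfer itself rather than anything in exllama.
    x = torch.randn(8, 8, dtype=torch.float16, device="cuda:0")
    torch.cuda.synchronize("cuda:0")
    y = x.to("cuda:1")
    torch.cuda.synchronize("cuda:1")
    print(torch.equal(x.cpu(), y.cpu()))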

h3ss commented 1 year ago

Earlier I had been using PyTorch 2.0.1, but I just switched to the 2.1 nightly and I'm still getting the same error:

 pip list | grep orch
pytorch-triton           2.1.0+7d1a95b046
torch                    2.1.0.dev20230525+cu118
torchaudio               2.1.0.dev20230525+cu118
torchvision              0.16.0.dev20230525+cu118
 python test_benchmark_inference.py -d /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/ -v -ppl -gs 4,20
 -- Loading model
 -- Tokenizer: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/tokenizer.model
 -- Model config: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/config.json
 -- Model: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/Guanaco-33B-GPTQ-4bit.act-order.safetensors
 -- Sequence length: 2048
 -- Options: ['attention: pytorch_scaled_dp', 'matmul: switched', 'mlp: normal', 'perplexity', 'validate', 'gpu_split: 4,20']
 ** Time, Load model: 3.87 seconds
 -- Groupsize (inferred): None
 -- Act-order (inferred): no
 ** VRAM, Model: [cuda:0] 4,142.07 MB - [cuda:1] 11,797.71 MB
 -- Loading dataset...
 -- Testing..........
 ** Perplexity: nan
 -- Testing.
 ** Perplexity (switched): nan
 -- Testing.
 ** Perplexity (quant_only): nan
Traceback (most recent call last):
  File "/home/john/Projects/Python/GLaDOS/exllama/test_benchmark_inference.py", line 301, in <module>
    text = generator.generate_simple("To be or not to be, that is the", max_new_tokens = 20)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/john/Projects/Python/GLaDOS/exllama/generator.py", line 179, in generate_simple
    token = self.gen_single_token()
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/john/Projects/Python/GLaDOS/exllama/generator.py", line 202, in gen_single_token
    token, _ = self.sample(logits,
               ^^^^^^^^^^^^^^^^^^^
  File "/home/john/Projects/Python/GLaDOS/exllama/generator.py", line 77, in sample
    sampled_ind = torch.multinomial(norm_probs, norm_probs.shape[-1] if num == -1 else min(num, norm_probs.shape[-1]))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

If I do -gs 20,0 (or just don't specify -gs) it also works:

 python test_benchmark_inference.py -d /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/ -v -ppl -gs 20,0
 -- Loading model
 -- Tokenizer: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/tokenizer.model
 -- Model config: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/config.json
 -- Model: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/Guanaco-33B-GPTQ-4bit.act-order.safetensors
 -- Sequence length: 2048
 -- Options: ['attention: pytorch_scaled_dp', 'matmul: switched', 'mlp: normal', 'perplexity', 'validate', 'gpu_split: 20,0']
 ** Time, Load model: 2.48 seconds
 -- Groupsize (inferred): None
 -- Act-order (inferred): no
 ** VRAM, Model: [cuda:0] 15,936.28 MB - [cuda:1] 0.00 MB
 -- Loading dataset...
 -- Testing..........
 ** Perplexity: 4.9241
 -- Testing.
 ** Perplexity (switched): 4.1782
 -- Testing.
 ** Perplexity (quant_only): 4.1643
 ** Generation: To be or not to be, that is the question.\nThe answer is: it depends on what you want to do with your life. If

Thank you very much for going the extra mile to repro on RunPod! In case it helps, here's a bunch more info on my system:

 inxi -GCb                                                                                                                        (glados)    master  
System:
  Host: nous Kernel: 6.3.4-arch1-1 arch: x86_64 bits: 64 Console: pty pts/3 Distro: Arch Linux
Machine:
  Type: Desktop System: Micro-Star product: MS-7E12 v: 1.0 serial: <superuser required>
  Mobo: Micro-Star model: MAG X670E TOMAHAWK WIFI (MS-7E12) v: 1.0 serial: <superuser required>
    UEFI: American Megatrends LLC. v: 1.37 date: 05/15/2023
Battery:
  ID-1: hidpp_battery_0 charge: 100% condition: N/A
CPU:
  Info: 16-core model: AMD Ryzen 9 7950X bits: 64 type: MT MCP cache: L2: 16 MiB
  Speed (MHz): avg: 3126 min/max: 3000/5880 cores: 1: 4494 2: 3000 3: 2795 4: 3000 5: 2855
    6: 2879 7: 3000 8: 3000 9: 3000 10: 3000 11: 3000 12: 3000 13: 3000 14: 3000 15: 3000 16: 3000
    17: 3599 18: 3000 19: 2851 20: 2852 21: 3000 22: 3000 23: 3000 24: 4500 25: 3000 26: 4347
    27: 2881 28: 3000 29: 3000 30: 2999 31: 3000 32: 3000
Graphics:
  Device-1: NVIDIA AD102 [GeForce RTX 4090] driver: nvidia v: 530.41.03
  Device-2: NVIDIA AD102 [GeForce RTX 4090] driver: nvidia v: 530.41.03
  Device-3: AMD Raphael driver: amdgpu v: kernel
  Device-4: Logitech Webcam C930e driver: snd-usb-audio,uvcvideo type: USB
  Display: x11 server: X.org v: 1.21.1.8 with: Xwayland v: 23.1.1 driver: X: loaded: nvidia
    gpu: nvidia,nvidia-nvswitch tty: 213x42 resolution: 1: 3840x2160 2: 2560x2880 3: 1920x1080
  API: OpenGL Message: GL data unavailable in console. Try -G --display
Network:
  Device-1: Realtek RTL8125 2.5GbE driver: r8169
  Device-2: MEDIATEK MT7922 802.11ax PCI Express Wireless Network Adapter driver: mt7921e
Drives:
  Local Storage: total: 5.46 TiB used: 2.17 TiB (39.7%)
Info:
  Processes: 613 Uptime: 8h 14m Memory: available: 187.86 GiB used: 8.94 GiB (4.8%) Init: systemd
  Shell: fish inxi: 3.3.27
h3ss commented 1 year ago

Oh, and here's the nvcc --version:

 nvcc --version                                                                                                                   (glados)    master  
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
turboderp commented 1 year ago

RunPod doesn't seem to provide the latest drivers for their 4090 servers, which means no compute_89, so I can't precisely replicate your setup. They won't let you update drivers as far as I know (?)... It does compile for compute_86 though (with cu117), and this is running fine. It's a bit slow for the smaller models, probably because of the CPU bottleneck (1500 MHz server cores), but the performance is not that bad on 65B:

# python test_benchmark_inference.py -d test2 -p -gs 16,16
 -- Loading model
 -- Tokenizer: test2/tokenizer.model
 -- Model config: test2/config.json
 -- Model: test2/Guanaco-65B-GPTQ-4bit.act-order.safetensors
 -- Sequence length: 2048
 -- Options: ['attention: pytorch_scaled_dp', 'matmul: switched', 'mlp: normal', 'perf', 'gpu_split: 16,16']
 ** Time, Load model: 8.59 seconds
 -- Groupsize (inferred): None
 -- Act-order (inferred): no
 ** VRAM, Model: [cuda:0] 16,222.84 MB - [cuda:1] 15,180.43 MB
 -- Inference, first pass.
 ** Time, Inference: 3.23 seconds
 ** Speed: 594.40 tokens/second
 -- Generating 128 tokens, 1920 token prompt...
 ** Speed: 19.26 tokens/second
 -- Generating 128 tokens, 4 token prompt...
 ** Speed: 19.55 tokens/second
 ** VRAM, Inference: [cuda:0] 3,723.17 MB - [cuda:1] 3,464.65 MB
 ** VRAM, Total: [cuda:0] 19,946.01 MB - [cuda:1] 18,645.08 MB
# python test_chatbot.py -d test2 -gs 16,16 -nnl
 -- Loading model
 -- Tokenizer: test2/tokenizer.model
 -- Model config: test2/config.json
 -- Model: test2/Guanaco-65B-GPTQ-4bit.act-order.safetensors
 -- Sequence length: 2048
 -- Temperature: 0.95
 -- Top-K: 20
 -- Top-P: 0.65
 -- Min-P: 0.00
 -- Repetition penalty: 1.15
 -- Beams: 1 x 1
 -- Options: ['attention: pytorch_scaled_dp', 'matmul: switched', 'mlp: normal', 'no_newline', 'gpu_split: 16,16']
 -- Groupsize (inferred): None
 -- Act-order (inferred): no
Chatbort: Hello, User
User: Hello Chatbort
Chatbort: How can I help you?
User: Tell me a joke that isn't about a tomato.
Chatbort: What is the most confused day of the year? Answer: April 1st because it is April Fools Day and it falls on different days every year! This one isn't about a tomato. Is there anything else I can assist with?

It works with at least a couple of different versions of Torch and CUDA, though I still can't replicate your setup because of the driver version. 525 only supports up to 12.0 as far as I can tell.

But I'm wondering if it could have something to do with the AMD stuff in your device list. Maybe it's confusing Torch? You could try setting CUDA_VISIBLE_DEVICES in your env.

I think later I'll add a debug mode that dumps some of the intermediate state I'd need to figure out exactly where the hidden state is getting corrupted.

turboderp commented 1 year ago

I added the debug mode. If you can try it with -dbg along with -gs and show me the output, it might help at least figure out where it's losing the plot.

h3ss commented 1 year ago

Thanks! Here you go:

> CUDA_VISIBLE_DEVICES=0,1 python test_benchmark_inference.py -d /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/ -v -ppl -dbg -gs 4,20 &> ~/Desktop/exllama_dbg.txt

exllama_dbg.txt

turboderp commented 1 year ago

Okay, so the hidden states are just disappearing when jumping from GPU to GPU? That's super weird. Could you try printing out the hidden state before and after the move? In model.py on line 1103, replace this:

            next_device = self.config.device_map.layers[i]
            if device != next_device:
                if self.config.debug: print(f" !! Moving hidden states from {device} to {next_device}")
                hidden_states = hidden_states.to(next_device)
                device = next_device

with this:

            next_device = self.config.device_map.layers[i]
            if device != next_device:
                if self.config.debug: print(f" !! Moving hidden states from {device} to {next_device}")
                print("-------")
                print(hidden_states)
                print("-------")
                hidden_states = hidden_states.to(next_device)
                print(hidden_states)
                print("-------")
                device = next_device

If this is where the contents disappear I really don't know what to think...

h3ss commented 1 year ago

So it would seem... Just made that patch to model.py and here is the result:

exllama_dbg_2.txt

And the relevant section:

 !! Moving hidden states from cuda:0 to cuda:1
-------
tensor([[[ 0.0593261719, -0.7446289062,  0.7128906250,  ...,
           0.6777343750,  0.3395996094,  0.5122070312],
         [ 0.0980224609, -0.7451171875,  0.6752929688,  ...,
           0.6308593750,  0.3728027344,  0.4899902344],
         [ 0.1042480469, -0.7041015625,  0.6972656250,  ...,
           0.6777343750,  0.3779296875,  0.5922851562],
         ...,
         [-0.5024414062,  0.1296386719, -0.2453613281,  ...,
           0.4074707031,  0.2282714844, -0.1484375000],
         [-0.1105957031, -0.2471923828, -0.4714355469,  ...,
          -0.6914062500,  0.2412109375, -0.0307006836],
         [ 0.3864746094, -0.3850097656, -0.7495117188,  ...,
          -0.5483398438, -0.1068115234,  0.2222900391]]], device='cuda:0',
       dtype=torch.float16)
-------
tensor([[[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]], device='cuda:1',
       dtype=torch.float16)
-------
h3ss commented 1 year ago

I'm going to go see if IOMMU is enabled and disable it if so...

Moving a tensor across CUDA devices gets zero tensor, CUDA 11.0 #87363

turboderp commented 1 year ago

Yes, that sounds like the same issue. Since it only seems to affect transfers between GPUs, you could probably work around it by copying via system RAM like this instead of having to disable IOMMU:

                hidden_states = hidden_states.to("cpu")
                hidden_states = hidden_states.to(next_device)  

There would be a (very small) performance cost, but I could add it as a fallback at least, if it works.
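
As a sketch, such a fallback might be wrapped in a small helper along these lines (hypothetical helper, not what the repo actually uses):

    import torch

    def move_tensor(t: torch.Tensor, target, via_cpu: bool = False) -> torch.Tensor:
        # Move a tensor to another CUDA device. With via_cpu=True the transfer
        # hops through system RAM, which avoids the broken peer-to-peer path.
        target = torch.device(target)
        if via_cpu and t.is_cuda and t.device != target:
            t = t.to("cpu")
        return t.to(target)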

h3ss commented 1 year ago

So... I tried disabling IOMMU, and that didn't seem to have any effect (it was set to "auto" in the BIOS, and I did not have it enabled in the kernel command line, so ¯\_(ツ)_/¯).

But passing the hidden_states through the CPU first did fix the specific issue of the hidden state getting zeroed out:

 !! Moving hidden states from cuda:0 to cuda:1
------- BEFORE -------
tensor([[[-0.9682617188,  0.4150390625,  0.7978515625,  ...,
          -0.2631835938,  1.1533203125, -0.0405273438],
         [-0.4916992188,  0.7578125000,  0.1789550781,  ...,
          -0.7290039062,  1.1640625000,  0.0568237305],
         [-0.2653808594,  0.4392089844,  0.1284179688,  ...,
          -0.0774536133,  0.0227050781, -0.1219482422],
         ...,
         [-0.2225341797,  0.6162109375, -0.1560058594,  ...,
           0.1889648438,  0.6196289062, -0.4868164062],
         [-0.1208496094,  0.2434082031, -0.4838867188,  ...,
          -0.1448974609,  0.0299072266, -0.2609863281],
         [ 0.2905273438,  0.2878417969, -0.3974609375,  ...,
          -0.2851562500,  0.0317382812,  0.8500976562]]], device='cuda:0',
       dtype=torch.float16)
------- AFTER --------
tensor([[[-0.9682617188,  0.4150390625,  0.7978515625,  ...,
          -0.2631835938,  1.1533203125, -0.0405273438],
         [-0.4916992188,  0.7578125000,  0.1789550781,  ...,
          -0.7290039062,  1.1640625000,  0.0568237305],
         [-0.2653808594,  0.4392089844,  0.1284179688,  ...,
          -0.0774536133,  0.0227050781, -0.1219482422],
         ...,
         [-0.2225341797,  0.6162109375, -0.1560058594,  ...,
           0.1889648438,  0.6196289062, -0.4868164062],
         [-0.1208496094,  0.2434082031, -0.4838867188,  ...,
          -0.1448974609,  0.0299072266, -0.2609863281],
         [ 0.2905273438,  0.2878417969, -0.3974609375,  ...,
          -0.2851562500,  0.0317382812,  0.8500976562]]], device='cuda:1',
       dtype=torch.float16)
----------------------

After doing a bit of research, it looks like the 40xx series explicitly does not support "P2P". I tried using torch.cuda.can_device_access_peer to see if I could detect the situation, but alas, it returns True, lol. So I think it would need to be implemented as a command-line argument until NVIDIA sorts the situation out.
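
For what it's worth, an empirical probe (copy a small known tensor across and compare the values) might detect the broken path where can_device_access_peer doesn't; a rough sketch, not exllama code:

    import torch

    def p2p_copy_works(src="cuda:0", dst="cuda:1") -> bool:
        # torch.cuda.can_device_access_peer can report True even when direct
        # copies silently produce zeros, so test with real data instead.
        a = torch.arange(16, dtype=torch.float16, device=src)
        b = a.to(dst)
        torch.cuda.synchronize(dst)
        return torch.equal(a.cpu(), b.cpu())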

However, there still seems to be an issue with GPU splitting. The model gives seemingly incoherent results, and the perplexity scores are still huge or nan:

> python test_benchmark_inference.py -d /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-33B-GPTQ/ -v -ppl -dbg -gs 4,20 > ~/Desktop/exllama_dbg_3.txt

...

 ** Time, Inference: 1.59 seconds
 -- Loading dataset...
 -- Testing..........
 ** Perplexity: nan
 -- Testing.
 ** Perplexity (switched): 487.1518
 -- Testing.
 ** Perplexity (quant_only): 229.1357
 ** Generation: To be or not to be, that is the\n  %% ### ### ### ### ###-1 %% ###- . %%-\n <! %%-

Attaching a full debug output: exllama_dbg_3.txt

Note: I get expected perplexity results and coherent output if -gs doesn't split between GPUs.

turboderp commented 1 year ago

There is one other place where it moves data from GPU to GPU, but it's a little more subtle. It's the position embeddings, which would end up being all zeros on one GPU if the issue is that data just can't be moved across that way. And that would explain the output being garbage. In fact it fits nicely with a perplexity in the hundreds rather than nan.

It is weird that it works between my 4090 and 3070-Ti, and I also tested it on two 4090s on RunPod, so there must be something else in your setup causing it, maybe not IOMMU but related to it. Some kernel parameter or something?

Anyway, I pushed a new update with an extra option to force all the transfers (hopefully) to go via system RAM. I can't actually measure any difference in performance, so maybe I'll just make it the default, but for now you can try running with -gpfix.

h3ss commented 1 year ago

Fantastic! That did the trick :) Thank you!