johnsmith0031 / alpaca_lora_4bit

MIT License

Inf or NaN in probabilities. Windows 10, vicuna-7b-gptq-4bit-128g #124

Closed alex4321 closed 1 year ago

alex4321 commented 1 year ago

I have a Windows 10 machine where I am now trying to run some vicuna-based scripts.

So I did the following stuff:

  1. created conda environment
    conda create --name llama-memory python=3.10
    conda activate llama-memory
    conda install pytorch torchvision torchaudio pytorch-cuda=11.7 cudatoolkit-dev -c pytorch -c nvidia -c conda-forge
  2. Then:
    pip install git+https://github.com/johnsmith0031/alpaca_lora_4bit@winglian-setup_pip

So far everything went fine, since I had both nvcc from the cudatoolkit-dev package and the VC compilers.

  3. Then I tried to run Python (at first I caught the issue using my own wrapper library on top of your monkeypatching stuff, so now I am repeating it using purely your library):
    
    Python 3.10.11 | packaged by Anaconda, Inc. | (main, May 16 2023, 00:55:32) [MSC v.1916 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from alpaca_lora_4bit.autograd_4bit import load_llama_model_4bit_low_ram, Autograd4bitQuantLinear
    Triton not found. Please run "pip install triton".

>>> model, tokenizer = load_llama_model_4bit_low_ram("vicuna-7B-GPTQ-4bit-128g", "vicuna-7B-GPTQ-4bit-128g/vicuna-7B-GPTQ-4bit-128g.safetensors", groupsize=128)
Loading Model ...
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
The safetensors archive passed at vicuna-7B-GPTQ-4bit-128g/vicuna-7B-GPTQ-4bit-128g.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
Loaded the model in 4.32 seconds.
>>> model.half()
>>> for n, m in model.named_modules():
...     if isinstance(m, Autograd4bitQuantLinear):
...         if m.is_v1_model:
...             m.zeros = m.zeros.half()
...         m.scales = m.scales.half()
...         m.bias = m.bias.half()
...
>>> from alpaca_lora_4bit.amp_wrapper import AMPWrapper
>>> wrapper = AMPWrapper(model)
>>> wrapper.apply_generate()
>>>
>>> prompt = '''I think the meaning of life is'''
>>> batch = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
>>> batch = {k: v.cuda() for k, v in batch.items()}
>>>
>>> import torch
>>> with torch.no_grad():
...     generated = model.generate(inputs=batch["input_ids"],
...                                do_sample=True, use_cache=True,
...                                repetition_penalty=1.1,
...                                max_new_tokens=20,
...                                temperature=0.9,
...                                top_p=0.95,
...                                top_k=40,
...                                return_dict_in_generate=True,
...                                output_attentions=False,
...                                output_hidden_states=False,
...                                output_scores=False)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Users\alex4321\anaconda3\envs\llama-memory\lib\site-packages\alpaca_lora_4bit\amp_wrapper.py", line 18, in autocast_generate
    return self.model.non_autocast_generate(*args, **kwargs)
  File "C:\Users\alex4321\anaconda3\envs\llama-memory\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\alex4321\anaconda3\envs\llama-memory\lib\site-packages\transformers\generation\utils.py", line 1572, in generate
    return self.sample(
  File "C:\Users\alex4321\anaconda3\envs\llama-memory\lib\site-packages\transformers\generation\utils.py", line 2655, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

  4. Here is my info. nvidia-smi output:
    
    Wed Jun 21 06:01:11 2023
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 536.23                 Driver Version: 536.23       CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  NVIDIA GeForce RTX 2080 Ti   WDDM  | 00000000:06:00.0  On |                  N/A |
    |  0%   46C    P8              21W / 250W |    971MiB / 11264MiB |      2%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+

    +---------------------------------------------------------------------------------------+
    | Processes:                                                                             |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    |    0   N/A  N/A      2856    C+G   ...m Files\Mozilla Firefox\firefox.exe      N/A    |
    |    0   N/A  N/A      5624    C+G   C:\Windows\explorer.exe                     N/A    |
    |    0   N/A  N/A      7124    C+G   ...siveControlPanel\SystemSettings.exe      N/A    |
    |    0   N/A  N/A      7152    C+G   ....Search_cw5n1h2txyewy\SearchApp.exe      N/A    |
    |    0   N/A  N/A      8332    C+G   ...t.LockApp_cw5n1h2txyewy\LockApp.exe      N/A    |
    |    0   N/A  N/A      9564    C+G   ...ekyb3d8bbwe\PhoneExperienceHost.exe      N/A    |
    |    0   N/A  N/A      9872    C+G   ....Search_cw5n1h2txyewy\SearchApp.exe      N/A    |
    |    0   N/A  N/A     10252    C+G   ...\cef\cef.win7x64\steamwebhelper.exe      N/A    |
    |    0   N/A  N/A     10968    C+G   ...CBS_cw5n1h2txyewy\TextInputHost.exe      N/A    |
    |    0   N/A  N/A     11140    C+G   ...al\Discord\app-1.0.9013\Discord.exe      N/A    |
    |    0   N/A  N/A     11384    C+G   ...m Files\Mozilla Firefox\firefox.exe      N/A    |
    |    0   N/A  N/A     12372    C+G   ...crosoft\Edge\Application\msedge.exe      N/A    |
    |    0   N/A  N/A     12520    C+G   ....Search_cw5n1h2txyewy\SearchApp.exe      N/A    |
    |    0   N/A  N/A     13880    C+G   ...5n1h2txyewy\ShellExperienceHost.exe      N/A    |
    +---------------------------------------------------------------------------------------+

nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_19:00:59_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

python --version

Python 3.10.11

alex4321 commented 1 year ago

Hm. Reviewing the message after posting, I caught one issue:

nvcc: NVIDIA (R) Cuda compiler driver...
Cuda compilation tools, release 11.7, V11.7.64

but:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 536.23                 Driver Version: 536.23       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+

so now I will try to recreate the environment, explicitly specifying cudatoolkit 11.7 (to use the same version as nvcc).

johnsmith0031 commented 1 year ago

What is the format of the model? If you are using the model with act order, you should set:

import matmul_utils_4bit
matmul_utils_4bit.act_order = True
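
With the package installed via pip, the module lives under the alpaca_lora_4bit namespace, so the equivalent is:

import alpaca_lora_4bit.matmul_utils_4bit
alpaca_lora_4bit.matmul_utils_4bit.act_order = True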
alex4321 commented 1 year ago

Okay, the attempt to recreate the environment gave me the following result:

conda create --name llama-memory python=3.10
conda activate llama-memory
REM faster implementation of the same package-managing logic as conda
conda install -c conda-forge mamba
REM the following installation is quite long, but that's offtopic
mamba install pytorch torchvision torchaudio pytorch-cuda=11.7 cudatoolkit=11.7.0 cudatoolkit-dev=11.7.0 -c pytorch -c nvidia -c conda-forge
nvidia-smi
REM still gave me 12.2
nvcc --version
REM 11.7

Which is strange, but now I will check the act_order stuff.

alex4321 commented 1 year ago

@johnsmith0031 tried that too:

>>> import alpaca_lora_4bit.matmul_utils_4bit
Triton not found. Please run "pip install triton".
>>> alpaca_lora_4bit.matmul_utils_4bit.act_order = True
>>>
>>> from alpaca_lora_4bit.autograd_4bit import load_llama_model_4bit_low_ram, Autograd4bitQuantLinear
>>> model, tokenizer = load_llama_model_4bit_low_ram("vicuna-7B-GPTQ-4bit-128g", "vicuna-7B-GPTQ-4bit-128g/vicuna-7B-GPTQ-4bit-128g.safetensors", groupsize=128)
Loading Model ...
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
The safetensors archive passed at vicuna-7B-GPTQ-4bit-128g/vicuna-7B-GPTQ-4bit-128g.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
Loaded the model in 4.23 seconds.
>>>
>>> model.half();
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32001, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Autograd4bitQuantLinear()
          (k_proj): Autograd4bitQuantLinear()
          (v_proj): Autograd4bitQuantLinear()
          (o_proj): Autograd4bitQuantLinear()
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Autograd4bitQuantLinear()
          (down_proj): Autograd4bitQuantLinear()
          (up_proj): Autograd4bitQuantLinear()
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=32001, bias=False)
)
>>> for n, m in model.named_modules():
...     if isinstance(m, Autograd4bitQuantLinear):
...         if m.is_v1_model:
...             m.zeros = m.zeros.half()
...         m.scales = m.scales.half()
...         m.bias = m.bias.half()
...
>>>
>>> from alpaca_lora_4bit.amp_wrapper import AMPWrapper
>>> wrapper = AMPWrapper(model)
>>> wrapper.apply_generate()
>>>
>>>
>>> prompt = '''I think the meaning of life is'''
>>> batch = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
>>> batch = {k: v.cuda() for k, v in batch.items()}
>>>
>>>
>>> import torch
>>> with torch.no_grad():
...      generated = model.generate(inputs=batch["input_ids"],
...                                 do_sample=True, use_cache=True,
...                                 repetition_penalty=1.1,
...                                 max_new_tokens=20,
...                                 temperature=0.9,
...                                 top_p=0.95,
...                                 top_k=40,
...                                 return_dict_in_generate=True,
...                                 output_attentions=False,
...                                 output_hidden_states=False,
...                                 output_scores=False)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Users\alex4321\anaconda3\envs\llama-memory\lib\site-packages\alpaca_lora_4bit\amp_wrapper.py", line 18, in autocast_generate
    return self.model.non_autocast_generate(*args, **kwargs)
  File "C:\Users\alex4321\anaconda3\envs\llama-memory\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\alex4321\anaconda3\envs\llama-memory\lib\site-packages\transformers\generation\utils.py", line 1572, in generate
    return self.sample(
  File "C:\Users\alex4321\anaconda3\envs\llama-memory\lib\site-packages\transformers\generation\utils.py", line 2655, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Model is from that one: https://huggingface.co/TheBloke/vicuna-7B-GPTQ-4bit-128g

But I am not sure everything went okay with the environment (see the version mismatch I mentioned), so I will recheck that now.

alex4321 commented 1 year ago

Okay, I got it. nvidia-smi shows the CUDA version supported by the driver, not the version installed inside the environment, and that's expected.

Continuing the search, then.
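
For the record, the CUDA build that matters for the kernels is the one PyTorch itself reports; it can be checked directly from Python (standard torch APIs, nothing specific to this repo):

import torch

print(torch.version.cuda)             # CUDA toolkit PyTorch was built against, e.g. 11.7
print(torch.cuda.is_available())      # whether the driver/runtime pair works at all
print(torch.cuda.get_device_name(0))  # e.g. NVIDIA GeForce RTX 2080 Ti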

alex4321 commented 1 year ago

Okay, I reinstalled everything from scratch (from the GPU driver to CUDA, and then the conda environment / PyTorch / alpaca_lora_4bit, so the kernel was recompiled) and even tried another model: https://huggingface.co/TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g/tree/main

import alpaca_lora_4bit.matmul_utils_4bit
alpaca_lora_4bit.matmul_utils_4bit.act_order = False

from alpaca_lora_4bit.autograd_4bit import load_llama_model_4bit_low_ram, Autograd4bitQuantLinear
model, tokenizer = load_llama_model_4bit_low_ram("vicuna-13B-1.1-GPTQ-4bit-128g", "vicuna-13B-1.1-GPTQ-4bit-128g/vicuna-13B-1.1-GPTQ-4bit-128g.compat.no-act-order.pt", groupsize=128)

model.half();
for n, m in model.named_modules():
    if isinstance(m, Autograd4bitQuantLinear):
        if m.is_v1_model:
            m.zeros = m.zeros.half()
        m.scales = m.scales.half()
        m.bias = m.bias.half()

from alpaca_lora_4bit.amp_wrapper import AMPWrapper
wrapper = AMPWrapper(model)
wrapper.apply_generate()

prompt = '''I think the meaning of life is'''
batch = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
batch = {k: v.cuda() for k, v in batch.items()}

import torch

with torch.no_grad():
     generated = model.generate(inputs=batch["input_ids"],
                                do_sample=True, use_cache=True,
                                repetition_penalty=1.1,
                                max_new_tokens=20,
                                temperature=0.9,
                                top_p=0.95,
                                top_k=40,
                                return_dict_in_generate=True,
                                output_attentions=False,
                                output_hidden_states=False,
                                output_scores=False)

But it still gives the same issue:

Loading Model ...
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Loaded the model in 6.61 seconds.
Traceback (most recent call last):
  File "C:\Users\alex4321\Documents\llama-memory\test.py", line 26, in <module>
    generated = model.generate(inputs=batch["input_ids"],
  File "C:\Users\alex4321\anaconda3\envs\llama-memory\lib\site-packages\alpaca_lora_4bit\amp_wrapper.py", line 18, in autocast_generate
    return self.model.non_autocast_generate(*args, **kwargs)
  File "C:\Users\alex4321\anaconda3\envs\llama-memory\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\alex4321\anaconda3\envs\llama-memory\lib\site-packages\transformers\generation\utils.py", line 1572, in generate
    return self.sample(
  File "C:\Users\alex4321\anaconda3\envs\llama-memory\lib\site-packages\transformers\generation\utils.py", line 2655, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
alex4321 commented 1 year ago

Due to https://github.com/johnsmith0031/alpaca_lora_4bit/commit/6dd9cd960fe6357cd2f1c86f691d4c0e3fcb1a9d I also tried to set:

alpaca_lora_4bit.matmul_utils_4bit.act_order = False
alpaca_lora_4bit.matmul_utils_4bit.faster_mode = 'disable' # Also tried 'faster' and 'old_faster'

but got the same result

alex4321 commented 1 year ago

The strange thing is that it worked previously with an alpaca_lora_4bit installation roughly a week old, but unfortunately I can't reproduce that environment now (I broke it while making one of my fork's fixes for this library), and reinstalling from specific commits changed nothing.

johnsmith0031 commented 1 year ago

Strange... Maybe you can try the code on a cloud server with the same process (e.g. using runpod) and check whether it works there, and what is wrong with your local env.

alex4321 commented 1 year ago

Yeah, I guess I will retry things a bit later.

Besides, I also tried the main branch (as I mentioned, I had been using winglian-setup_pip) - the same error remains there.

alex4321 commented 1 year ago

Sorry, the previous message was incorrect. I removed it and am redoing the check.

alex4321 commented 1 year ago

Unstable run results?

Hm. That's strange. Inside WSL:

test.py content

#import alpaca_lora_4bit.matmul_utils_4bit
#alpaca_lora_4bit.matmul_utils_4bit.act_order = False
#alpaca_lora_4bit.matmul_utils_4bit.faster_mode = 'disable'

from alpaca_lora_4bit.autograd_4bit import load_llama_model_4bit_low_ram, Autograd4bitQuantLinear, switch_backend_to
switch_backend_to("cuda")

model, tokenizer = load_llama_model_4bit_low_ram("vicuna-13B-1.1-GPTQ-4bit-128g", "vicuna-13B-1.1-GPTQ-4bit-128g/vicuna-13B-1.1-GPTQ-4bit-128g.compat.no-act-order.pt", groupsize=128)

model.half();
for n, m in model.named_modules():
    if isinstance(m, Autograd4bitQuantLinear):
        if m.is_v1_model:
            m.zeros = m.zeros.half()
        m.scales = m.scales.half()
        m.bias = m.bias.half()

model.tie_weights()

from alpaca_lora_4bit.amp_wrapper import AMPWrapper
wrapper = AMPWrapper(model)
wrapper.apply_generate()

prompt = '''I think the meaning of life is'''
batch = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
batch = {k: v.cuda() for k, v in batch.items()}

import torch

with torch.no_grad():
     generated = model.generate(inputs=batch["input_ids"],
                                do_sample=True, use_cache=True,
                                repetition_penalty=1.1,
                                max_new_tokens=1,
                                temperature=0.5,
                                top_p=0.97,
                                top_k=40,
                                return_dict_in_generate=True,
                                output_attentions=False,
                                output_hidden_states=False,
                                output_scores=False)
bash -c "yes | pip uninstall alpaca_lora_4bit" && pip install git+https://github.com/johnsmith0031/alpaca_lora_4bit@05f55c010a571dfb15fa9799e444d3c203429045 && python test.py

First run:

Loading Model ...
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Loaded the model in 14.19 seconds.
(llama)

Second one:

Using CUDA implementation.
Loading Model ...
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Loaded the model in 14.54 seconds.

Third one:

Using CUDA implementation.
Loading Model ...
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Loaded the model in 14.62 seconds.
Traceback (most recent call last):
  File "/mnt/c/Users/alex4321/Documents/alpaca_lora_4bit/test.py", line 31, in <module>
    generated = model.generate(inputs=batch["input_ids"],
  File "/home/alex4321/miniconda3/envs/llama/lib/python3.10/site-packages/alpaca_lora_4bit/amp_wrapper.py", line 18, in autocast_generate
    return self.model.non_autocast_generate(*args, **kwargs)
  File "/home/alex4321/miniconda3/envs/llama/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/alex4321/miniconda3/envs/llama/lib/python3.10/site-packages/transformers/generation/utils.py", line 1574, in generate
    return self.sample(
  File "/home/alex4321/miniconda3/envs/llama/lib/python3.10/site-packages/transformers/generation/utils.py", line 2657, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

So - an unstable result.

Then I ran test.py again. Got the same issue.

Ran the installation again (so, a 4th attempt):

bash -c "yes | pip uninstall alpaca_lora_4bit" && pip install git+https://github.com/johnsmith0031/alpaca_lora_4bit@05f55c010a571dfb15fa9799e444d3c203429045 && python test.py

No error:

Using CUDA implementation.
Loading Model ...
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Loaded the model in 14.61 seconds.

python test.py

python test.py
Using CUDA implementation.
Loading Model ...
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Loaded the model in 18.42 seconds.
Traceback (most recent call last):
  File "/mnt/c/Users/alex4321/Documents/alpaca_lora_4bit/test.py", line 31, in <module>
    generated = model.generate(inputs=batch["input_ids"],
  File "/home/alex4321/miniconda3/envs/llama/lib/python3.10/site-packages/alpaca_lora_4bit/amp_wrapper.py", line 18, in autocast_generate
    return self.model.non_autocast_generate(*args, **kwargs)
  File "/home/alex4321/miniconda3/envs/llama/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/alex4321/miniconda3/envs/llama/lib/python3.10/site-packages/transformers/generation/utils.py", line 1574, in generate
    return self.sample(
  File "/home/alex4321/miniconda3/envs/llama/lib/python3.10/site-packages/transformers/generation/utils.py", line 2657, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

The error occurs, while just before, the same compiled kernel (from exactly the same compilation) did not return an error.

Hm, I guess I will try some hardware checks - memory especially, both RAM and the GPU's.

alex4321 commented 1 year ago

Meanwhile, inside native Windows (not the WSL I mentioned):

pip uninstall --yes alpaca_lora_4bit
pip install git+https://github.com/johnsmith0031/alpaca_lora_4bit@05f55c010a571dfb15fa9799e444d3c203429045
python test.py
Triton not found. Please run "pip install triton".
Using CUDA implementation.
Loading Model ...
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Loaded the model in 6.29 seconds.
Traceback (most recent call last):
  File "C:\Users\alex4321\Documents\alpaca_lora_4bit\test.py", line 31, in <module>
    generated = model.generate(inputs=batch["input_ids"],
  File "C:\Users\alex4321\anaconda3\envs\llama\lib\site-packages\alpaca_lora_4bit\amp_wrapper.py", line 18, in autocast_generate
    return self.model.non_autocast_generate(*args, **kwargs)
  File "C:\Users\alex4321\anaconda3\envs\llama\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\alex4321\anaconda3\envs\llama\lib\site-packages\transformers\generation\utils.py", line 1574, in generate
    return self.sample(
  File "C:\Users\alex4321\anaconda3\envs\llama\lib\site-packages\transformers\generation\utils.py", line 2657, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Another attempt:

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Third one:

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Fourth one:

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

So for commit 05f55c010a571dfb15fa9799e444d3c203429045 I have unstable results in WSL and stable failures in Windows.

Later today I will:

alex4321 commented 1 year ago

For commit 65e5495ce5ffcd669408ad5d3e5d6e0f71739e9f (the latest state):

Inside WSL: 2 successful attempts, 2 failed

Inside Windows: 4 failures

Going to do hardware checks and run the same tests in other environments.

alex4321 commented 1 year ago

Most probably it's not about my hardware.

At least the WSL instability probably isn't (unlike Windows, with WSL I got some successful runs as well as failed ones). I'm not sure of the reason, but the issues may still be related.

Because I tried Google Colab and managed to repeat this instability: https://colab.research.google.com/drive/1Topc3AAYEbSesK6gzj_kQdrmYyoFcRbV?usp=sharing

Of course the environment differs (native Linux vs WSL or Windows, a Tesla T4 instead of a 2080 Ti), but I guess it's most likely that at least the WSL and Colab problems are the same.

So take a look at this, @johnsmith0031, if you have a chance.

I will probably dive into this instability issue later too.

alex4321 commented 1 year ago

Okay, I changed my testing script:

#import alpaca_lora_4bit.matmul_utils_4bit
#alpaca_lora_4bit.matmul_utils_4bit.act_order = False
#alpaca_lora_4bit.matmul_utils_4bit.faster_mode = 'disable'

import torch
from alpaca_lora_4bit.autograd_4bit import load_llama_model_4bit_low_ram, Autograd4bitQuantLinear, switch_backend_to
switch_backend_to("cuda")

def build_forward_check_nan_inf(name, module):
    def _func(*args, **kwargs):
        result = module._nan_check_old_forward(*args, **kwargs)
        tensors = []
        if isinstance(result, torch.Tensor):
            tensors = result
        if isinstance(result, tuple) or isinstance(result, list):
            tensors = [item for item in result if isinstance(item, torch.Tensor)]
        if isinstance(result, dict):
            tensors = [item for item in result.values() if isinstance(item, torch.Tensor)]
        for tensor in tensors:
            if torch.isinf(tensor).any().item():
                raise ValueError(f"Got Inf in {name} output")
            if torch.isnan(tensor).any().item():
                raise ValueError(f"Got NaN in {name} output")
        return result

    return _func

def wrap_forward_check_nan_inf(module, name):
    module._nan_check_old_forward = module.forward
    module.forward = build_forward_check_nan_inf(name, module)

model, tokenizer = load_llama_model_4bit_low_ram("vicuna-13B-1.1-GPTQ-4bit-128g", "vicuna-13B-1.1-GPTQ-4bit-128g/vicuna-13B-1.1-GPTQ-4bit-128g.compat.no-act-order.pt", groupsize=128)

model.half();
for n, m in model.named_modules():
    if isinstance(m, Autograd4bitQuantLinear):
        if m.is_v1_model:
            m.zeros = m.zeros.half()
        m.scales = m.scales.half()
        m.bias = m.bias.half()

model.tie_weights()

wrap_forward_check_nan_inf(model.model.embed_tokens, "embed_tokens")
for i in range(40):
    wrap_forward_check_nan_inf(model.model.layers[i].self_attn.q_proj, f"Layer {i} self-attention q_proj")
    wrap_forward_check_nan_inf(model.model.layers[i].self_attn.k_proj, f"Layer {i} self-attention k_proj")
    wrap_forward_check_nan_inf(model.model.layers[i].self_attn.v_proj, f"Layer {i} self-attention v_proj")
    wrap_forward_check_nan_inf(model.model.layers[i].self_attn.o_proj, f"Layer {i} self-attention o_proj")
    wrap_forward_check_nan_inf(model.model.layers[i].self_attn.rotary_emb, f"Layer {i} self-attention rotary_emb")
    wrap_forward_check_nan_inf(model.model.layers[i].self_attn, f"Layer {i} self-attention itself")
    wrap_forward_check_nan_inf(model.model.layers[i].mlp.gate_proj, f"Layer {i} mlp gate_proj")
    wrap_forward_check_nan_inf(model.model.layers[i].mlp.down_proj, f"Layer {i} mlp down_proj")
    wrap_forward_check_nan_inf(model.model.layers[i].mlp.up_proj, f"Layer {i} mlp up_proj")
    wrap_forward_check_nan_inf(model.model.layers[i].mlp.act_fn, f"Layer {i} mlp act_fn")
    wrap_forward_check_nan_inf(model.model.layers[i].mlp, f"Layer {i} mlp itself")
    wrap_forward_check_nan_inf(model.model.layers[i].input_layernorm, f"Layer {i} input_layernorm")
    wrap_forward_check_nan_inf(model.model.layers[i].post_attention_layernorm, f"Layer {i} post_attention_layernorm")
wrap_forward_check_nan_inf(model.model.norm, "norm")
wrap_forward_check_nan_inf(model.lm_head, "lm_head")

from alpaca_lora_4bit.amp_wrapper import AMPWrapper
wrapper = AMPWrapper(model)
wrapper.apply_generate()

prompt = '''I think the meaning of life is'''
batch = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
batch = {k: v.cuda() for k, v in batch.items()}

import torch

with torch.no_grad():
     generated = model.generate(inputs=batch["input_ids"],
                                do_sample=True, use_cache=True,
                                repetition_penalty=1.1,
                                max_new_tokens=1,
                                temperature=0.5,
                                top_p=0.97,
                                top_k=40,
                                return_dict_in_generate=True,
                                output_attentions=False,
                                output_hidden_states=False,
                                output_scores=False)

So now I can see which module fails.

So far, running it on the Windows machine (I will retry it with the same Colab notebook now), I got the following from 5 different runs:

ValueError: Got NaN in Layer 0 self-attention o_proj output
ValueError: Got Inf in Layer 0 mlp down_proj output
ValueError: Got Inf in Layer 1 mlp gate_proj output
ValueError: Got Inf in Layer 0 self-attention q_proj output
ValueError: Got Inf in Layer 0 mlp gate_proj output

So all the errors are related to the 4-bit linear layers.

alex4321 commented 1 year ago

Yeah, the Colab notebook gives me a similar result: https://colab.research.google.com/drive/1Topc3AAYEbSesK6gzj_kQdrmYyoFcRbV?usp=sharing

ValueError: Got Inf in Layer 0 self-attention q_proj output
ValueError: Got Inf in Layer 0 self-attention q_proj output
ValueError: Got Inf in Layer 0 mlp down_proj output

So on one hand:

So I guess it's about a difference in some OS-dependent mechanism and the way it is used.

I don't know exactly yet, but I will dive deeper into debugging soon.

alex4321 commented 1 year ago

Lol, that's very strange.

I made a testing/debugging notebook: https://github.com/alex4321/alpaca_lora_4bit/blob/fix-nan-or-inf-after-linear/test.ipynb

The monkeypatch that triggers the assertions is here:

def _patched_linear_forward(self, x):
    assert not torch.isnan(x).any().item()
    assert not torch.isinf(x).any().item()
    if self.bits == 4:
        if torch.is_grad_enabled():
            out = autograd_4bit.AutogradMatmul4bit.apply(x, self.qweight, self.scales,
                                        self.qzeros if not self.is_v1_model else self.zeros,
                                        self.g_idx, self.bits, self.maxq)
            assert not torch.isnan(out).any().item()
            assert not torch.isinf(out).any().item()
        else:
            out = autograd_4bit.matmul4bit_with_backend(x, self.qweight, self.scales,
                                        self.qzeros if not self.is_v1_model else self.zeros,
                                        self.g_idx, self.bits, self.maxq, self.groupsize)
            assert not torch.isnan(out).any().item()
            assert not torch.isinf(out).any().item()
    elif self.bits == 2:
        raise NotImplementedError("Debugging 4-bit case")
        out = AutogradMatmul2bit.apply(x, self.qweight, self.scales, self.qzeros, self.g_idx, self.bits, self.maxq)
    else:
        raise NotImplementedError("Debugging 4-bit case")
        raise ValueError('Unsupported bitwidth.')
    if not self.disable_bias:
        out += self.bias
        assert not torch.isnan(out).any().item()
        assert not torch.isinf(out).any().item()
    return out

autograd_4bit.Autograd4bitQuantLinear.forward = _patched_linear_forward

So basically it triggers an exception in the following cases:

And what do I see after a few runs of the notebook? In all cases the error occurred in the same place in the code:

      8     if torch.isnan(tensor).any().item():
      9         raise ValueError(f"Got NaN in {name} input")
---> 10 result = module._nan_check_old_forward(*args, **kwargs)
     11 tensors = []
     12 if isinstance(result, torch.Tensor):

File [c:\Users\alex4321\anaconda3\envs\llama\lib\site-packages\accelerate-0.20.3-py3.10.egg\accelerate\hooks.py:165](file:///C:/Users/alex4321/anaconda3/envs/llama/lib/site-packages/accelerate-0.20.3-py3.10.egg/accelerate/hooks.py:165), in add_hook_to_module..new_forward(*args, **kwargs)
    163         output = old_forward(*args, **kwargs)
    164 else:
--> 165     output = old_forward(*args, **kwargs)
    166 return module._hf_hook.post_forward(module, output)

Cell In[5], line 26, in _patched_linear_forward(self, x)
     24     out += self.bias
     25     assert not torch.isnan(out).any().item()
---> 26     assert not torch.isinf(out).any().item()
     27 return out

AssertionError:

Very strange. I will check the inputs and biases now, and I will also check the same patch on a Linux system.

alex4321 commented 1 year ago

Updated the debugging monkeypatch:

def _patched_linear_forward(self, x):
    assert not torch.isnan(x).any().item()
    assert not torch.isinf(x).any().item()
    if self.bits == 4:
        if torch.is_grad_enabled():
            out = autograd_4bit.AutogradMatmul4bit.apply(x, self.qweight, self.scales,
                                        self.qzeros if not self.is_v1_model else self.zeros,
                                        self.g_idx, self.bits, self.maxq)
            assert not torch.isnan(out).any().item()
            assert not torch.isinf(out).any().item()
        else:
            out = autograd_4bit.matmul4bit_with_backend(x, self.qweight, self.scales,
                                        self.qzeros if not self.is_v1_model else self.zeros,
                                        self.g_idx, self.bits, self.maxq, self.groupsize)
            assert not torch.isnan(out).any().item()
            assert not torch.isinf(out).any().item()
    elif self.bits == 2:
        raise NotImplementedError("Debugging 4-bit case")
        out = AutogradMatmul2bit.apply(x, self.qweight, self.scales, self.qzeros, self.g_idx, self.bits, self.maxq)
    else:
        raise NotImplementedError("Debugging 4-bit case")
        raise ValueError('Unsupported bitwidth.')
    if not self.disable_bias:
        assert not torch.isnan(self.bias).any().item()
        assert not torch.isinf(self.bias).any().item()
        out += self.bias
        assert not torch.isnan(out).any().item()
        assert not torch.isinf(out).any().item()
    return out

autograd_4bit.Autograd4bitQuantLinear.forward = _patched_linear_forward

So now it checks bias values before trying to sum them.

And it fails:

Cell In[5], line 25, in _patched_linear_forward(self, x)
     23 if not self.disable_bias:
     24     assert not torch.isnan(self.bias).any().item()
---> 25     assert not torch.isinf(self.bias).any().item()
     26     out += self.bias
     27     assert not torch.isnan(out).any().item()

AssertionError: 

So it seems like something is wrong with loading the model into memory? This gets more and more strange.

alex4321 commented 1 year ago

Okay, that's even more sick:

with torch.no_grad():
     generated = model.generate(inputs=batch["input_ids"],
                                do_sample=True, use_cache=True,
                                repetition_penalty=1.1,
                                max_new_tokens=1,
                                temperature=0.5,
                                top_p=0.97,
                                top_k=40,
                                return_dict_in_generate=True,
                                output_attentions=False,
                                output_hidden_states=False,
                                output_scores=False)
Cell In[5], line 25, in _patched_linear_forward(self, x)
     23 if not self.disable_bias:
     24     assert not torch.isnan(self.bias).any().item()
---> 25     assert not torch.isinf(self.bias).any().item()
     26     out += self.bias
     27     assert not torch.isnan(out).any().item()

AssertionError: 

But:

modules = list(model.modules())
modules = [m for m in modules if isinstance(m, autograd_4bit.Autograd4bitQuantLinear)]
for i, m in enumerate(modules):
    if torch.isnan(modules[0].bias).any().item() or torch.isinf(modules[0].bias).any().item():
        print(i)

It prints nothing.

But:

with torch.no_grad():
     generated = model.generate(inputs=batch["input_ids"],
                                do_sample=True, use_cache=True,
                                repetition_penalty=1.1,
                                max_new_tokens=1,
                                temperature=0.5,
                                top_p=0.97,
                                top_k=40,
                                return_dict_in_generate=True,
                                output_attentions=False,
                                output_hidden_states=False,
                                output_scores=False)

again tells me:

Cell In[5], line 25, in _patched_linear_forward(self, x)
     23 if not self.disable_bias:
     24     assert not torch.isnan(self.bias).any().item()
---> 25     assert not torch.isinf(self.bias).any().item()
     26     out += self.bias
     27     assert not torch.isnan(out).any().item()

AssertionError: 
alex4321 commented 1 year ago

Oh! modules[0], of course. I fixed the bias-checking loop, and now it gives me a consistent result.

modules = list(model.modules())
modules = [m for m in modules if isinstance(m, autograd_4bit.Autograd4bitQuantLinear)]
for i, m in enumerate(modules):
    assert not torch.isnan(m.bias).any().item()
    assert not torch.isinf(m.bias).any().item()

So - it fails.

I will now check the same on the Linux Colab.

alex4321 commented 1 year ago

I added the same bias-NaN/Inf checking loop before the inference call inside https://colab.research.google.com/drive/1Topc3AAYEbSesK6gzj_kQdrmYyoFcRbV?usp=sharing

So if a bias is Inf or NaN after loading the model, it fails even before the actual inference. From 5 attempts:

2 failed on bias check after loading
1 failed on NaN/Inf checking during inference (ValueError: Got NaN in Layer 0 self-attention o_proj output)
2 succeeded

Besides, here are the torch versions used:

Windows machine:

pip show torch
Name: torch
Version: 2.0.1+cu117
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: c:\users\alex4321\anaconda3\envs\llama\lib\site-packages
Requires: filelock, jinja2, networkx, sympy, typing-extensions
Required-by: accelerate, alpaca-lora-4bit, peft

Colab:

Name: torch
Version: 2.0.1+cu118
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /usr/local/lib/python3.10/dist-packages
Requires: filelock, jinja2, networkx, sympy, triton, typing-extensions
Required-by: accelerate, alpaca-lora-4bit, fastai, peft, torchaudio, torchdata, torchtext, torchvision, triton
alex4321 commented 1 year ago

I guess now I will check it not with https://huggingface.co/TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g/tree/main but with the same model as in my first experiments ( https://huggingface.co/TheBloke/vicuna-7B-GPTQ-4bit-128g ).

But I expect the result to be the same.

alex4321 commented 1 year ago

Yeah. This one, https://huggingface.co/TheBloke/vicuna-7B-GPTQ-4bit-128g , gives me the same bias issue.

alex4321 commented 1 year ago

Now I will make a simplified test which runs the same model loading + bias checking loop many times, and see what the outcome will be.

alex4321 commented 1 year ago

Okay, I made this notebook: https://github.com/alex4321/alpaca_lora_4bit/blob/fix-nan-or-inf-after-linear/test-loading.ipynb / https://drive.google.com/file/d/1BVe5tgneendgjAgfqYM4Z1-bdVPVksrm/view?usp=sharing

It runs the same model loading + bias NaN/Inf check many times, each in a separate process, and records the exit codes.

On Windows I got the following exit code counts:

1    10
Name: count, dtype: int64

which means every run crashed.
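
For reference, the loop is roughly the following (a simplified sketch, not the exact notebook code; check_bias.py stands in for a script that loads the model and asserts no Inf/NaN in any Autograd4bitQuantLinear bias):

import subprocess
import pandas as pd

# Run the load-and-check script in a fresh child process several times and
# tally the exit codes: 0 = biases were clean, non-zero = the assertion failed.
exit_codes = []
for _ in range(10):
    result = subprocess.run(["python", "check_bias.py"])
    exit_codes.append(result.returncode)

print(pd.Series(exit_codes).value_counts())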

alex4321 commented 1 year ago

And inside Colab: https://drive.google.com/file/d/1BVe5tgneendgjAgfqYM4Z1-bdVPVksrm/view?usp=sharing

0    7
1    3
dtype: int64

So: 7 runs without Inf or NaN in the biases, but 3 with them.

alex4321 commented 1 year ago

@johnsmith0031 it sounds very strange, but in the end it seems like a model loading issue - just reproducing consistently on Windows and much less consistently on Linux.

alex4321 commented 1 year ago

Okay, it's probably not exactly the loading:

import torch
from alpaca_lora_4bit.autograd_4bit import load_llama_model_4bit_low_ram, Autograd4bitQuantLinear, switch_backend_to

switch_backend_to("cuda")

model, tokenizer = load_llama_model_4bit_low_ram("vicuna-7B-GPTQ-4bit-128g",
                                                 "vicuna-7B-GPTQ-4bit-128g/vicuna-7B-GPTQ-4bit-128g.safetensors",
                                                 groupsize=128)

modules = list(model.modules())
modules = [m for m in modules if isinstance(m, Autograd4bitQuantLinear)]
for i, m in enumerate(modules):
    if not m.disable_bias:
        assert not torch.isnan(m.bias).any().item(), "Failed before conversion"
        assert not torch.isinf(m.bias).any().item(), "Failed before conversion"

model.half();
for n, m in model.named_modules():
    if isinstance(m, Autograd4bitQuantLinear):
        if m.is_v1_model:
            m.zeros = m.zeros.half()
        m.scales = m.scales.half()
        m.bias = m.bias.half()

modules = list(model.modules())
modules = [m for m in modules if isinstance(m, Autograd4bitQuantLinear)]
for i, m in enumerate(modules):
    if not m.disable_bias:
        assert not torch.isnan(m.bias).any().item(), "Failed after conversion"
        assert not torch.isinf(m.bias).any().item(), "Failed after conversion"

It gives me:

Triton not found. Please run "pip install triton".
Using CUDA implementation.
Loading Model ...
Loaded the model in 4.73 seconds.
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
The safetensors archive passed at vicuna-7B-GPTQ-4bit-128g/vicuna-7B-GPTQ-4bit-128g.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
Traceback (most recent call last):
  File "c:\Users\alex4321\Documents\alpaca_lora_4bit\__test.py", line 31, in <module>
    assert not torch.isinf(m.bias).any().item(), "Failed after conversion"
AssertionError: Failed after conversion

on the Windows system.

I will probably do the Linux tests tomorrow. What makes me worry anyway:

johnsmith0031 commented 1 year ago

One possible reason is that the bias is initialized as float32, and a model without bias values never overrides the bias buffer, which keeps its random (uninitialized) values - sometimes out of range for float16, hence the inf or nan issue. Maybe one fix is to first initialize all bias values to zero and then load the weights.
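
As a quick user-side workaround, something like this right after load_llama_model_4bit_low_ram (and before half()) should also do it - a sketch, assuming the checkpoint really ships no bias tensors for these layers, so zeroing them loses nothing:

import torch
from alpaca_lora_4bit.autograd_4bit import Autograd4bitQuantLinear

def zero_quant_linear_biases(model):
    # Overwrite the (possibly uninitialized) bias buffers with zeros so that
    # "out += self.bias" can no longer inject inf/nan into the activations.
    with torch.no_grad():
        for module in model.modules():
            if isinstance(module, Autograd4bitQuantLinear) and not module.disable_bias:
                module.bias.zero_()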

alex4321 commented 1 year ago

Yes, but I still find it strange.

Maybe I will continue my debug diving today, not sure yet.

What I cannot comprehend is:

BUT:

alex4321 commented 1 year ago

Yeah, a really strange thing. I ran the following cells:

import os
import gc
import torch
from alpaca_lora_4bit.autograd_4bit import load_llama_model_4bit_low_ram, Autograd4bitQuantLinear, switch_backend_to

switch_backend_to("cuda")
if not os.path.exists("vicuna-7B-GPTQ-4bit-128g"):
    !git clone https://huggingface.co/TheBloke/vicuna-7B-GPTQ-4bit-128g
def get_biases(model):
    biases = {}
    for name, module in model.named_modules():
        if isinstance(module, Autograd4bitQuantLinear):
            if not module.disable_bias:
                biases[name] = module.bias.detach().cpu().numpy()
    return biases
model, _ = load_llama_model_4bit_low_ram("vicuna-7B-GPTQ-4bit-128g",
                                         "vicuna-7B-GPTQ-4bit-128g/vicuna-7B-GPTQ-4bit-128g.safetensors",
                                         groupsize=128)
biases1 = get_biases(model)
model.cpu()
del model
gc.collect()
torch.cuda.empty_cache()
model, _ = load_llama_model_4bit_low_ram("vicuna-7B-GPTQ-4bit-128g",
                                         "vicuna-7B-GPTQ-4bit-128g/vicuna-7B-GPTQ-4bit-128g.safetensors",
                                         groupsize=128)
biases2 = get_biases(model)

So basically it just loads the weights from the model twice and looks only at the layers which have bias enabled - so initialization should not be an issue, should it? (But I will dive into the initialization code later.) Then I compared biases1 / biases2.

And in biases1 I see:

 'model.layers.0.self_attn.k_proj': array([5.0992770e-27, 7.5670117e-43, 5.0992770e-27, ..., 7.5670117e-43,
        5.0994742e-27, 7.5670117e-43], dtype=float32),

But in biases2 I see:

 'model.layers.0.self_attn.k_proj': array([3.13695979e+32, 1.08803436e+24, 1.16357385e+24, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00], dtype=float32),

So I will dive into the model initialization sometime today or tomorrow to see where exactly things go wrong.

alex4321 commented 1 year ago

Yeah, I checked the saved model:

import safetensors.torch

tensors = safetensors.torch.load_file("vicuna-7B-GPTQ-4bit-128g/vicuna-7B-GPTQ-4bit-128g.safetensors")
[
    name
    for name in tensors.keys()
    if "model.layers.0.self_attn.k_proj" in name
]
['model.layers.0.self_attn.k_proj.g_idx',
 'model.layers.0.self_attn.k_proj.qweight',
 'model.layers.0.self_attn.k_proj.qzeros',
 'model.layers.0.self_attn.k_proj.scales']

It really does not have a bias weight here, despite the bias being set up as enabled after model instantiation.

I'll think about it later.

P.S. Setting the bias initializer to (almost) zero would help, I guess, but I wonder why there is no saved parameter for enabling/disabling the bias, or something like that. At least I think I see the source of the issue now.

(It's still strange to me why I see different behaviour on Windows and Linux then... a difference in random initialization? Hm, it still sounds kind of mad to me. But okay, I guess by the end of diving into this issue I will be like one of those Lovecraft characters realizing their ignorance was their blessing, lol.)

alex4321 commented 1 year ago

And, yeah, I see:

        if is_v1_model:
            self.register_buffer('zeros', torch.empty((out_features, 1)))
            self.register_buffer('scales', torch.empty((out_features, 1)))
            self.g_idx = None
        else:
            self.register_buffer('qzeros',
                                  torch.empty((math.ceil(in_features/groupsize), out_features // 256 * (bits * 8)), dtype=torch.int32)
                                )
            self.register_buffer('scales', torch.empty((math.ceil(in_features/groupsize), out_features)))
            self.register_buffer('g_idx', torch.tensor([i // self.groupsize  for i in range(in_features)], dtype = torch.int32))
        self.register_buffer('bias', torch.empty(out_features))
        self.register_buffer(
            'qweight', torch.empty((in_features // 256 * (bits * 8), out_features), dtype=torch.int32)
        )

Initializing the bias as empty via self.register_buffer('bias', torch.empty(out_features)) is possibly not the best idea. The torch.empty documentation says it "Returns a tensor filled with uninitialized data", so in the best case it will just break the computations, like in my case, but it may as well lead to gibberish results.

I guess that by default, when we have no bias weights, we should expect no bias at all? So initializing with zero-like values should be fine. And it's not the same problem as initializing matrix-multiplication weights with a constant (which is bad for optimization), so we may as well use zero itself.

So I guess I will do some testing here and send a patch.
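
Concretely, the change I have in mind is just swapping the uninitialized buffer for a zero-filled one - an illustrative sketch of the idea, not the actual patch:

import torch
import torch.nn as nn

class QuantLinearBiasSketch(nn.Module):
    """Illustrative only: shows the proposed bias initialization, nothing else."""

    def __init__(self, out_features: int):
        super().__init__()
        # Before: torch.empty(out_features) - whatever garbage happens to be in
        # memory, which can overflow float16 after .half() and become inf/nan.
        # After: zeros, so layers whose checkpoint has no bias simply add nothing.
        self.register_buffer('bias', torch.zeros(out_features))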

alex4321 commented 1 year ago

Closing the issue, since the solution was found (#126).