rmihaylov / falcontune

Tune any FALCON in 4-bit
Apache License 2.0

RuntimeError: No available kernel. Aborting execution. #11

Open RealCalumPlays opened 1 year ago

RealCalumPlays commented 1 year ago

Any ideas? Full log below:

Traceback (most recent call last):
  File "/home/cosmos/miniconda3/envs/ftune/bin/falcontune", line 33, in <module>
    sys.exit(load_entry_point('falcontune==0.1.0', 'console_scripts', 'falcontune')())
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/run.py", line 87, in main
    args.func(args)
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/finetune.py", line 162, in finetune
    trainer.train()
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/transformers/trainer.py", line 1664, in train
    return inner_training_loop(
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/transformers/trainer.py", line 1940, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/transformers/trainer.py", line 2735, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/transformers/trainer.py", line 2767, in compute_loss
    outputs = model(**inputs)
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/peft/peft_model.py", line 678, in forward
    return self.base_model(
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 1070, in forward
    transformer_outputs = self.transformer(
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 965, in forward
    outputs = block(
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 698, in forward
    attn_outputs = self.self_attention(
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 337, in forward
    attn_output = F.scaled_dot_product_attention(
RuntimeError: No available kernel. Aborting execution.

EDIT: To rule that out: CUDA is installed in the kernel modules, on the system, and in the environment. Using Python 3.10.6.
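
For anyone checking the same thing, one quick way to confirm the environment actually sees the GPU (plain PyTorch calls, nothing falcontune-specific):

import torch

# Report the torch build, the CUDA toolkit it was compiled against,
# and whether a CUDA device is visible from this environment.
print(torch.__version__, torch.version.cuda)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))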

itjuba commented 1 year ago

same error here on Tesla V100-SXM2-32GB

rmihaylov commented 1 year ago

There is a choice of three kernels:

torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False)

Currently, only flash attention is on. Try enabling the other options as well.
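
A minimal sketch of that change around the attention call in falcontune/model/falcon/model.py (query_layer_, key_layer_ and value_layer_ are the variables already defined in that block):

import torch
import torch.nn.functional as F

# Let PyTorch fall back to the math or memory-efficient backends when
# the fused flash kernel is unavailable for this GPU/dtype/shape.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=True, enable_mem_efficient=True):
    attn_output = F.scaled_dot_product_attention(
        query_layer_, key_layer_, value_layer_, None, 0.0, is_causal=True
    )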

chintan-donda commented 1 year ago

same error here on Tesla V100-SXM2-32GB

Same issue for me as well on the same machine, with the details below.

OS: Ubuntu 18.04.5 LTS
Libs:

bitsandbytes==0.39.0
transformers==4.29.2
triton==2.0.0
sentencepiece==0.1.99
datasets==2.12.0
peft==0.3.0
torch==2.0.1+cu118
accelerate==0.19.0
safetensors==0.3.1
einops==0.6.1
wandb==0.15.3
scipy==1.10.1

chintan-donda commented 1 year ago

There is a choice of three kernels:

torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False)

Currently, only flash attention is on. Try enabling the other options as well.

Doing this gives the below error:

Traceback (most recent call last):
  File "falcontune/run.py", line 93, in <module>
    main()
  File "falcontune/run.py", line 89, in main 
    args.func(args)
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/falcontune-0.1.0-py3.8.egg/falcontune/finetune.py", line 162, in fin
etune
    trainer.train()
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/transformers/trainer.py", line 1664, in train
    return inner_training_loop(
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/transformers/trainer.py", line 1940, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/transformers/trainer.py", line 2735, in training_step
    loss = self.compute_loss(model, inputs)  
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/transformers/trainer.py", line 2767, in compute_loss
    outputs = model(**inputs)
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/peft/peft_model.py", line 678, in forward
    return self.base_model(
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/falcontune-0.1.0-py3.8.egg/falcontune/model/falcon/model.py", line 1070, in forward
    transformer_outputs = self.transformer(  
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/falcontune-0.1.0-py3.8.egg/falcontune/model/falcon/model.py", line 965, in forward
    outputs = block(
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/falcontune-0.1.0-py3.8.egg/falcontune/model/falcon/model.py", line 634, in forward
    attn_outputs = self.self_attention(
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/falcontune-0.1.0-py3.8.egg/falcontune/model/falcon/model.py", line 486, in forward
    fused_qkv = self.query_key_value(hidden_states)  # [batch_size, seq_length, 3 x hidden_size]
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/falcontune-0.1.0-py3.8.egg/falcontune/model/lora.py", line 54, in forward
    result = self.quant_class.forward(self, x)
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/falcontune-0.1.0-py3.8.egg/falcontune/backend/triton/quantlinear.py", line 13, in forward
    out = AutogradMatmul.apply(
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 106, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/falcontune-0.1.0-py3.8.egg/falcontune/backend/triton/autograd.py", line 11, in forward
    output = tu.triton_matmul(x, qweight, scales, qzeros, g_idx, bits, maxq)
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/falcontune-0.1.0-py3.8.egg/falcontune/backend/triton/triton_utils.py", line 246, in triton_matmul
    matmul_248_kernel[grid](input, qweight, output,
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/falcontune-0.1.0-py3.8.egg/falcontune/backend/triton/custom_autotune.py", line 110, in run
    return self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages, **kwargs, **config.kwargs)
  File "<string>", line 24, in matmul_248_kernel
ValueError: Pointer argument (at 1) cannot be accessed from Triton (cpu tensor?)
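
That ValueError usually means one of the tensors reaching the Triton matmul (here pointer argument 1, the quantized weight) still lives on the CPU. A quick check, assuming access to the loaded model object, is to list anything left off the GPU:

# List every parameter/buffer still on the CPU; for the Triton backend
# all quantized weights must already be on the CUDA device.
for name, t in list(model.named_parameters()) + list(model.named_buffers()):
    if t.device.type == "cpu":
        print(name, tuple(t.shape), t.device)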

fpena06 commented 1 year ago

I was having this same issue on Google Colab with a V100; switching to an A100 fixed it for me.

chintan-donda commented 1 year ago

Any fix for this? I'm still getting this issue.

wyklq commented 1 year ago

On the V100 we need to enable the mem_efficient mode, since that GPU doesn't support native flash attention.

--- a/falcontune/model/falcon/model.py
+++ b/falcontune/model/falcon/model.py
@@ -523,7 +523,7 @@ class Attention40B(nn.Module):
             key_layer_ = key_layer.reshape(batch_size, self.num_heads, -1, self.head_dim)
             value_layer_ = value_layer.reshape(batch_size, self.num_heads, -1, self.head_dim)

-            with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
+            with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=True):
                 attn_output = F.scaled_dot_product_attention(
                     query_layer_, key_layer_, value_layer_, None, 0.0, is_causal=True
                 )
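
For reference, a standalone probe (a sketch using only public torch 2.0 APIs; the tensor shapes are arbitrary) that reports which SDP backends actually run on the current GPU:

import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)

# Try each scaled_dot_product_attention backend in isolation.
for name, flags in [
    ("flash", dict(enable_flash=True, enable_math=False, enable_mem_efficient=False)),
    ("mem_efficient", dict(enable_flash=False, enable_math=False, enable_mem_efficient=True)),
    ("math", dict(enable_flash=False, enable_math=True, enable_mem_efficient=False)),
]:
    try:
        with torch.backends.cuda.sdp_kernel(**flags):
            F.scaled_dot_product_attention(q, k, v, None, 0.0, is_causal=True)
        print(name, "OK")
    except RuntimeError as e:
        print(name, "failed:", e)

On a V100 (sm70) the flash backend is expected to fail while mem_efficient succeeds, which matches both the diff above and the A100 observation earlier in the thread.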