Closed: simaotwx closed this issue 1 month ago.
Hey @simaotwx - sorry we missed this. May I ask which torch, huggingface, and peft versions you were seeing this issue with? Are you still seeing it?
It's okay, no worries.
Versions:
huggingface-hub 0.20.3
peft 0.8.2
sentence-transformers 2.3.1
torch 2.2.0
torchaudio 2.2.0
torchdata 0.7.1
torchinfo 1.8.0
torchmetrics 1.3.1
torchtext 0.17.0
torchvision 0.17.0
transformers 4.37.2
I am still seeing the same issue. I have since moved on and am no longer using Ludwig, so this isn't relevant for me anymore, but the problem still exists.
While looking at this issue, I noticed that I had copied the wrong config. I've updated it to match what I just tested.
Hello, I'm facing the same issue. Any ideas?
@simaotwx Did your example finally work? Thank you.
Describe the bug
When trying to train (fine-tune) a Mistral model, an error occurs when quantization is not used.
To Reproduce
Steps to reproduce the behavior:
Please provide code, yaml config file and a sample of data in order to entirely reproduce the issue. Issues that are not reproducible will be ignored.
Config file (model.yaml):
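(The reporter's actual model.yaml is not reproduced in this excerpt. Purely as an illustration, a minimal Ludwig LLM fine-tuning config without quantization, assuming a Mistral-7B base model and the alpaca dataset's column names, might look like the sketch below; every value in it is an assumption, not the original config.)

```yaml
# Hypothetical sketch, not the original model.yaml from this report.
model_type: llm
base_model: mistralai/Mistral-7B-v0.1  # assumed base checkpoint

input_features:
  - name: instruction
    type: text

output_features:
  - name: output
    type: text

prompt:
  template: |
    ### Instruction:
    {instruction}

    ### Input:
    {input}

    ### Response:

adapter:
  type: lora

# Note: no `quantization` section, since the error only occurs without quantization.

trainer:
  type: finetune
  batch_size: 1
  epochs: 1
  learning_rate: 0.0001
```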
Command:
ludwig train --config model.yaml --dataset "ludwig://alpaca"
Experiment description:
User-specified config (with upgrades):
Expected behavior
Training starts.
Screenshots
Environment (please complete the following information):
Additional context
With 8-bit quantization it works:
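(The working snippet is not shown in this excerpt; presumably it is the same config with a quantization block added. The following sketch uses Ludwig's quantization option and is an assumption rather than the reporter's exact snippet.)

```yaml
# Assumed addition for the working case: load the base model in 8-bit.
quantization:
  bits: 8
```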