microsoft / Phi-3CookBook

This is a cookbook for getting started with Phi-3, a family of open AI models developed by Microsoft. Phi-3 models are the most capable and cost-effective small language models (SLMs) available, outperforming models of the same size and the next size up across a variety of language, reasoning, coding, and math benchmarks.
MIT License

triton.runtime.autotuner.OutOfResources: out of resource: shared memory. Reducing block sizes or `num_stages` may help. #89

Open shashank-agg opened 2 weeks ago

shashank-agg commented 2 weeks ago

Hi. I run into this error when trying to fine-tune Phi-3-small: triton.runtime.autotuner.OutOfResources: out of resource: shared memory, Required: 180224, Hardware limit: 101376. Reducing block sizes or `num_stages` may help.

The GPU is a 24GB RTX 3090.

My code is based on this QLoRA cookbook. Any ideas what the issue might be?

Minimal steps to reproduce


import torch
from datasets import load_dataset
from peft import LoraConfig, prepare_model_for_kbit_training, TaskType

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

model_name = "microsoft/Phi-3-small-8k-instruct"
dataset = load_dataset("gate369/Alpaca-Star", split="train")

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, add_eos_token=True, use_fast=True)
tokenizer.pad_token = tokenizer.bos_token
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
tokenizer.padding_side = 'left'

def create_message_column(row):
    messages = []
    user = {
        "content": f"{row['instruction']}\n Input: {row['input']}",
        "role": "user"
    }
    messages.append(user)
    assistant = {
        "content": f"{row['output']}",
        "role": "assistant"
    }
    messages.append(assistant)
    return {"messages": messages}

def format_dataset_chatml(row):
    return {"text": tokenizer.apply_chat_template(row["messages"], add_generation_prompt=False, tokenize=False)}

dataset_chatml = dataset.map(create_message_column)
dataset_chatml = dataset_chatml.map(format_dataset_chatml)
dataset_chatml = dataset_chatml.train_test_split(test_size=0.05, seed=1234)

compute_dtype = torch.bfloat16
attn_implementation = 'flash_attention_2'

bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_compute_dtype='bfloat16',
        bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
          model_name, torch_dtype=compute_dtype, trust_remote_code=True, quantization_config=bnb_config, device_map={"": 0},
          attn_implementation=attn_implementation
)

model = prepare_model_for_kbit_training(model)

args = TrainingArguments(
        output_dir="./phi-3-mini-LoRA",
        evaluation_strategy="steps",
        do_eval=True,
        optim="adamw_torch",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,
        per_device_eval_batch_size=1,
        log_level="debug",
        save_strategy="epoch",
        logging_steps=100,
        learning_rate=1e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        eval_steps=100,
        num_train_epochs=3,
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
        seed=42,
)

peft_config = LoraConfig(
        r=16,
        lora_alpha=16,
        lora_dropout=0.05,
        task_type=TaskType.CAUSAL_LM,
        target_modules=['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj"],
)

trainer = SFTTrainer(
        model=model,
        train_dataset=dataset_chatml['train'],
        eval_dataset=dataset_chatml['test'],
        peft_config=peft_config,
        dataset_text_field="text",
        max_seq_length=512,
        tokenizer=tokenizer,
        args=args,
)

trainer.train()

Any log messages given by the failure

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.11it/s]
/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/transformers/training_args.py:1494: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/transformers/training_args.py:1494: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:318: UserWarning: You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code.
  warnings.warn(
Using auto half precision backend
Currently training with a batch size of: 1
***** Running training *****
  Num examples = 397
  Num Epochs = 3
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 1,191
  Number of trainable parameters = 26,214,400
  0%|          | 0/1191 [00:00<?, ?it/s]
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/utils/checkpoint.py:464: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/home/artificial-brain/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-8k-instruct/69caae1f2acea34b26f535fecb1f2abb9a304695/triton_flash_blocksparse_attn.py:88: UserWarning: Sparse CSR tensor support is in beta state. If you miss a functionality in the sparse tensor support, please submit a feature request to https://github.com/pytorch/pytorch/issues. (Triggered internally at ../aten/src/ATen/SparseCsrTensorImpl.cpp:53.)
  x = [xi.to_sparse_csr() for xi in x]
Traceback (most recent call last):
  File "/home/artificial-brain/ai-projects/notebooks/Phi-3-finetune-qlora-python.py", line 101, in <module>
    trainer.train()
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 361, in train
    output = super().train(*args, **kwargs)
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/transformers/trainer.py", line 1932, in train
    return inner_training_loop(
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/transformers/trainer.py", line 2268, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/transformers/trainer.py", line 3307, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/transformers/trainer.py", line 3338, in compute_loss
    outputs = model(**inputs)
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/accelerate/utils/operations.py", line 822, in forward
    return model_forward(*args, **kwargs)
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/accelerate/utils/operations.py", line 810, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/peft/peft_model.py", line 1430, in forward
    return self.base_model(
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 179, in forward
    return self.model.forward(*args, **kwargs)
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/artificial-brain/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-8k-instruct/69caae1f2acea34b26f535fecb1f2abb9a304695/modeling_phi3_small.py", line 956, in forward
    outputs = self.model(
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/artificial-brain/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-8k-instruct/69caae1f2acea34b26f535fecb1f2abb9a304695/modeling_phi3_small.py", line 849, in forward
    layer_outputs = self._gradient_checkpointing_func(
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/_compile.py", line 24, in inner
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
    return fn(*args, **kwargs)
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 36, in inner
    return fn(*args, **kwargs)
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 487, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 262, in forward
    outputs = run_function(*args)
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/artificial-brain/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-8k-instruct/69caae1f2acea34b26f535fecb1f2abb9a304695/modeling_phi3_small.py", line 671, in forward
    hidden_states, self_attn_weights, present_key_values = self.self_attn(
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/artificial-brain/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-8k-instruct/69caae1f2acea34b26f535fecb1f2abb9a304695/modeling_phi3_small.py", line 616, in forward
    attn_function_output = self._apply_blocksparse_attention(
  File "/home/artificial-brain/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-8k-instruct/69caae1f2acea34b26f535fecb1f2abb9a304695/modeling_phi3_small.py", line 382, in _apply_blocksparse_attention
    context_layer = self._blocksparse_layer(
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/artificial-brain/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-8k-instruct/69caae1f2acea34b26f535fecb1f2abb9a304695/triton_blocksparse_attention_layer.py", line 165, in forward
    return blocksparse_flash_attn_padded_fwd(
  File "/home/artificial-brain/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-8k-instruct/69caae1f2acea34b26f535fecb1f2abb9a304695/triton_flash_blocksparse_attn.py", line 994, in blocksparse_flash_attn_padded_fwd
    _fwd_kernel_batch_inference[grid](
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/triton/runtime/jit.py", line 167, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 305, in run
    return self.fn.run(*args, **kwargs)
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/triton/runtime/jit.py", line 425, in run
    kernel.run(grid_0, grid_1, grid_2, kernel.num_warps, kernel.num_ctas,  # number of warps/ctas per instance
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/triton/compiler/compiler.py", line 255, in __getattribute__
    self._init_handles()
  File "/home/artificial-brain/anaconda3/envs/llama-factory/lib/python3.10/site-packages/triton/compiler/compiler.py", line 248, in _init_handles
    raise OutOfResources(self.shared, max_shared, "shared memory")
triton.runtime.autotuner.OutOfResources: out of resource: shared memory, Required: 180224, Hardware limit: 101376. Reducing block sizes or `num_stages` may help.

(image attached)

leestott commented 2 weeks ago

The error message you're encountering means the Triton attention kernel is running out of GPU shared memory during training: the kernel requests 180,224 bytes (~176 KB), while your hardware exposes only 101,376 bytes (~99 KB).

This can happen when the kernel's block sizes are too large or when num_stages (the number of pipeline stages the Triton kernel uses) isn't well suited to your hardware.

To address this, you could try reducing the block sizes and/or adjusting num_stages. Here is how you can modify your code to do so:

# Set a smaller batch size or reduce the number of parallelism stages
args = TrainingArguments(
        output_dir="./phi-3-mini-LoRA",
        evaluation_strategy="steps",
        do_eval=True,
        optim="adamw_torch",
        per_device_train_batch_size=2, # Reduce batch size to 2
        gradient_accumulation_steps=1,
        per_device_eval_batch_size=1,
        log_level="debug",
        save_strategy="epoch",
        logging_steps=100,
        learning_rate=1e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        eval_steps=100,
        num_train_epochs=3,
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
        seed=42,
        num_stages=2  # Reduce parallelism stages to 2
)

Remember that reducing the batch size or num_stages may increase training time, but it should allow you to continue fine-tuning your model.

shashank-agg commented 2 weeks ago

Hi @leestott, thanks for the reply. Reducing per_device_train_batch_size to 1 throws the same error.

Also, `num_stages` doesn't seem to be a valid argument to TrainingArguments (docs).
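
For context, `num_stages` and the block sizes are parameters of the Triton kernel itself (here, inside the model's remote code in triton_flash_blocksparse_attn.py), not of TrainingArguments. Below is a minimal, generic Triton kernel, sketched purely to illustrate where these knobs live; it is not the Phi-3 kernel:

import torch
import triton
import triton.language as tl

# Each config sets a tile size and a pipelining depth; a config that needs
# more shared memory than the GPU provides fails with OutOfResources.
@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": 1024}, num_stages=4, num_warps=4),
        triton.Config({"BLOCK_SIZE": 256}, num_stages=2, num_warps=4),
    ],
    key=["n_elements"],
)
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
add_kernel[grid](x, y, out, x.numel())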

leestott commented 2 weeks ago

Hi, similar issues have been opened in the Hugging Face discussions:

https://huggingface.co/microsoft/Phi-3-small-8k-instruct/discussions/15#665e1c81e69ab4882805c03b

https://huggingface.co/microsoft/Phi-3-small-128k-instruct/discussions/16#6663f3ff4ca290b4056d898a

shashank-agg commented 2 weeks ago

Thanks!

kinfey commented 2 weeks ago

The recommended target modules for LoRA on this model are:

"target_modules": [
    "o_proj",
    "qkv_proj"
]
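
Plugged into the reproducer above, that would give a LoraConfig along these lines (a sketch; the fused module names come from the recommendation above, so double-check them against the loaded checkpoint, e.g. with model.named_modules()):

from peft import LoraConfig, TaskType

# Sketch: LoRA config using the recommended fused attention modules.
# Verify the exact module names on the loaded model before training.
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
    target_modules=["o_proj", "qkv_proj"],
)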