shashank-agg opened this issue 2 weeks ago
The error message you're encountering suggests that the Triton autotuner is running out of shared memory while training your model. This can happen when the block sizes are too large or when `num_stages` (the number of pipeline stages) isn't well matched to your hardware. To address this, you could try reducing the block sizes and/or adjusting `num_stages`. Here is how you could modify your code to do so:
```python
# Set a smaller batch size or reduce the number of pipeline stages
args = TrainingArguments(
    output_dir="./phi-3-mini-LoRA",
    evaluation_strategy="steps",
    do_eval=True,
    optim="adamw_torch",
    per_device_train_batch_size=2,  # reduce batch size to 2
    gradient_accumulation_steps=1,
    per_device_eval_batch_size=1,
    log_level="debug",
    save_strategy="epoch",
    logging_steps=100,
    learning_rate=1e-4,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    eval_steps=100,
    num_train_epochs=3,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    seed=42,
    num_stages=2,  # reduce pipeline stages to 2
)
```
Remember that reducing the batch size or `num_stages` may increase training time, but it should allow you to continue fine-tuning your model.
Hi @leestott, thanks for the reply.
Reducing `per_device_train_batch_size` to 1 throws the same error. Also, `num_stages` doesn't seem to be a valid argument to `TrainingArguments` (docs).
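(For reference, in Triton `num_stages` is a property of the per-kernel autotuner configs rather than anything the Trainer exposes. A minimal sketch with an unrelated toy kernel, just to show where the knob actually lives:)

```python
import torch
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        # Smaller blocks and fewer stages need less shared memory.
        triton.Config({"BLOCK_SIZE": 64}, num_stages=2, num_warps=4),
        triton.Config({"BLOCK_SIZE": 128}, num_stages=3, num_warps=4),
    ],
    key=["n_elements"],
)
@triton.jit
def scale_kernel(x_ptr, out_ptr, n_elements, scale, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * scale, mask=mask)

x = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
scale_kernel[grid](x, out, x.numel(), 2.0)
```

Since the failing configs live inside the model's own kernel code, they presumably can't be changed through `TrainingArguments` at all.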
Hi, similar issues have been opened on the Hugging Face discussions:
https://huggingface.co/microsoft/Phi-3-small-8k-instruct/discussions/15#665e1c81e69ab4882805c03b
https://huggingface.co/microsoft/Phi-3-small-128k-instruct/discussions/16#6663f3ff4ca290b4056d898a
Thanks!
The recommended target modules for the LoRA adapter are:
```json
"target_modules": [
    "o_proj",
    "qkv_proj"
]
```
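For anyone landing here, a sketch of how those modules plug into a PEFT config (the `r`, `lora_alpha`, and dropout values below are illustrative assumptions, not taken from this thread):

```python
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,               # illustrative rank
    lora_alpha=32,      # illustrative scaling factor
    lora_dropout=0.05,  # illustrative dropout
    target_modules=["o_proj", "qkv_proj"],  # the projections recommended above
    task_type="CAUSAL_LM",
)
```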
Hi. I run into this error when trying to fine-tune Phi-3-small:
```
triton.runtime.autotuner.OutOfResources: out of resource: shared memory, Required: 180224, Hardware limit: 101376. Reducing block sizes or `num_stages` may help.
```
The GPU is a 24 GB RTX 3090. If I'm reading the numbers right, the kernel wants 176 KB of shared memory while the card's per-block limit is 99 KB, so this doesn't look like ordinary memory pressure. My code is based on this qlora cookbook. Any ideas what the issue might be?
Minimal steps to reproduce
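Roughly, the setup follows the cookbook's QLoRA recipe; the sketch below is my reconstruction rather than a verbatim copy, and exact argument values may differ:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Assumed setup based on the cookbook; exact flags may differ.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-small-8k-instruct",
    quantization_config=bnb_config,
    trust_remote_code=True,  # Phi-3-small ships custom attention code
)
# ...then attach a LoRA adapter and train as in the cookbook.
```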