Open HumzaSami00 opened 1 year ago
Are your GPU drivers up to date?
Your code looks reasonable, but you could still try the following:
from transformers import TrainingArguments

# output_dir is assumed to be defined earlier in your script
training_arguments = TrainingArguments(
    per_device_train_batch_size=4,  # Reduced batch size
    gradient_accumulation_steps=8,  # Increased gradient accumulation steps
    optim="paged_adamw_32bit",
    learning_rate=4e-4,
    fp16=True,
    max_grad_norm=0.3,
    num_train_epochs=3,
    warmup_ratio=0.05,
    logging_steps=5,
    save_total_limit=5,  # Keep at most 5 checkpoints on disk
    save_strategy="steps",
    save_steps=1,  # Saves a checkpoint at every optimizer step
    group_by_length=True,
    output_dir=output_dir,
    report_to="tensorboard",
    save_safetensors=True,
    lr_scheduler_type="cosine",
    seed=42,
)
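As a rough sanity check on those settings, the sketch below estimates the effective batch size and how many times a checkpoint would be written. The helper `training_schedule` and the dataset size of 10,000 samples are made up for illustration, and the Trainer's own rounding of steps per epoch may differ slightly, so treat the numbers as approximate.

```python
import math


def training_schedule(num_samples, per_device_batch, grad_accum, epochs,
                      save_steps, num_gpus=1):
    """Approximate the schedule implied by a set of training arguments.

    Hypothetical helper, not part of transformers.
    """
    # Samples consumed per optimizer step across all devices.
    effective_batch = per_device_batch * grad_accum * num_gpus
    # Optimizer steps per epoch (rounded up; the Trainer may round differently).
    steps_per_epoch = max(1, math.ceil(num_samples / effective_batch))
    total_steps = steps_per_epoch * epochs
    # With save_strategy="steps", a save is triggered every `save_steps` steps.
    checkpoint_saves = total_steps // save_steps
    return effective_batch, total_steps, checkpoint_saves


# Assumed dataset size of 10,000 samples on a single GPU.
effective_batch, total_steps, saves = training_schedule(
    num_samples=10_000, per_device_batch=4, grad_accum=8, epochs=3, save_steps=1
)
print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {total_steps}, checkpoint saves: {saves}")
```

With `save_steps=1` every optimizer step triggers a save, so if the spike really is tied to checkpointing, a larger `save_steps` value would at least make it happen less often. `save_total_limit=5` only caps how many checkpoints are kept on disk, not how often they are written.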
I have a question about my fine-tuning pipeline, specifically a memory usage spike that occurs when the model saves a checkpoint during training. The spike causes a sudden CUDA memory error.
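One way to confirm that the spike really comes from the save itself is to measure peak GPU memory around a manual save call. The following is only a minimal sketch: `track_peak_memory` and the `report` dictionary are hypothetical names, the PyTorch import is guarded so the snippet still runs on a machine without PyTorch or CUDA, and it only measures memory allocated through PyTorch's allocator on the current device.

```python
import contextlib

try:
    import torch
except ImportError:  # lets the sketch run without PyTorch installed
    torch = None


@contextlib.contextmanager
def track_peak_memory(label, report):
    """Record the extra peak GPU memory (MiB) used inside the with-block.

    Hypothetical helper; stores None when CUDA is not available.
    """
    cuda_ok = torch is not None and torch.cuda.is_available()
    if cuda_ok:
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()
        before = torch.cuda.memory_allocated()
    try:
        yield
    finally:
        if cuda_ok:
            torch.cuda.synchronize()
            peak = torch.cuda.max_memory_allocated()
            report[label] = (peak - before) / 2**20
        else:
            report[label] = None


report = {}
with track_peak_memory("checkpoint_save", report):
    # Assumption: replace this with the save call you want to measure,
    # for example trainer.save_model(output_dir).
    pass
print(report)
```

A large value for `checkpoint_save` compared with the same measurement around a normal training step would point at the save path rather than at the forward or backward pass.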
I would like to provide the following information, including GPU usage logs and code snippets for reference: