Closed: qmin2 closed this issue 3 weeks ago
This error occurs because of a state discrepancy between the two GPUs: one rank decides to skip the batch (and the collective operations that come with it) while the other does not, so the waiting rank eventually hits the distributed timeout. Changing the code from
```python
if (inputs['attention_mask'] == 0).any():
    print("Skipping batch due to presence of padding.")
    continue
```
to
```python
padding_present = (inputs['attention_mask'] == 0).any()
padding_present = accelerator.gather(padding_present)  # gather the flag from all GPUs so every rank sees the same value
if padding_present.any().item():
    if accelerator.is_main_process:
        print("Skipping batch due to presence of padding.")
    continue
```
resolves the error: every rank now makes the same skip decision.
System Info
Information
Tasks
- `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
Reproduction
This is my training loop
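Below is a minimal sketch of a loop matching the setup described in this issue (Accelerate + DeepSpeed ZeRO-2, validation every 500 steps, padding-skip check). The helper `build_training_objects` and the variable `eval_interval` are illustrative placeholders, not the actual script.

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Hypothetical helper that builds the Llama-3-8B model, optimizer, and dataloaders.
model, optimizer, train_dataloader, eval_dataloader = build_training_objects()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

eval_interval = 500  # validation every 500 steps, as described below

model.train()
for step, inputs in enumerate(train_dataloader, start=1):
    outputs = model(**inputs)
    accelerator.backward(outputs.loss)
    optimizer.step()
    optimizer.zero_grad()

    if step % eval_interval == 0:
        model.eval()
        with torch.no_grad():
            for eval_inputs in eval_dataloader:
                # The skip decision must be identical on every rank (see the fix
                # at the top); otherwise one GPU waits on a collective the other
                # never reaches and the job hits the NCCL timeout.
                padding_present = (eval_inputs["attention_mask"] == 0).any()
                padding_present = accelerator.gather(padding_present)
                if padding_present.any().item():
                    if accelerator.is_main_process:
                        print("Skipping batch due to presence of padding.")
                    continue
                model(**eval_inputs)
        model.train()
```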
Expected behavior
I am fine-tuning the Llama 3 8B model using DeepSpeed ZeRO Stage 2 with the Accelerate library on two A100 80GB GPUs on a Slurm cluster. I set validation to run every 500 steps, but I encounter the timeout error below at the 500-step mark, causing training to terminate. Notably, the validation dataset is very small, with only 26 examples.
However, when I reduce the validation interval to every 2 steps, this timeout error does not occur. Occasionally, the timeout error also appears at 200 steps.
What could be causing this issue, and are there any recommendations or workarounds to resolve it? Thank you for your assistance!
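If the skip decision is kept synchronized across ranks (as in the fix above) and timeouts still occur because some collectives are simply slow, one possible mitigation (not a root-cause fix) is to raise the distributed process group timeout when constructing the `Accelerator`. `InitProcessGroupKwargs` and `kwargs_handlers` are part of the Accelerate API; the two-hour value below is only an example.

```python
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Allow collectives up to 2 hours before NCCL aborts the job (example value).
process_group_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))
accelerator = Accelerator(kwargs_handlers=[process_group_kwargs])
```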