try adding this to the training script:

```python
import os
import torch

# must be set before the NCCL process group is initialized so NCCL picks it up
if torch.cuda.is_available():
    os.environ["NCCL_SOCKET_NTIMEO"] = "2000000"
```
there's also the `InitProcessGroupKwargs` method of setting it:

```python
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Create the custom configuration: raise the collective timeout to 1.5 hours (5400 s)
process_group_kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=5400))

# args and accelerator_project_config come from the surrounding training script
accelerator = Accelerator(
    gradient_accumulation_steps=args.gradient_accumulation_steps,
    mixed_precision=args.mixed_precision,
    log_with=args.report_to,
    project_config=accelerator_project_config,
    kwargs_handlers=[process_group_kwargs],
)
```
Thanks @bghira, I think this solves the issue; otherwise I may increase it to 6 hours. Here is a screenshot of the mapping step. It is taking more than 5 hours in total. Is that normal? I am using the MSCOCO image dataset.
it is quite typical and a common workaround, but i don't have full-time access to multi-gpu systems - would love to experiment with different workarounds.
they should be trivial to test by reducing the accelerator nccl timeout to a very low value and then trying the workaround while running a blocking task for longer than the timeout.
i'm assuming there's some kind of way to signal to nccl that you're still "busy" and "alive".
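a minimal sketch of that kind of test (assuming an `accelerate launch` run with at least 2 processes; the 30-second timeout and 120-second sleep are just illustrative values, not anything from the actual training scripts):

```python
import time
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# deliberately tiny timeout so a stalled rank trips the NCCL watchdog quickly
kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=30))
accelerator = Accelerator(kwargs_handlers=[kwargs])

if accelerator.is_main_process:
    # simulate a long single-process step, e.g. dataset mapping
    time.sleep(120)

# the other ranks are already waiting here; without a workaround this should
# fail once the 30-second timeout is exceeded
accelerator.wait_for_everyone()
accelerator.print("survived the blocking step")
```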
for what it's worth, this is why i cache things to disk in simpletuner: you can aim it at e.g. a cloudflare R2 object storage bucket with training data, so that you can do preprocessing on one system and resume into the actual training job from a bigger/more expensive system.
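a rough sketch of that split, using plain Hugging Face `datasets` rather than simpletuner's actual caching code; the dataset id, the `preprocess` function, and the paths are placeholders:

```python
from datasets import load_dataset, load_from_disk

# --- on the (cheap) preprocessing machine ---
dataset = load_dataset("your/dataset", split="train")   # placeholder dataset id
dataset = dataset.map(preprocess, num_proc=8)           # preprocess() is your own transform
dataset.save_to_disk("/mnt/cache/preprocessed")         # local path or a mounted/synced bucket

# --- later, on the multi-gpu training machine ---
dataset = load_from_disk("/mnt/cache/preprocessed")     # no re-mapping needed at train time
```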
Thanks for helping, @bghira!
@humanely the training scripts are not optimized to give you the best throughput during training, so we cannot guarantee that. Please refer to the simpletuner repo of @bghira, which has better offerings in this regard.
Closing since the issue seems to be resolved now.
Describe the bug
While the training script **train_text_to_image_lora_sdxl.py** runs perfectly fine on a single-GPU A100 machine, it fails to complete the data mapping on machines with multiple GPUs. I have tried the following: `export NCCL_P2P_DISABLE=1` and `export NCCL_IB_DISABLE=1`.
Reproduction
Execute the example training script on an A100 machine with 4 GPUs.
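An illustrative invocation (the script path assumes the usual diffusers examples layout; the remaining model/dataset arguments follow the example's README and are omitted here):

```bash
accelerate launch --multi_gpu --num_processes 4 \
  examples/text_to_image/train_text_to_image_lora_sdxl.py \
  --pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0 \
  ...
```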
Logs
System Info
Who can help?
@sayakpaul