Open chenxingyu-cs opened 1 year ago
@ejguan Hi, could you share any insights you have? Thanks a lot!
Are you running multiple DDP jobs at the same time?
@ejguan I'm only running one DDP job. The DDP job is initialized by torchx. I got these errors while running the job on AWS Batch and SageMaker, where I believe all the instances are isolated and no other job should be running.
🐛 Describe the bug
Hi, we found some strange behavior while using DataLoader2. Here are some details about the issue.
traininig_args.eval_steps
training steps.IterableDataPipe
with ShardingFilterIterDataPipe
Can you help provide some context on what could be the root cause and how to fix this? Thanks!
Log:
Versions