This seems to be unlike https://github.com/Lightning-AI/pytorch-lightning/issues/20226 or https://github.com/mosaicml/streaming/issues/767, since the problem appears even without any distributed-computing configuration.
Environment
Running in a Docker container with a GitHub checkout of streaming at commit 3a6a5490678a2efa028ed96ba9b8813fba8687eb.
To reproduce
Steps to reproduce the behavior: create a PyTorch Lightning Trainer object on a system with 1 GPU and 24 CPU cores (a minimal sketch follows).
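For context, here is a minimal sketch of the setup. The model, paths, and shard field name are placeholders, not my real training code; it assumes mosaicml-streaming and lightning are installed:

```python
import lightning as L
import torch
from streaming import StreamingDataset
from torch.utils.data import DataLoader


class ToyModule(L.LightningModule):
    """Placeholder module; the real model is irrelevant to the error."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x = batch['x']  # assumes the shards store a float field named 'x'
        return self.layer(x).mean()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-3)


dataset = StreamingDataset(
    local='/tmp/streaming_cache',     # hypothetical local cache dir
    remote='s3://my-bucket/dataset',  # hypothetical remote location
    shuffle=True,
    batch_size=32,
)

# num_workers=4 runs fine; raising it to 16+ triggers the FileExistsError below.
loader = DataLoader(dataset, batch_size=32, num_workers=16)

trainer = L.Trainer(accelerator='gpu', devices=1, max_epochs=1)
trainer.fit(ToyModule(), train_dataloaders=loader)
```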
Expected behavior
With num_workers=4 for the DataLoader, it works seamlessly and the training loop progresses as expected.
Additional context
When I increase num_workers to 16 or beyond, I keep getting:

FileExistsError: [Errno 17] File exists: '/000000_epoch_shape'

Usually I was able to clear this by deleting the streaming local directories or rebooting the machine, but now even that has stopped working.
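For reference, this is the cleanup I have been falling back on before constructing the dataset. It uses streaming's clean_stale_shared_memory() utility from streaming.base.util plus a manual wipe of a hypothetical local cache path; even this no longer resolves the error:

```python
# Workaround sketch: remove streaming's leftover shared-memory objects
# (e.g. the 000000_epoch_shape entry under /dev/shm) before building the
# dataset, then wipe the local cache as the manual equivalent.
import shutil

from streaming.base.util import clean_stale_shared_memory

clean_stale_shared_memory()  # frees orphaned shm left by killed workers
shutil.rmtree('/tmp/streaming_cache', ignore_errors=True)  # hypothetical local dir
```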