mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training
https://streaming.docs.mosaicml.com
Apache License 2.0
Running into "FileExistsError: [Errno 17] File exists: '/000000_epoch_shape'" even with single GPU #802

Open deepanshu-a2z opened 2 hours ago

deepanshu-a2z commented 2 hours ago

Environment

Running in a Docker instance with a GitHub checkout of streaming at commit "3a6a5490678a2efa028ed96ba9b8813fba8687eb".

To reproduce

Steps to reproduce the behavior: create a PyTorch Lightning trainer object on a system with 1 GPU and 24 CPU nodes, then call fit:

trainer = pl.Trainer(
    max_epochs=5,
    accelerator="gpu",
    precision="16-mixed",
    devices=-1,
    val_check_interval=0.3,
    callbacks=callbacks,
    fast_dev_run=config['running']['fast_dev_run'],
)

trainer.fit(model, train_dataloader, val_dataloaders=eval_dataloader)

Expected behavior

With the dataloader's num_workers set to 4, everything works and the training loop progresses as expected.

Additional context

When I increase num_workers to 16 or more, I keep getting "FileExistsError: [Errno 17] File exists: '/000000_epoch_shape'". Previously I could clear this by deleting the streaming dirs or rebooting the machine, but now even that has stopped working.
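For context, the leading slash in '/000000_epoch_shape' suggests a POSIX shared-memory name rather than a regular file, so the error would be a name collision on a shared-memory block that an earlier or concurrent process left behind. The sketch below uses only the Python standard library and does not call streaming itself. The block name and the helper functions are hypothetical, and it is an assumption that the library's failure follows this same pattern. As far as I know, the library's docs also describe a `streaming.base.util.clean_stale_shared_memory()` helper intended for this kind of cleanup.

```python
import uuid
from multiprocessing import shared_memory


def create_or_report(name, size=8):
    """Try to create a named block; return (block, None) or (None, error)."""
    try:
        return shared_memory.SharedMemory(name=name, create=True, size=size), None
    except FileExistsError as err:
        # Raised when a block with this name already exists.
        return None, err


def remove_stale(name):
    """Unlink a leftover block by name; return True if one was removed."""
    try:
        stale = shared_memory.SharedMemory(name=name)  # attach, do not create
    except FileNotFoundError:
        return False
    stale.close()
    stale.unlink()
    return True


# Hypothetical name, randomized so the demo does not touch real blocks.
name = f"demo_{uuid.uuid4().hex[:8]}_epoch_shape"

first, err_first = create_or_report(name)    # succeeds
second, err_second = create_or_report(name)  # collides with the first block
first.close()  # simulate a process that exits without unlinking

removed = remove_stale(name)                 # clean up the leftover block
third, err_third = create_or_report(name)    # now succeeds again
third.close()
third.unlink()
```

If the collision comes from two live processes racing to create the same name, rather than from a leftover block, cleanup alone would not help, which may be why rebooting no longer clears it.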

deepanshu-a2z commented 2 hours ago

This seems to be different from https://github.com/Lightning-AI/pytorch-lightning/issues/20226 and https://github.com/mosaicml/streaming/issues/767, since the issue is present even without any distributed-computing configuration.