pytorch / data

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.

Calling __iter__ twice on DataLoader2 causes hang with MPRS #1198

Open JohnHBrock opened 1 year ago

JohnHBrock commented 1 year ago

🐛 Describe the bug

I'm aware torchdata isn't being maintained anymore, but thought I'd post this here for posterity:

When __iter__ is called twice on the same instance of DataLoader2, trying to iterate over the second iterator results in a hang. One of the worker processes terminates due to the exception "Can not reset while we are still waiting response for previous request", although this isn't obvious unless you run a debugger. The exception is raised when one of the workers calls nonblocking_next() here. Once this worker dies, the data loader is deadlocked.

I noticed this when using Lightning with torchdata: Lightning's fit will run a few iterations of the validation loop as a sanity check before training, then run the training loop, followed by the validation loop again. This second validation loop never finishes because of the hang.
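Not from the original report, but a minimal sketch of that Lightning pattern (the toy module, datapipe, and trainer settings are illustrative assumptions). By default Lightning keeps the loader returned by val_dataloader and calls iter() on it once for the sanity check and again for each validation epoch, which is what triggers the double __iter__:

import pytorch_lightning as pl
import torch
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService
from torch.utils.data.datapipes.iter.sharding import SHARDING_PRIORITIES
from torchdata.datapipes.iter import IterableWrapper

class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(1, 1)

    def _make_loader(self):
        dp = IterableWrapper(
            [torch.ones(1) * i for i in range(100)]
        ).sharding_round_robin_dispatch(SHARDING_PRIORITIES.MULTIPROCESSING)
        return DataLoader2(dp, reading_service=MultiProcessingReadingService(num_workers=2))

    def train_dataloader(self):
        return self._make_loader()

    def val_dataloader(self):
        # Lightning reuses this loader: iter() is called for the sanity check
        # and again for the post-training validation epoch.
        return self._make_loader()

    def training_step(self, batch, batch_idx):
        return self.layer(batch).mean()

    def validation_step(self, batch, batch_idx):
        pass

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

if __name__ == "__main__":
    trainer = pl.Trainer(max_epochs=1, limit_train_batches=5, limit_val_batches=5)
    trainer.fit(ToyModule())  # the post-training validation loop never finishes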

Code to reproduce:

from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService
from torch.utils.data.datapipes.iter.sharding import SHARDING_PRIORITIES
from torchdata.datapipes.iter import IterableWrapper

def main():
    # Round-robin dispatch shards elements across the worker processes.
    dp = IterableWrapper([1, 2, 3, 4, 5, 6, 7] * 100).sharding_round_robin_dispatch(SHARDING_PRIORITIES.MULTIPROCESSING)
    reading_service = MultiProcessingReadingService(num_workers=2, main_prefetch_cnt=0, worker_prefetch_cnt=0)

    dataloader = DataLoader2(dp, reading_service=reading_service)
    print(next(iter(dataloader)))  # first __iter__ call: prints 1
    print(next(iter(dataloader)))  # second __iter__ call: hangs here
    print("done")                  # never reached

if __name__ == "__main__":
    main()

This results in the output:

1

and nothing else. The data loader processes continue to run, except for the one worker that terminated with the reset exception mentioned above.
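For contrast, a minimal sketch (same setup as the repro above, not part of the original report): creating the iterator once and drawing several elements from it works, because no second __iter__ call, and therefore no reset, is involved:

from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService
from torch.utils.data.datapipes.iter.sharding import SHARDING_PRIORITIES
from torchdata.datapipes.iter import IterableWrapper

def main():
    dp = IterableWrapper([1, 2, 3, 4, 5, 6, 7] * 100).sharding_round_robin_dispatch(SHARDING_PRIORITIES.MULTIPROCESSING)
    reading_service = MultiProcessingReadingService(num_workers=2, main_prefetch_cnt=0, worker_prefetch_cnt=0)
    dataloader = DataLoader2(dp, reading_service=reading_service)

    it = iter(dataloader)  # single __iter__ call
    print(next(it))        # 1
    print(next(it))        # 2 -- no hang: the existing iterator is reused
    print("done")

if __name__ == "__main__":
    main()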

Versions

Collecting environment information...
PyTorch version: 2.0.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 13.4.1 (x86_64)
GCC version: Could not collect
Clang version: 14.0.3 (clang-1403.0.22.14.1)
CMake version: version 3.26.4
Libc version: N/A

Python version: 3.8.17 (default, Jul 19 2023, 14:02:02) [Clang 14.0.3 (clang-1403.0.22.14.1)] (64-bit runtime)
Python platform: macOS-13.4.1-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU: Intel(R) Core(TM) i5-8279U CPU @ 2.40GHz

Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] torch==2.0.1
[pip3] torchdata==0.6.1
[conda] Could not collect

JohnHBrock commented 1 year ago

A possible workaround is to wrap DataLoader2 so that __iter__ recreates the DataLoader2 from scratch each time, rather than resetting the existing instance. For example:

from torchdata.dataloader2 import DataLoader2

class DataLoader2Workaround:
    """Wraps DataLoader2 so that each __iter__ builds a fresh instance instead
    of resetting the existing one (the reset path is what hangs)."""

    def __init__(self, datapipe, reading_service):
        self.datapipe = datapipe
        self.reading_service = reading_service
        self.dataloader2 = None

    def _create_dataloader2(self):
        if self.dataloader2 is not None:
            # Shut down the previous instance's workers so they don't leak.
            self.dataloader2.shutdown()
        self.dataloader2 = DataLoader2(self.datapipe, reading_service=self.reading_service)

    def __getattr__(self, attr):
        # Delegate everything else to the underlying DataLoader2.
        if self.dataloader2 is None:
            self._create_dataloader2()
        return getattr(self.dataloader2, attr)

    def __iter__(self):
        self._create_dataloader2()
        return iter(self.dataloader2)
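A hedged usage sketch, reusing the repro's setup: each iter() call now constructs a fresh DataLoader2, so both calls yield an element instead of hanging.

from torchdata.dataloader2 import MultiProcessingReadingService
from torch.utils.data.datapipes.iter.sharding import SHARDING_PRIORITIES
from torchdata.datapipes.iter import IterableWrapper

def main():
    dp = IterableWrapper([1, 2, 3, 4, 5, 6, 7] * 100).sharding_round_robin_dispatch(SHARDING_PRIORITIES.MULTIPROCESSING)
    rs = MultiProcessingReadingService(num_workers=2, main_prefetch_cnt=0, worker_prefetch_cnt=0)

    dataloader = DataLoader2Workaround(dp, rs)
    print(next(iter(dataloader)))  # 1 -- fresh DataLoader2 under the hood
    print(next(iter(dataloader)))  # 1 -- another fresh DataLoader2, no hang
    print("done")

if __name__ == "__main__":
    main()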
JohnHBrock commented 1 year ago

Possibly related to #1148.