Open JohnHBrock opened 1 year ago
A possible workaround is to wrap DataLoader2 so that __iter__ recreates it from scratch each time, rather than just resetting the existing DataLoader2 instance. For example, something like this:
from torchdata.dataloader2 import DataLoader2


class DataLoader2Workaround:
    """Wraps DataLoader2 and rebuilds it on every __iter__ call."""

    def __init__(self, datapipe, reading_service):
        self.datapipe = datapipe
        self.reading_service = reading_service
        self.dataloader2 = None

    def _create_dataloader2(self):
        self.dataloader2 = DataLoader2(self.datapipe, reading_service=self.reading_service)

    def __getattr__(self, attr):
        # Only reached for attributes missing on the wrapper; forward them to DataLoader2.
        if self.dataloader2 is None:
            self._create_dataloader2()
        return getattr(self.dataloader2, attr)

    def __iter__(self):
        # Build a fresh DataLoader2 instead of resetting the existing one.
        self._create_dataloader2()
        return iter(self.dataloader2)
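The recreate-on-iter idea can be illustrated without torchdata installed. The sketch below is only an illustration: FragileLoader and RecreateOnIter are hypothetical names, and FragileLoader is a stand-in that raises on a second iter() call, whereas the real DataLoader2 hangs instead of raising.

```python
class FragileLoader:
    """Hypothetical stand-in for a loader that cannot be iterated twice."""

    def __init__(self, data):
        self.data = list(data)
        self.iter_calls = 0

    def __iter__(self):
        self.iter_calls += 1
        if self.iter_calls > 1:
            # Stands in for the failed reset; the real loader hangs rather than raising.
            raise RuntimeError("cannot reset")
        return iter(self.data)


class RecreateOnIter:
    """Builds a fresh loader from a factory on every __iter__ call."""

    def __init__(self, factory):
        self.factory = factory
        self.loader = None

    def __iter__(self):
        # Never reuse the previous loader; always start from a new instance.
        self.loader = self.factory()
        return iter(self.loader)


loader = RecreateOnIter(lambda: FragileLoader(range(3)))
first_pass = list(loader)
second_pass = list(loader)
```

Passing a factory rather than a finished loader is a design choice of this sketch; the workaround above instead stores the datapipe and reading service and rebuilds from those.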
Possibly related to #1148.
🐛 Describe the bug
I'm aware torchdata isn't being maintained anymore, but thought I'd post this here for posterity:
When calling iter() twice on the same instance of DataLoader2, trying to iterate over the 2nd iterator results in a hang. One of the worker processes terminates due to an exception, "Can not reset while we are still waiting response for previous request", although this isn't obvious unless you run a debugger. This exception occurs when one of the workers calls nonblocking_next(). Once this worker dies, the data loader is deadlocked.
I noticed this when using Lightning with torchdata: Lightning's fit() will run a few iterations of the validation loop as a sanity check before training, then do a training loop, followed by the validation loop again. This 2nd validation loop never finishes because of the hang.
Code to reproduce:
This results in the output:
and nothing else. The data loader processes continue to run, except for the one terminating worker that I mentioned above.
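Because the hang is silent, it can help to have the main process report where it is stuck. The following is a minimal sketch using Python's standard faulthandler module; iterate_with_watchdog and timeout_s are hypothetical names, not part of torchdata. It only dumps the stacks of the calling process, so it shows where the main process is blocked but not the exception raised inside the worker.

```python
import faulthandler


def iterate_with_watchdog(iterable, timeout_s=60.0):
    """Yield items; dump all thread stacks to stderr if a step exceeds timeout_s."""
    # Guard the iter() call as well, since a reset may happen there.
    faulthandler.dump_traceback_later(timeout_s, exit=False)
    try:
        iterator = iter(iterable)
    finally:
        faulthandler.cancel_dump_traceback_later()

    while True:
        # Re-arm the watchdog for each item so slow epochs are not misreported.
        faulthandler.dump_traceback_later(timeout_s, exit=False)
        try:
            item = next(iterator)
        except StopIteration:
            return
        finally:
            faulthandler.cancel_dump_traceback_later()
        yield item
```

Wrapping the second loop as `for batch in iterate_with_watchdog(dataloader, timeout_s=30.0)` would print the main process's stack after 30 seconds without a batch, assuming `dataloader` is the DataLoader2 instance being iterated.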
Versions
Collecting environment information...
PyTorch version: 2.0.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 13.4.1 (x86_64)
GCC version: Could not collect
Clang version: 14.0.3 (clang-1403.0.22.14.1)
CMake version: version 3.26.4
Libc version: N/A
Python version: 3.8.17 (default, Jul 19 2023, 14:02:02) [Clang 14.0.3 (clang-1403.0.22.14.1)] (64-bit runtime)
Python platform: macOS-13.4.1-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU: Intel(R) Core(TM) i5-8279U CPU @ 2.40GHz
Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] torch==2.0.1
[pip3] torchdata==0.6.1
[conda] Could not collect