pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Memory leak in torch_geometric.loader.Collater when num_workers > 0 #3396

Open · johnpeterflynn opened this issue 2 years ago

johnpeterflynn commented 2 years ago

🐛 Bug

When a torch_geometric.loader.DataLoader object collates a batch of torch_geometric.data.Data objects and more than 0 workers are used, the DataLoader's collate function (torch_geometric.loader.Collater.collate()) allocates a Data object in such a way that shared memory is leaked. This causes shared memory to eventually reach capacity.
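For context, a simplified paraphrase of the collate path in question (abridged from the PyG 2.0-era Collater; not the verbatim source, and the defaults below are illustrative):

# Simplified paraphrase of torch_geometric.loader.Collater (PyG 2.0 era);
# branches for non-Data elements are omitted.
from torch_geometric.data import Batch, Data

class Collater:
    def __init__(self, follow_batch=None, exclude_keys=None):
        self.follow_batch = follow_batch or []
        self.exclude_keys = exclude_keys or []

    def __call__(self, batch):
        elem = batch[0]
        if isinstance(elem, Data):
            # Every attribute of the Data objects is concatenated into
            # freshly allocated (non-shared) tensors, even when running
            # inside a worker process.
            return Batch.from_data_list(batch, self.follow_batch,
                                        self.exclude_keys)
        raise TypeError(f'DataLoader found invalid type: {type(elem)}')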

To Reproduce

Steps to reproduce the behavior:

  1. Create a vanilla Dataset that returns a large torch_geometric.data.Data from __getitem__(). For example, set data.x = torch.ones((100000, 10)).
  2. Create a torch_geometric.loader.DataLoader object with num_workers > 0, pin_memory = True, batch_size = 1.
  3. Load the contents in an empty for loop (see the consolidated sketch below):
    for batch_idx, data in enumerate(dataloader):
        continue
  4. Watch the system memory grow until the program crashes.

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/home/flynn/anaconda3/envs/vcgenv4/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/flynn/anaconda3/envs/vcgenv4/lib/python3.8/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/home/flynn/anaconda3/envs/vcgenv4/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/flynn/anaconda3/envs/vcgenv4/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3456922) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "train.py", line 80, in <module>
    main(config)
  File "train.py", line 50, in main
    trainer.train()
  File "/home/flynn/workspace/thesis/remote/surface_conv/base/base_trainer.py", line 60, in train
    result = self._train_epoch(epoch)
  File "/home/flynn/workspace/thesis/remote/surface_conv/trainers/imagegraphcolortrainer.py", line 239, in _train_epoch
    for batch_idx, data in enumerate(self.data_loader.train_loader):
  File "/home/flynn/anaconda3/envs/vcgenv4/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/flynn/anaconda3/envs/vcgenv4/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1186, in _next_data
    idx, data = self._get_data()
  File "/home/flynn/anaconda3/envs/vcgenv4/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1142, in _get_data
    success, data = self._try_get_data()
  File "/home/flynn/anaconda3/envs/vcgenv4/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1003, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 3456922) exited unexpectedly
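Putting the steps above together, a consolidated reproduction sketch (the dataset class name, worker count, and dataset length are illustrative placeholders):

# Illustrative reproduction sketch assembled from the steps above.
import torch
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader

class BigDataset(torch.utils.data.Dataset):
    def __getitem__(self, idx):
        # Each sample carries a fresh ~4 MB feature tensor.
        return Data(x=torch.ones((100000, 10)))

    def __len__(self):
        return 1000

dataloader = DataLoader(BigDataset(), batch_size=1, num_workers=4,
                        pin_memory=True)

for batch_idx, data in enumerate(dataloader):
    continue  # watch system/shared memory usage grow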

Expected behavior

Memory should be allocated in shared memory in the same style as PyTorch's default_collate().
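For reference, the tensor branch of PyTorch's default_collate() follows roughly the pattern below (paraphrased rather than quoted verbatim, and the helper name collate_tensors is illustrative; details vary between PyTorch versions):

# Rough paraphrase of the tensor branch of PyTorch's default_collate().
import torch

def collate_tensors(batch):
    elem = batch[0]
    out = None
    if torch.utils.data.get_worker_info() is not None:
        # Inside a worker process: stack directly into a shared-memory
        # tensor so the main process does not need an extra copy.
        numel = sum(x.numel() for x in batch)
        storage = elem.storage()._new_shared(numel)
        out = elem.new(storage)
    return torch.stack(batch, 0, out=out)

Because the pre-allocated storage holds exactly numel elements, the stack writes into shared memory instead of allocating a fresh tensor inside the worker.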

Environment

rusty1s commented 2 years ago

Thanks for reporting. I cannot really reproduce the problem with the following code. Am I doing something wrong?

import torch
import tqdm
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader

class MyDataset(torch.utils.data.Dataset):
    def __getitem__(self, idx):
        x = torch.ones((100000, 10))
        return Data(x=x)

    def __len__(self):
        return 1000

dataset = MyDataset()
loader = DataLoader(dataset, batch_size=1, num_workers=6,
                   persistent_workers=True, pin_memory=True)

for _ in tqdm.tqdm(range(1000)):
    for data in loader:
        pass

Nonetheless, I agree on the shared-memory allocation strategy; this is definitely needed! Will work on integrating it.

rusty1s commented 2 years ago

I added a potential fix in https://github.com/pyg-team/pytorch_geometric/pull/3401. Can you check if that fixes the issue for you?
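For readers skimming the thread, the general idea, applied to concatenating a single attribute across a batch, might look roughly like the sketch below. The helper name cat_in_shared_memory is hypothetical, and this is not claimed to be the actual code of the linked PR; it only illustrates the default_collate()-style trick discussed above.

# Hypothetical helper illustrating shared-memory pre-allocation during
# collation; not the actual contents of the linked PR.
import torch

def cat_in_shared_memory(values, dim=0):
    elem = values[0]
    out = None
    if torch.utils.data.get_worker_info() is not None:
        # Inside a DataLoader worker: pre-allocate the result in shared
        # memory so it can be handed back to the main process without an
        # extra copy; torch.cat then writes into this storage.
        numel = sum(value.numel() for value in values)
        storage = elem.storage()._new_shared(numel)
        out = elem.new(storage)
    return torch.cat(values, dim=dim, out=out)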

johnpeterflynn commented 2 years ago

Hmm, I upgraded to PyG 2.0.2 and that reproduction case no longer causes a problem. But some shared-memory leak still appears to exist. I'm not sure how to reproduce this new version of the issue, but with 2.0.2 and num_workers > 0 my training loop crashes (after many epochs) with the following error, which is triggered by the system running out of shared memory:

Traceback (most recent call last):
  File "/home/flynn/miniconda3/envs/vcgenv5/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/home/flynn/miniconda3/envs/vcgenv5/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/home/flynn/miniconda3/envs/vcgenv5/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 319, in reduce_storage
    metadata = storage._share_filename_()
RuntimeError: unable to open shared memory object in read-write mode

My training loop is simple. I'm loading data into data.x and data.edge_index, giving it to a PyG DataLoader, and using it to train a SAGEConv-based UNet architecture. Note, however, that I have NOT set persistent_workers=True.

Once I find a reproduction case I'll report it here.

rusty1s commented 2 years ago

Thank you! I will also try to make some progress on my end.