Open johnpeterflynn opened 2 years ago
Thanks for reporting. I cannot really reproduce the problem with the following code. Am I doing something wrong?
import torch
import tqdm
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader

class MyDataset(torch.utils.data.Dataset):
    def __getitem__(self, idx):
        # Each item is one large feature matrix wrapped in a Data object.
        x = torch.ones((100000, 10))
        return Data(x=x)

    def __len__(self):
        return 1000

dataset = MyDataset()
loader = DataLoader(dataset, batch_size=1, num_workers=6,
                    persistent_workers=True, pin_memory=True)

for _ in tqdm.tqdm(range(1000)):
    for data in loader:
        pass
Nonetheless, I agree with the allocation strategy for shared memory, this is definitely needed! Will work on integrating this.
I added a potential fix in https://github.com/pyg-team/pytorch_geometric/pull/3401. Can you check if that fixes the issue for you?
Hmm, I upgraded to pyg 2.0.2 and that reproduction case no longer causes a problem. But some shared memory leak still appears to exist. I'm not sure how to reproduce this new version of the issue, but with 2.0.2 and num_workers > 0 my training loop crashes (after many epochs) with the following error (which is triggered by the system running out of shared memory):
Traceback (most recent call last):
  File "/home/flynn/miniconda3/envs/vcgenv5/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/home/flynn/miniconda3/envs/vcgenv5/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/home/flynn/miniconda3/envs/vcgenv5/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 319, in reduce_storage
    metadata = storage._sharefilename()
RuntimeError: unable to open shared memory object in read-write mode
My training loop is simple. I'm loading data into data.x and data.edge_index, giving it to a pyg DataLoader and using it to train a SAGEConv-based UNet architecture. However, I have NOT set persistent_workers=True.
Once I can find a reproducible case I'll report it here.
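Since the crash only shows up after many epochs, it may help to log shared memory usage once per epoch and see whether it grows steadily. A minimal sketch, assuming a Linux system where PyTorch's shared memory segments are backed by the /dev/shm mount; the helper names shm_usage_mib and log_shm are hypothetical and not part of PyTorch or PyG:

```python
import shutil


def shm_usage_mib(path="/dev/shm"):
    """Return (used, total) in MiB for the filesystem mounted at `path`.

    Assumption: on Linux, shared memory segments live under /dev/shm.
    """
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return usage.used / mib, usage.total / mib


def log_shm(epoch, path="/dev/shm"):
    """Print and return one log line with the current usage of `path`."""
    used, total = shm_usage_mib(path)
    line = f"epoch {epoch}: shm used {used:.1f} / {total:.1f} MiB"
    print(line)
    return line
```

Calling log_shm(epoch) at the end of each training epoch should show whether usage keeps climbing toward the limit or stays flat.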
Thank you! I will also try to make some progress on my end.
🐛 Bug
When a torch_geometric.loader.DataLoader object attempts to collate a batch of torch_geometric.data.Data objects and more than 0 workers are used, the DataLoader's collate function (torch_geometric.loader.Collator.collate()) allocates a Data object in such a way that shared memory is leaked. This causes the shared memory to eventually reach capacity.
To Reproduce
Steps to reproduce the behavior:
1. Return a torch_geometric.data.Data object in __getitem__(). For example, set data.x = torch.ones((100000, 10)).
2. Create a torch_geometric.loader.DataLoader object with num_workers > 0, pin_memory = True, batch_size = 1.
3. Iterate over the loader. The workers eventually fail with:
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/home/flynn/anaconda3/envs/vcgenv4/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/flynn/anaconda3/envs/vcgenv4/lib/python3.8/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/home/flynn/anaconda3/envs/vcgenv4/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/flynn/anaconda3/envs/vcgenv4/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3456922) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "train.py", line 80, in
    main(config)
  File "train.py", line 50, in main
    trainer.train()
  File "/home/flynn/workspace/thesis/remote/surface_conv/base/base_trainer.py", line 60, in train
    result = self._train_epoch(epoch)
  File "/home/flynn/workspace/thesis/remote/surface_conv/trainers/imagegraphcolortrainer.py", line 239, in _train_epoch
    for batch_idx, data in enumerate(self.data_loader.train_loader):
  File "/home/flynn/anaconda3/envs/vcgenv4/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/flynn/anaconda3/envs/vcgenv4/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1186, in _next_data
    idx, data = self._get_data()
  File "/home/flynn/anaconda3/envs/vcgenv4/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1142, in _get_data
    success, data = self._try_get_data()
  File "/home/flynn/anaconda3/envs/vcgenv4/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1003, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 3456922) exited unexpectedly
Expected behavior
Memory should be allocated in shared memory in the same style as the default_collate() function provided by PyTorch.
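To illustrate the kind of allocation meant here, the following is a rough sketch and not the actual PyTorch or PyG implementation. It assumes CPU tensors that share a dtype, and the helper name cat_into_shared is hypothetical. PyTorch's own collate code uses an internal storage helper for this; the sketch substitutes the public share_memory_() call, which may not be equally efficient:

```python
import torch
from torch.utils.data import get_worker_info


def cat_into_shared(tensors, dim=0):
    """Concatenate `tensors` along `dim`.

    Inside a DataLoader worker process, the result is written into a
    buffer that already lives in shared memory, so it should not need
    to be copied again when it is sent to the main process.
    """
    out = None
    if get_worker_info() is not None:
        # Only worker processes need a shared-memory output buffer.
        shape = list(tensors[0].shape)
        shape[dim] = sum(t.shape[dim] for t in tensors)
        out = torch.empty(shape, dtype=tensors[0].dtype).share_memory_()
    return torch.cat(tensors, dim=dim, out=out)
```

In the main process get_worker_info() returns None, so the function falls back to a plain torch.cat.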
Environment