PolarisRisingWar opened this issue 2 years ago
🐛 Describe the bug
My code builds `test_loader` with `shuffle=True`, because I need the subgraphs it samples to be random, and then uses the `n_id` attribute to rearrange the predicted logits back into node order.

I used W&B to log the train losses and other metrics during training, but I found that the run gets stuck: after 80 minutes in one experiment and after 6 hours in another, the curves stopped updating for about 2 hours. I can only think `num_workers` is the cause, because after I deleted the `num_workers` parameter, the 22-hour run finished successfully. Honestly, it's hard for me to trace this bug back and reproduce it, so I can only report the problem as it is.
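A minimal sketch of this kind of setup, assuming a PyG version in which each `NeighborLoader` batch exposes `n_id` (the global indices of its sampled nodes); the dataset, the model call, and all loader arguments here are placeholders, not the actual code from this issue:

```python
import torch
from torch_geometric.datasets import Planetoid
from torch_geometric.loader import NeighborLoader

# Placeholder dataset; the real graph from this issue is not shown.
dataset = Planetoid(root='data', name='Cora')
data = dataset[0]

# shuffle=True so the sampled subgraphs are random; num_workers is the
# parameter suspected of causing the hang.
test_loader = NeighborLoader(
    data,
    num_neighbors=[10, 10],
    batch_size=128,
    input_nodes=data.test_mask,
    shuffle=True,
    num_workers=4,  # deleting this argument let the 22h run finish
)

@torch.no_grad()
def predict(model):
    # Write each batch's logits back through batch.n_id so the result
    # ends up in global node order despite the shuffled batches.
    logits = torch.empty(data.num_nodes, dataset.num_classes)
    for batch in test_loader:
        out = model(batch.x, batch.edge_index)
        # The first `batch.batch_size` rows belong to the seed nodes.
        seed = batch.n_id[:batch.batch_size]
        logits[seed] = out[:batch.batch_size]
    return logits
```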
Environment

- How you installed PyTorch and PyG (conda, pip, source):
  - PyTorch: `conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=10.2 -c pytorch`
  - PyG:
- Any other relevant information (e.g., version of `torch-scatter`): torch-scatter 2.0.9, torch-sparse 0.6.14

Thanks for reporting. Do you have some intuition about what might cause this? Is there a memory leak, with memory requirements increasing over epochs? Any guidance appreciated!

Maybe many workers accumulating variables leads to running out of memory? I guess.
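One generic way to test that out-of-memory guess (a diagnostic sketch, not code from this issue; `psutil` is an extra dependency) is to log the combined resident memory of the training process and its DataLoader workers once per epoch, next to the existing W&B logging:

```python
import os
import psutil

def log_memory(epoch: int) -> None:
    # Sum the resident set size of the main process and all of its
    # children (the DataLoader worker processes).
    main = psutil.Process(os.getpid())
    rss = main.memory_info().rss
    for child in main.children(recursive=True):
        try:
            rss += child.memory_info().rss
        except psutil.NoSuchProcess:
            pass  # a worker exited between listing and inspection
    print(f"epoch {epoch}: total RSS = {rss / 2**20:.1f} MiB")
```

If the workers were leaking, this total should grow steadily from epoch to epoch and spike shortly before the run stalls.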