pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Heterogeneous graph: NeighborLoader with num_workers>0 gets stuck after many epochs #5348

Open PolarisRisingWar opened 2 years ago

PolarisRisingWar commented 2 years ago

🐛 Describe the bug

My code is like this:

***The code for creating graph, GNN model***

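from torch_geometric.loader import NeighborLoader

# Sample 2 neighbors per hop for 2 hops, seeded from 'case' nodes.
train_loader = NeighborLoader(
    train_data,
    num_neighbors=[2] * 2,
    batch_size=train_batch_size,
    input_nodes='case',
    shuffle=True,
    num_workers=4,
)

# shuffle=True is intentional here: the test subgraphs should be random.
test_loader = NeighborLoader(
    test_data,
    num_neighbors=[2] * 2,
    batch_size=train_batch_size,
    input_nodes='case',
    shuffle=True,
    num_workers=4,
)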

***The code to train and test***

(I need the subgraphs sampled by test_loader to be random, so I set shuffle=True and use the n_id attribute to rearrange the predicted logits.) I used W&B to log the training loss and other metrics during training, and found that training got stuck after 80 min in one experiment and after 6 h in another: the curves stopped updating for about 2 hours. My only guess is that num_workers is the cause, because after I removed the num_workers parameter, the 22 h run finished successfully. Honestly, it is hard for me to trace the bug back and reproduce it, so all I can do is report the problem.
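Since the hang is hard to trace, one possible way to find out where the main process is blocked is Python's standard faulthandler module, which can dump all thread stacks if a step takes too long. This is only a minimal sketch, not something taken from the original code: iterate_with_watchdog, step, and timeout_s are hypothetical names, and the dump only covers the process that armed it, not the loader's worker processes.

```python
import faulthandler
import sys


def iterate_with_watchdog(loader, step, timeout_s=600.0, file=None):
    """Call step(batch) for every batch in loader and return the batch count.

    If fetching or processing a single batch takes longer than timeout_s
    seconds, the tracebacks of all threads in this process are written to
    `file` (sys.stderr by default). The program is not terminated.
    """
    out = file if file is not None else sys.stderr
    count = 0
    it = iter(loader)
    while True:
        # Arm the watchdog before fetching the next batch.
        faulthandler.dump_traceback_later(timeout_s, exit=False, file=out)
        try:
            try:
                batch = next(it)
            except StopIteration:
                break
            step(batch)
            count += 1
        finally:
            # Disarm the watchdog once the batch is done (or the loop ends).
            faulthandler.cancel_dump_traceback_later()
    return count
```

A hypothetical call would be iterate_with_watchdog(train_loader, train_step, timeout_s=600), where train_step stands for whatever function processes one batch. If the run stalls, the dumped traceback should at least show whether the main process is waiting on the loader or stuck inside the training step.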

Environment

rusty1s commented 2 years ago

Thanks for reporting. Do you have any intuition about what might cause this? Is there a memory leak, with memory requirements increasing over epochs? Any guidance is appreciated!
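One rough way to check whether memory grows over epochs is to record traced memory after each epoch with the standard tracemalloc module. The sketch below is an assumption-laden illustration: memory_per_epoch, grows_every_epoch, and run_epoch are hypothetical names, and tracemalloc only sees Python-level allocations in the main process, so it will not capture memory held by loader worker processes or by native tensor storage.

```python
import tracemalloc


def memory_per_epoch(run_epoch, num_epochs):
    """Run run_epoch(epoch) num_epochs times.

    Returns a list with the traced Python memory (in bytes) that is still
    allocated after each epoch.
    """
    tracemalloc.start()
    usage = []
    try:
        for epoch in range(num_epochs):
            run_epoch(epoch)
            current, _peak = tracemalloc.get_traced_memory()
            usage.append(current)
    finally:
        tracemalloc.stop()
    return usage


def grows_every_epoch(usage):
    """Return True if every measurement is strictly larger than the last."""
    return len(usage) > 1 and all(b > a for a, b in zip(usage, usage[1:]))
```

Steady growth across many epochs would be consistent with a leak in the main process; flat numbers would point elsewhere, for example at the worker processes, which need an OS-level tool to inspect.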

LukeLIN-web commented 2 years ago

My guess: could variables accumulating across the many workers lead to running out of memory?