PolarisRisingWar opened this issue 2 years ago
🐛 Describe the bug
My code builds `test_loader` with `shuffle=True`, because I need the subgraphs it samples to be random, and then uses the `n_id` attribute to rearrange the predicted logits back into node order.

I used W&B to log the train losses and other metrics during training, but I found that the run gets stuck: after 80 minutes in one experiment and after 6 hours in another, the curves stopped updating for about 2 hours. I can only think `num_workers` is the cause, because after I deleted the `num_workers` parameter, the 22-hour run finished successfully. Honestly, it's hard for me to trace this bug back and reproduce it, so I can only report the problem as it is.
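A minimal sketch of this kind of setup, assuming a PyG version in which each `NeighborLoader` batch exposes `n_id` (the global indices of its sampled nodes); the dataset, the model call, and all loader arguments here are placeholders, not the actual code from this issue:

```python
import torch
from torch_geometric.datasets import Planetoid
from torch_geometric.loader import NeighborLoader

# Placeholder dataset; the real graph from this issue is not shown.
dataset = Planetoid(root='data', name='Cora')
data = dataset[0]

# shuffle=True so the sampled subgraphs are random; num_workers is the
# parameter suspected of causing the hang.
test_loader = NeighborLoader(
    data,
    num_neighbors=[10, 10],
    batch_size=128,
    input_nodes=data.test_mask,
    shuffle=True,
    num_workers=4,  # deleting this argument let the 22h run finish
)

@torch.no_grad()
def predict(model):
    # Write each batch's logits back through batch.n_id so the result
    # ends up in global node order despite the shuffled batches.
    logits = torch.empty(data.num_nodes, dataset.num_classes)
    for batch in test_loader:
        out = model(batch.x, batch.edge_index)
        # The first `batch.batch_size` rows belong to the seed nodes.
        seed = batch.n_id[:batch.batch_size]
        logits[seed] = out[:batch.batch_size]
    return logits
```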
Environment

- How you installed PyTorch and PyG (conda, pip, source):
  - PyTorch: `conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=10.2 -c pytorch`
  - PyG:
- Any other relevant information (e.g., version of `torch-scatter`): torch-scatter 2.0.9, torch-sparse 0.6.14

Thanks for reporting. Do you have some intuition about what might cause this? Is there a memory leak, with memory requirements increasing over epochs? Any guidance appreciated!

Maybe many workers accumulating variables leads to running out of memory? I guess.
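One generic way to test that out-of-memory guess (a diagnostic sketch, not code from this issue; `psutil` is an extra dependency) is to log the combined resident memory of the training process and its DataLoader workers once per epoch, next to the existing W&B logging:

```python
import os
import psutil

def log_memory(epoch: int) -> None:
    # Sum the resident set size of the main process and all of its
    # children (the DataLoader worker processes).
    main = psutil.Process(os.getpid())
    rss = main.memory_info().rss
    for child in main.children(recursive=True):
        try:
            rss += child.memory_info().rss
        except psutil.NoSuchProcess:
            pass  # a worker exited between listing and inspection
    print(f"epoch {epoch}: total RSS = {rss / 2**20:.1f} MiB")
```

If the workers were leaking, this total should grow steadily from epoch to epoch and spike shortly before the run stalls.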