pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Node2vec: NaN value on loss computation due to random walk tensor shape (graph classification) #8813

Open NiaJ3oE2LM opened 10 months ago

NiaJ3oE2LM commented 10 months ago

🐛 Describe the bug

While working on a graph classification task, I got 'nan' values in the loss computed through the standard node2vec model loader. The random walk sampling methods were returning an elongated tensor that broke the model's loss computation.

I am using the DHFR graph collection, loaded with the TUDataset class. To work around this, I had to modify the module nn.models.node2vec.py and set dim=1 (instead of dim=0) in the torch.cat call that builds the tensor returned by the methods pos_sample and neg_sample (currently lines 120 and 134).

Since these lines of node2vec.py are quite old, I suspect the behavior is caused by a mistake on my side in how I feed the data to the loader. If that is the case, I probably missed something in the documentation, and I would be grateful if you could point me in the right direction.
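To make the dim=0 versus dim=1 difference concrete, here is a minimal sketch of how the choice of concatenation axis changes the shape of the sampled walks. It is not the library's code: plain nested lists stand in for 2-D tensors so that it runs without PyTorch, and the helper names (shape, cat_dim0, cat_dim1, sliding_windows) and the toy node ids are hypothetical. It assumes the walks are cut into equally sized context windows before being concatenated.

```python
# Nested lists stand in for 2-D tensors of shape (rows, columns).

def shape(matrix):
    """Return (number of rows, number of columns) of a nested list."""
    return (len(matrix), len(matrix[0]))

def cat_dim0(blocks):
    """Mimic torch.cat(blocks, dim=0): stack the rows of every block."""
    return [row for block in blocks for row in block]

def cat_dim1(blocks):
    """Mimic torch.cat(blocks, dim=1): extend each row with the matching rows."""
    return [sum((block[i] for block in blocks), []) for i in range(len(blocks[0]))]

def sliding_windows(walks, context_size):
    """Cut every walk into overlapping windows of length context_size."""
    walk_length = len(walks[0])
    return [
        [walk[j:j + context_size] for walk in walks]
        for j in range(walk_length - context_size + 1)
    ]

# Two hypothetical random walks, each visiting four node ids.
walks = [[0, 1, 2, 3], [4, 5, 6, 7]]
windows = sliding_windows(walks, context_size=3)  # two blocks of shape (2, 3)

rows = cat_dim0(windows)  # more rows, each still one context window
cols = cat_dim1(windows)  # same number of rows, each row elongated

print(shape(rows), shape(cols))
```

With dim=0 every row stays a single context window, while dim=1 glues all windows of one walk into one long row. Printing the shapes this way is also a quick check of which layout the loss computation actually receives.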

Possibly related discussion: #1437

Versions

PyTorch version: 2.1.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

GCC version: (GCC) 13.2.1 20230801
Clang version: 16.0.6
CMake version: version 3.28.1
Libc version: glibc-2.38

Python version: 3.11.6 (main, Nov 14 2023, 09:36:21) [GCC 13.2.1 20230801] (64-bit runtime)
Python platform: Linux-6.7.0-arch3-1-x86_64-with-glibc2.38
Is CUDA available: True
CUDA runtime version: 12.3.103
CUDA_MODULE_LOADING set to: LAZY
...
Nvidia driver version: 545.29.06
cuDNN version: Probably one of the following:
/usr/lib/libcudnn.so.8.9.7
/usr/lib/libcudnn_adv_infer.so.8.9.7
/usr/lib/libcudnn_adv_train.so.8.9.7
/usr/lib/libcudnn_cnn_infer.so.8.9.7
/usr/lib/libcudnn_cnn_train.so.8.9.7
/usr/lib/libcudnn_ops_infer.so.8.9.7
/usr/lib/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
...
Versions of relevant libraries:
[pip3] numpy==1.26.2
[pip3] torch==2.1.1
[pip3] torch-cluster==1.6.3+pt21cu121
[pip3] torch_geometric==2.4.0
[pip3] triton==2.1.0
[conda] Could not collect

rusty1s commented 10 months ago

Hey, can you clarify the cat(dim=...) issue? What are the final shapes of the tensors in the walks list? Do you have a small reproducible example that shows the behavior you observe?

NiaJ3oE2LM commented 10 months ago

Hello, yes, I can upload a synthetic example, but it will take me some time. I will also publish the project where I first encountered this behavior in a few days.

rusty1s commented 10 months ago

It would be easier for me if I could reproduce this on a small example. If that's possible, I would appreciate your effort :)

theredchild commented 2 months ago

@NiaJ3oE2LM I was facing the same issue. After I decreased the learning rate, it works fine. My dataset was different, though. You can try that.
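As a generic illustration of why a smaller learning rate can make 'nan' values disappear, here is a toy sketch of plain gradient descent on f(x) = x^2. It is not node2vec and does not show what happened in either report. The function name descend and the chosen step sizes are hypothetical. With too large a step the iterate grows until it overflows to infinity, and the next update computes inf - inf, which is nan.

```python
import math

def descend(learning_rate, steps, start=1.0):
    """Run gradient descent on f(x) = x**2, whose gradient is 2 * x."""
    x = start
    for _ in range(steps):
        gradient = 2 * x
        x = x - learning_rate * gradient
    return x

# Step too large: each update multiplies x by -2, so it overflows and turns into nan.
diverged = descend(learning_rate=1.5, steps=2000)

# Smaller step: each update multiplies x by 0.8, so x shrinks towards the minimum at 0.
converged = descend(learning_rate=0.1, steps=200)

print(math.isnan(diverged), abs(converged) < 1e-6)
```

If lowering the learning rate removes the nan values, the loss was probably diverging numerically. That would be a different cause from a tensor with the wrong shape, so it is worth checking both.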