Node2vec : NaN value on loss computation due to random walk tensor shape (graph classification)

NiaJ3oE2LM commented 7 months ago

🐛 Describe the bug

When working on a graph classification task, I got 'nan' values in the loss computed through the standard node2vec model loader. The random walk sampling methods were returning an elongated tensor that broke the model's loss computation.

I am using the DHFR graph collection, loaded with the TUDataset class. To work around this, I had to modify the module nn.models.node2vec.py and set dim=1 (instead of dim=0) in the torch.cat call that builds the tensor returned by the methods pos_sample and neg_sample (currently lines 120 and 134).
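To make the shape difference concrete, here is a minimal sketch using plain torch only. The toy walk tensor, the sizes, and the variable names are made up for illustration; this is not the code from node2vec.py, just the general effect of concatenating sliding windows along dim=0 versus dim=1.

```python
import torch

# Hypothetical toy input: 3 random walks, each visiting 4 node ids.
num_walks, walk_len, context_size = 3, 4, 2
rw = torch.arange(num_walks * walk_len).view(num_walks, walk_len)

# Slide a window of `context_size` over every walk.
# Each window has shape [num_walks, context_size].
windows = [rw[:, j:j + context_size]
           for j in range(walk_len - context_size + 1)]

# dim=0: windows are stacked as extra rows (one context window per row).
stacked_rows = torch.cat(windows, dim=0)
# dim=1: windows are laid side by side, so each row grows wider.
stacked_cols = torch.cat(windows, dim=1)

print(stacked_rows.shape)  # torch.Size([9, 2])
print(stacked_cols.shape)  # torch.Size([3, 6])
```

With dim=0 the result keeps context_size columns and gains rows; with dim=1 the row count stays at the number of walks and the columns grow instead.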

Since these lines of node2vec.py are quite old, I suspect this behavior comes from a mistake on my side in how I feed the data to the loader. If that is the case, I probably missed something in the documentation, and I would be grateful if you could point me in the right direction.

Possibly related discussion: #1437

Versions

PyTorch version: 2.1.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

GCC version: (GCC) 13.2.1 20230801
Clang version: 16.0.6
CMake version: version 3.28.1
Libc version: glibc-2.38

Python version: 3.11.6 (main, Nov 14 2023, 09:36:21) [GCC 13.2.1 20230801] (64-bit runtime)
Python platform: Linux-6.7.0-arch3-1-x86_64-with-glibc2.38
Is CUDA available: True
CUDA runtime version: 12.3.103
CUDA_MODULE_LOADING set to: LAZY
...
Nvidia driver version: 545.29.06
cuDNN version: Probably one of the following:
/usr/lib/libcudnn.so.8.9.7
/usr/lib/libcudnn_adv_infer.so.8.9.7
/usr/lib/libcudnn_adv_train.so.8.9.7
/usr/lib/libcudnn_cnn_infer.so.8.9.7
/usr/lib/libcudnn_cnn_train.so.8.9.7
/usr/lib/libcudnn_ops_infer.so.8.9.7
/usr/lib/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
...
Versions of relevant libraries:
[pip3] numpy==1.26.2
[pip3] torch==2.1.1
[pip3] torch-cluster==1.6.3+pt21cu121
[pip3] torch_geometric==2.4.0
[pip3] triton==2.1.0
[conda] Could not collect

rusty1s commented 7 months ago

Hey, can you clarify the cat(dim=...) issue? What are the final shapes in the walks list? Do you have a small reproducible example that shows the behavior you observe?

NiaJ3oE2LM commented 7 months ago

Hello, yes, I can upload a synthetic example, but it will take me some time. In a few days I will also publish the project where I first encountered this behavior.

rusty1s commented 7 months ago

It would be somewhat easier for me to just be able to reproduce this on a small example. If that's possible, I would appreciate your effort :)

pyg-team / pytorch_geometric

Node2vec : NaN value on loss computation due to random walk tensor shape (graph classification) #8813