snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

Bug in RGNN author features preprocessing #165

Closed · hengdashi closed this issue 3 years ago

hengdashi commented 3 years ago

In the for loop below print("generating author features...") in rgnn.py, executing del inputs (before del outputs) yields a segmentation fault if the --in-memory option is not enabled.
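For context, the loop in question has roughly this shape (a paraphrased sketch with tiny synthetic arrays and illustrative names, not the exact rgnn.py code; in the real script the paper features come from a memmap over the full MAG240M data unless --in-memory is set):

import numpy as np
import torch
from torch_sparse import SparseTensor

# Tiny stand-ins for the MAG240M arrays (illustrative only).
num_papers, num_authors, num_features, chunk = 4, 3, 8, 4
paper_feat = np.random.rand(num_papers, num_features).astype(np.float16)
author_feat = np.zeros((num_authors, num_features), dtype=np.float16)

# author -> paper adjacency ("writes" edges).
adj_t = SparseTensor(row=torch.tensor([0, 1, 1, 2]),
                     col=torch.tensor([0, 0, 2, 3]),
                     sparse_sizes=(num_authors, num_papers))

# Column-chunked mean aggregation of paper features onto authors.
for i in range(0, num_features, chunk):
    j = min(i + chunk, num_features)
    inputs = torch.from_numpy(paper_feat[:, i:j].astype(np.float32))
    outputs = adj_t.matmul(inputs, reduce='mean').numpy()
    del inputs                      # <- the reported segfault happens here
    author_feat[:, i:j] = outputs
    del outputs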

rusty1s commented 3 years ago

Can you give a little bit more information? The pre-processing should be independent of whether one uses --in-memory or not. And, as far as I can see, del inputs does no harm since outputs has already been computed.

hengdashi commented 3 years ago

Yes, I was trying to reproduce the RGNN performance on the MAG240M dataset, but in the preprocessing stage I got a segmentation fault on the line del inputs while generating author features.

Since del inputs comes directly after outputs = adj_t.matmul(inputs, reduce='mean').numpy(), I'm guessing that outputs, the matrix product of adj_t and inputs, might be computed lazily, in which case del inputs would affect the values in outputs (of course, I could be totally wrong).

I also tried to get a backtrace of the crash with gdb; here's the log:

https://pastebin.com/yRrMa4Fx

FYI, my system config is as follows:

python 3.8.8
torch 1.8.1
torch-geometric (main branch)
numpy 1.19.2

rusty1s commented 3 years ago

matmul is performed eagerly; it isn't a lazy op. I think it is safe to delete inputs afterwards. The segmentation fault might have occurred for a different reason.
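For example, in isolation the pattern is fine (a minimal standalone sketch, not taken from rgnn.py):

import torch
from torch_sparse import SparseTensor

adj_t = SparseTensor(row=torch.tensor([0, 1, 1]),
                     col=torch.tensor([0, 0, 1]),
                     sparse_sizes=(2, 2))
inputs = torch.randn(2, 8)

# matmul materializes the result immediately; .numpy() then shares memory
# with that new tensor, not with inputs, so deleting inputs is harmless.
outputs = adj_t.matmul(inputs, reduce='mean').numpy()
del inputs
print(outputs.sum())  # outputs is still fully valid here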

hengdashi commented 3 years ago

Any idea what might be causing it? Or would you mind sharing your exact environment config, so I can check whether it's a bug in the newest versions of torch or numpy?

rusty1s commented 3 years ago

I'm not sure why it does not work for you, TBH. You said that it works for you when using the --in-memory option, but the pre-processing is identical in both cases. You can also just use the fully pre-processed node-feature matrix from here.

My config is:

pytorch-lightning==1.2.0rc1
pytorch==1.7.1
torch-geometric==1.7.0
numpy==1.20.1
jwkirchenbauer commented 3 years ago

Piggybacking on this to note that I have also gotten seg faults / core dumps, but in my case when actually using the --in-memory option with the gnn.py-based models SAGE and GAT, i.e. with --in-memory the fault occurs. I will follow up with a trace/dump if I can.

I was wondering whether it might be because of torch>1.7, which @hengdashi and I both have.

I may try downgrading my venv; it's just finicky to get PyG, PyTorch, OGB, and CUDA to all agree, so I stopped at the first working combination, which was torch==1.8+cu111.

env:

pip list

torch                        1.8.0+cu111
torch-geometric                    1.7.0
torch-scatter                      2.0.6
torch-sparse                       0.6.9
...
nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Tue_Sep_15_19:10:02_PDT_2020
Cuda compilation tools, release 11.1, V11.1.74
Build cuda_11.1.TC455_06.29069683_0
...
Python 3.6.10 |Anaconda, Inc.| (default, May  8 2020, 02:54:21) 
[GCC 7.3.0] on linux
>>>