snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

Predictor setting in inference for ogbl-vessel #359

Closed: skepsun closed this issue 2 years ago

skepsun commented 2 years ago

https://github.com/snap-stanford/ogb/blob/f5534d99703ab549ae4f7279f2002c6cc79041dc/examples/linkproppred/vessel/gnn.py#L139

I found that the predictor is not set to eval mode (predictor.eval()) in the test function of gnn.py, which may explain the poor GNN performance on this dataset. With predictor.eval() added, even with hidden size 3, the test ROC-AUC of GCN can reach 70+%, although it sometimes gets stuck at 50%.
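
For reference, a minimal sketch of the fix (the surrounding test function is paraphrased from the linked gnn.py; only the predictor.eval() line is new):

@torch.no_grad()
def test(model, predictor, data, split_edge, evaluator, batch_size):
    model.eval()
    predictor.eval()  # previously missing: without it, the predictor's dropout stays active at inference
    # ... rest of the evaluation code unchanged ...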

jqmcginnis commented 2 years ago

Hi @skepsun,

I am very grateful that you have discovered and shared this with us! Thank you so much!

Can you share the exact settings (e.g. python gnn.py --num_layers ...) you used for reaching 70+%?

If you want, you can implement the bugfix in a PR; otherwise I am happy to do it :)

skepsun commented 2 years ago

Hi @jqmcginnis, I just made a PR to fix it. The command to reproduce the possible 70+% results is:

python gnn.py --hidden_channel 3 --num_layer 2 --dropout 0.5 --lr 0.000001 --epochs 100

skepsun commented 2 years ago

I also discovered that swapping the positive and negative edges in training (by using pos_loss = -torch.log(1 - pos_out + 1e-15).mean() and neg_loss = -torch.log(neg_out + 1e-15).mean()) does not significantly affect the final results... And the training scores move inversely to the val/test scores: train goes down when val/test go up, and vice versa.
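
For context, the two formulations compare as follows (the first pair is, as far as I can tell, the standard loss in gnn.py; the second is the swap described above):

# standard formulation: push positive edges towards 1, negative edges towards 0
pos_loss = -torch.log(pos_out + 1e-15).mean()
neg_loss = -torch.log(1 - neg_out + 1e-15).mean()

# swapped formulation: the roles of positive and negative edges are inverted
pos_loss = -torch.log(1 - pos_out + 1e-15).mean()
neg_loss = -torch.log(neg_out + 1e-15).mean()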

jqmcginnis commented 2 years ago

> I also discovered that swapping the positive and negative edges in training (by using pos_loss = -torch.log(1 - pos_out + 1e-15).mean() and neg_loss = -torch.log(neg_out + 1e-15).mean()) does not significantly affect the final results... And the training scores move inversely to the val/test scores: train goes down when val/test go up, and vice versa.

@weihua916 do you see this as a viable option? I am also unsure how to deal with this situation in the best possible way. As you are very experienced and accomplished in this field, I would very much appreciate your thoughts on this.

To provide some more background, let us consider the output of the following script:

python gnn.py --hidden_channel 128 --num_layer 2 --dropout 0.0 --lr 0.0001 --epochs 100

Using the non-inverted training process we obtain:

Run: 01, Epoch: 01, Loss: 1.3865, Train: 0.6315, Valid: 0.2660, Test: 0.2671
Run: 01, Epoch: 02, Loss: 1.3862, Train: 0.6312, Valid: 0.2660, Test: 0.2671
Run: 01, Epoch: 03, Loss: 1.3857, Train: 0.6305, Valid: 0.2662, Test: 0.2672
...
Run: 01, Epoch: 10, Loss: 1.3191, Train: 0.6269, Valid: 0.2680, Test: 0.2689
Run: 01, Epoch: 11, Loss: 1.3131, Train: 0.6269, Valid: 0.2680, Test: 0.2689
Run: 01, Epoch: 12, Loss: 1.3080, Train: 0.6269, Valid: 0.2679, Test: 0.2689
Run: 01, Epoch: 13, Loss: 1.3028, Train: 0.6269, Valid: 0.2679, Test: 0.2689
Run: 01, Epoch: 14, Loss: 1.2975, Train: 0.6269, Valid: 0.2679, Test: 0.2689
Run: 01, Epoch: 15, Loss: 1.2917, Train: 0.6269, Valid: 0.2679, Test: 0.2689
Run: 01, Epoch: 16, Loss: 1.2855, Train: 0.6269, Valid: 0.2679, Test: 0.2689
...
Run: 01, Epoch: 21, Loss: 1.2502, Train: 0.6675, Valid: 0.2918, Test: 0.2928
Run: 01, Epoch: 22, Loss: 1.2438, Train: 0.6659, Valid: 0.3057, Test: 0.3067
Run: 01, Epoch: 23, Loss: 1.2385, Train: 0.6681, Valid: 0.2975, Test: 0.2985
Run: 01, Epoch: 24, Loss: 1.2339, Train: 0.6690, Valid: 0.2999, Test: 0.3009
Run: 01, Epoch: 25, Loss: 1.2301, Train: 0.6705, Valid: 0.3072, Test: 0.3082
Run: 01, Epoch: 26, Loss: 1.2271, Train: 0.6709, Valid: 0.3037, Test: 0.3047
Run: 01, Epoch: 27, Loss: 1.2246, Train: 0.6708, Valid: 0.3040, Test: 0.3050
Run: 01, Epoch: 28, Loss: 1.2225, Train: 0.6714, Valid: 0.3087, Test: 0.3098
Run: 01, Epoch: 29, Loss: 1.2209, Train: 0.6722, Valid: 0.3103, Test: 0.3114
Run: 01, Epoch: 30, Loss: 1.2195, Train: 0.6718, Valid: 0.3084, Test: 0.3094
Run: 01, Epoch: 31, Loss: 1.2183, Train: 0.6728, Valid: 0.3098, Test: 0.3108
Run: 01, Epoch: 32, Loss: 1.2173, Train: 0.6721, Valid: 0.3118, Test: 0.3128
Run: 01, Epoch: 33, Loss: 1.2164, Train: 0.6723, Valid: 0.3129, Test: 0.3139
...

and the same script and settings with the inverted loss @skepsun described yield the following output:

Run: 01, Epoch: 01, Loss: 1.3871, Train: 0.3686, Valid: 0.7340, Test: 0.7329
Run: 01, Epoch: 02, Loss: 1.3863, Train: 0.3688, Valid: 0.7340, Test: 0.7329
Run: 01, Epoch: 03, Loss: 1.3857, Train: 0.3694, Valid: 0.7338, Test: 0.7328
...
Run: 01, Epoch: 09, Loss: 1.3286, Train: 0.3731, Valid: 0.7320, Test: 0.7310
Run: 01, Epoch: 10, Loss: 1.3176, Train: 0.3732, Valid: 0.7320, Test: 0.7310
...
Run: 01, Epoch: 21, Loss: 1.2470, Train: 0.3319, Valid: 0.7095, Test: 0.7085
Run: 01, Epoch: 22, Loss: 1.2413, Train: 0.3320, Valid: 0.7019, Test: 0.7010
Run: 01, Epoch: 23, Loss: 1.2364, Train: 0.3318, Valid: 0.7026, Test: 0.7017
Run: 01, Epoch: 24, Loss: 1.2321, Train: 0.3303, Valid: 0.6966, Test: 0.6956
...
weihua916 commented 2 years ago

Interesting. It looks like the model optimization quickly gets stuck in a local minimum. I'd suggest you at least make sure your model can overfit (nearly 1.0 ROC-AUC) towards the end of training. Also, I feel the absolute 3D coordinates are not appropriate as input to your model. Using the relative displacements (x1, y1, z1) - (x2, y2, z2) as edge features makes more sense. In any case, this is just a baseline; for now, please make sure that the dataset itself is correct, and the community will figure out the best way to tackle this problem.
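
A minimal sketch of that displacement idea, assuming node coordinates x of shape (num_nodes, 3) and an edge_index of shape (2, num_edges); the function name is illustrative, not from gnn.py:

import torch

def edge_displacements(x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
    # use per-edge displacement vectors instead of absolute 3D coordinates;
    # these are invariant to translating the whole vessel graph
    src, dst = edge_index
    return x[src] - x[dst]  # shape: (num_edges, 3)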

jqmcginnis commented 2 years ago

> I also discovered that swapping the positive and negative edges in training (by using pos_loss = -torch.log(1 - pos_out + 1e-15).mean() and neg_loss = -torch.log(neg_out + 1e-15).mean()) does not significantly affect the final results... And the training scores move inversely to the val/test scores: train goes down when val/test go up, and vice versa.

I discussed this with other students in my group, and we decided against employing this trick for the leaderboard submissions, for the following reasons:

  1. GCN and GraphSAGE have a very hard time overfitting to the training set, as neither can tackle the problem adequately (given the current problem formulation). Moreover, they generalize very poorly (worse than random chance). We believe swapping the pos and neg loss would create the impression that GCN and GraphSAGE perform unrealistically well on this dataset, when in fact we do not trust the model at all. To me it would amount to intentionally training a bad model and then doing the opposite of its prediction.
  2. I tried multiple models during hyperparameter tuning, but I could not find one whose ROC-AUC approached 1.00. People with more GPU RAM (we had an Nvidia Quadro RTX 8000 Ti with 48 GB) might test other hyperparameters as well, e.g. more hidden channels (>256, such as 512 or 1024) or more layers (num_layers > 3). However, I feel that in the current scenario (with X, Y, Z coordinates) this is the best the GNN can do.
  3. If we look at SEAL, which uses the node labeling trick, it already outperforms them by high margins, even without using features :slightly_smiling_face:, so it is possible to perform better than random guessing :slightly_smiling_face:. We feel it is an interesting (yet challenging) task, but it can be accomplished, making this a very nice dataset for the community to tackle. We also encourage the community to test ideas and concepts such as @weihua916's relative displacement trick. (A rough sketch of SEAL's labeling trick follows this list.)
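
Roughly, SEAL's double-radius node labeling (DRNL) tags each node in the extracted subgraph with its shortest-path distances to the two endpoints of the candidate link. A sketch of the labeling formula from the SEAL paper (the function name is mine; the distance computation is omitted):

def drnl_label(dist_src: int, dist_dst: int) -> int:
    # DRNL (Zhang & Chen, 2018): combine a node's hop distances to the two
    # target endpoints into a single integer structural label
    d = dist_src + dist_dst
    return 1 + min(dist_src, dist_dst) + (d // 2) * (d // 2 + d % 2 - 1)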

Lastly, we love hearing your ideas and tricks for improving ogbl-vessel and its algorithms, and are happy to discuss any questions you might have.

Thank you very much for the feedback!

Cheers, Julian

weihua916 commented 2 years ago

Thank you Julian!

It'd be cool to see on the leaderboard how SEAL performs. Also, we should keep in mind that ROC-AUC is often an optimistic measure for link prediction: you can get 99.9% ROC-AUC while achieving only 10% Hits@50. The score really depends on how difficult the negative examples are. It is good to keep this in mind when we assess ROC-AUC scores.
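
A toy illustration of that gap (made-up scores, not the OGB evaluator): with many easy negatives and a few hard ones, ROC-AUC is nearly perfect while Hits@K is zero.

import torch

# 4 positives, 1000 easy negatives, plus 3 hard negatives that outscore every positive
pos = torch.tensor([0.70, 0.71, 0.72, 0.73])
neg = torch.cat([torch.rand(1000) * 0.5, torch.tensor([0.90, 0.91, 0.92])])

# ROC-AUC: probability that a random positive outscores a random negative
auc = (pos.unsqueeze(1) > neg.unsqueeze(0)).float().mean()

# Hits@2: fraction of positives scoring above the 2nd-highest negative
kth_neg = neg.topk(2).values[-1]
hits_at_2 = (pos > kth_neg).float().mean()

print(f"ROC-AUC ~ {auc.item():.3f}, Hits@2 = {hits_at_2.item():.3f}")  # ~0.997 vs 0.000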

jqmcginnis commented 2 years ago

@weihua916 thank you very much for your comment!

I am still waiting for the final SEAL results (with 10 runs); the algorithm is comparatively slow, but we're getting there :slightly_smiling_face:

Thank you very much for bringing the choice of ROC-AUC as the evaluation metric to our attention again; we're eager to look into all of these topics with ogbl-vessel, and are curious what the community thinks and implements!

skepsun commented 2 years ago

@jqmcginnis Thanks for updating SEAL results!

I have a question about the train scores during training. I tried GCN without any tricks and reached 70% val/test ROC-AUC several times (though not always) with ~35% train ROC-AUC. I also implemented GCN+NeighborSampler with DGL, which stably reaches 73% val/test ROC-AUC with ~35% train ROC-AUC. I am very curious whether SEAL reached 80% val/test ROC-AUC with <50% train ROC-AUC.

jqmcginnis commented 2 years ago

@skepsun happy to hear you are still working on this! :slightly_smiling_face:

Yes, SEAL_OGB is able to perform similarly well on the train set; e.g., this is the report after the first training epoch:

Command line input: python seal_link_pred_train.py --dataset ogbl-vessel --use_feature

SortPooling k is set to 10
100%|███████████████████████| 267295/267295 [1:49:55<00:00, 40.53it/s]
100%|███████████████████████| 267295/267295 [1:28:08<00:00, 50.54it/s]
100%|███████████████████████████| 33412/33412 [12:24<00:00, 44.89it/s]
100%|███████████████████████████| 33412/33412 [11:37<00:00, 47.91it/s]
Run: 01, Epoch: 01, Loss: 0.5186, Train: 80.76%, Valid: 80.82%, Test: 80.79%

The SEAL version on the SEAL_OGB master branch does not compute training scores by default; however, if you would like to run it yourself and also track the training process, feel free to use my custom implementation, i.e. the ogbl-vessel branch in my fork, which also calculates the training scores :slightly_smiling_face:

I've also noticed that the OGB leaderboard has received another submission ("SAGE+JKNet") which seems to achieve similar ROC-AUC scores, so I do think the simplicity of GCN and SAGE might be the problem.

Happy to hear your feedback!