snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

Inconsistent evaluation on the ogbl-collab dataset #457

Closed · Barcavin closed this 10 months ago

Barcavin commented 10 months ago

Hi,

According to the evaluation rules (https://ogb.stanford.edu/docs/leader_rules/#:~:text=The%20only%20exception,the%20validation%20labels.), ogbl-collab allows using the validation set during model training. However, the example code (https://github.com/snap-stanford/ogb/blob/master/examples/linkproppred/collab/gnn.py) seems to use the validation set only for inference rather than for training. After using the validation edges as training edges, vanilla SAGE can reach 68+ Hits@50.
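For concreteness, here is a minimal sketch (not the exact code from either repository) of the two ways the validation edges can be used, assuming the PyG data objects returned by PygLinkPropPredDataset:

```python
import torch
from ogb.linkproppred import PygLinkPropPredDataset

dataset = PygLinkPropPredDataset(name='ogbl-collab')
data = dataset[0]
split_edge = dataset.get_edge_split()

train_edge = split_edge['train']['edge']  # shape [num_train_edges, 2]
valid_edge = split_edge['valid']['edge']  # shape [num_valid_edges, 2]

# (a) Original example code (--use_valedges_as_input): validation edges
#     only enlarge the message-passing graph used at inference time.
full_edge_index = torch.cat(
    [data.edge_index, valid_edge.t(), valid_edge.t().flip(0)], dim=1)

# (b) Variant in this issue: validation edges are additionally appended to
#     the positive supervision edges the training loss is computed on.
train_pos_edge = torch.cat([train_edge, valid_edge], dim=0)
```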

The implementation can be found here (https://github.com/Barcavin/ogb/tree/val_as_input_collab/examples/linkproppred/collab). In fact, GCN can reach 69.45 ± 0.52 and SAGE can reach 68.20 ± 0.35. The differences between this implementation and the original example code are:

  1. Use the validation edges as both training supervision and message-passing edges.
  2. Use only one GNN layer.
  3. Score edges with the inner product of the node representations rather than a Hadamard product followed by an MLP.
  4. Train for 2000 epochs.

I believe the most critical trick for making the model perform well is the learnable node embedding used in place of the node attributes (sketched below). To reproduce, please run python gnn.py --use_valedges_as_input [--use_sage]
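A minimal sketch of the resulting model (the hidden size and helper names are assumptions, not the exact code from the linked branch):

```python
import torch
import torch.nn as nn
from torch_geometric.nn import SAGEConv

class OneLayerSAGE(nn.Module):
    def __init__(self, num_nodes, hidden_dim=256):
        super().__init__()
        # Learnable embedding per node, used in place of the provided
        # 128-dimensional node attributes.
        self.emb = nn.Embedding(num_nodes, hidden_dim)
        self.conv = SAGEConv(hidden_dim, hidden_dim)  # a single GNN layer

    def forward(self, edge_index):
        return self.conv(self.emb.weight, edge_index)

def link_score(h, edge):
    # Plain inner product of the endpoint representations, rather than a
    # Hadamard product followed by an MLP predictor.
    return (h[edge[:, 0]] * h[edge[:, 1]]).sum(dim=-1)
```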

Therefore, I am confused about the correct way to evaluate model performance on Collab.

Besides, I found that some of the submissions on the leaderboard of Collab utilize the validation set as training edges (both supervision signal and message-passing edges) while others use it only for inference (message-passing edges). This may cause an evaluation discrepancy for these models. For example, the current top-1 (GIDN@YITU) uses validation sets in the training, while ELPH uses the validation set only for inference.

Thus, I believe a common protocol for evaluating models on Collab needs to be established for a fair comparison.

Thanks,

weihua916 commented 10 months ago

Hi! The evaluation rule is stated as is. One can use validation edges for both training and inference as long as all hyper-parameters are selected based on validation edges (not test edges). As you rightly pointed out, our example code indeed only uses the validation set for inference, but it is just for simplicity. Your example code is totally valid, but it's a bit interesting to see you are validating on validation edges while also using validation edges as training supervision. So you are essentially using training loss to do model selection? Wouldn't that cause serious over-fitting?
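To make the rule above concrete, model selection on ogbl-collab is meant to be driven by validation Hits@50, along the lines of this sketch (the prediction tensors are placeholders standing in for a real model's scores):

```python
import torch
from ogb.linkproppred import Evaluator

evaluator = Evaluator(name='ogbl-collab')
evaluator.K = 50  # the leaderboard metric is Hits@50

# Placeholder scores for positive and negative validation edges.
pos_valid_pred = torch.rand(10_000)
neg_valid_pred = torch.rand(10_000)

valid_hits = evaluator.eval({
    'y_pred_pos': pos_valid_pred,
    'y_pred_neg': neg_valid_pred,
})['hits@50']

# The reported test number should come from the epoch / hyper-parameter
# setting with the best validation Hits@50, never from the test score itself.
```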

Barcavin commented 10 months ago

I think overfitting may not be an issue here, or the 2000-epoch training has not reached the overfitting regime yet; a more in-depth analysis may be needed. I also find it quite interesting that this naive method can achieve such good performance.

If the results can be reproduced, should the leaderboard get updated accordingly?


weihua916 commented 10 months ago

Got it. Thanks for clarifying. Please feel free to submit to our leaderboard yourself.