shchur / gnn-benchmark

Framework for evaluating Graph Neural Network models on semi-supervised node classification task
https://arxiv.org/abs/1811.05868
MIT License

The difference of Pubmed acc between 2 tables. #14

Closed Cjwd5y90 closed 1 year ago

Cjwd5y90 commented 1 year ago

I know that the Cora and Citeseer data in your paper differ from Planetoid's, while the Pubmed data is the same as Planetoid's. I also understand that you use a random 20/30/rest per-class split for train/val/test, whereas Planetoid's random split uses 20 nodes per class for training and 500/1000 nodes for val/test. I have run experiments with both splits, and in both cases the accuracy of GNNs (GCN, GAT, and so on) is closer to Table 1 and higher than Table 2. So why is there an 8% gap in Pubmed accuracy between the two tables?

shchur commented 1 year ago

Hi, thank you for the question! I assume you are referring to Table 2a and Table 2b.

Both 2a and 2b consider the same sizes of train/val/test sets, but different nodes are used as train/val/test sets. The point we are trying to make with this comparison is that different splits can lead to totally different results & model rankings, so experiments should be done on multiple splits instead of a single one.
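To make the "same sizes, different nodes" idea concrete, here is a minimal sketch of how such a random split with fixed set sizes could be generated. The function name and its defaults are hypothetical and just for illustration; this is not the repo's actual split code.

```python
import numpy as np

def random_planetoid_style_split(labels, num_train_per_class=20,
                                 num_val=500, num_test=1000, seed=0):
    """Randomly assign nodes to train/val/test with fixed set sizes.

    Different seeds yield different node assignments even though the
    split sizes stay identical -- the setting Tables 2a and 2b compare.
    (Illustrative sketch, not the benchmark's actual implementation.)
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)

    # Sample a fixed number of training nodes from each class.
    train_idx = []
    for c in np.unique(labels):
        class_idx = np.flatnonzero(labels == c)
        train_idx.extend(rng.choice(class_idx, num_train_per_class,
                                    replace=False))
    train_idx = np.array(train_idx)

    # Shuffle the remaining nodes and carve out val/test sets.
    rest = rng.permutation(np.setdiff1d(np.arange(len(labels)), train_idx))
    return train_idx, rest[:num_val], rest[num_val:num_val + num_test]
```

Running this with two different seeds gives two splits of exactly the same shape but with different nodes in each set, which is why model scores can differ between them.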

Cjwd5y90 commented 1 year ago

Hi, thanks for your reply! The experiments and splits in the paper are very reasonable. Table 1 and Table 2b both use 20 random nodes per class for training; Table 1 then uses 30/rest per class for val/test, while Table 2b uses 500/1000 nodes for val/test. So the training sets in both tables consist of 60 random nodes, but I still cannot understand why there is an 8% gap in Pubmed accuracy between Table 1 and Table 2b.

shchur commented 1 year ago

If I recall correctly (we wrote the paper a long time ago), we trained models on 100 splits for all datasets, and for Table 2b we selected the split that resulted in the worst test-set score. Even though, on average, models score around 78 on Pubmed, in the worst case some splits can yield a much lower score of around 70. This was done for demonstration purposes, to show that reporting results on a single split can be noisy and misleading.
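The gap described above can be illustrated with a tiny simulation. The numbers below are synthetic, not actual benchmark results; they only demonstrate how reporting the worst of many splits differs from reporting the mean over all of them.

```python
import numpy as np

# Hypothetical per-split test accuracies for one model on Pubmed,
# drawn from a normal distribution for illustration only.
rng = np.random.default_rng(42)
scores = rng.normal(loc=78.0, scale=2.5, size=100).clip(0, 100)

mean, std = scores.mean(), scores.std()
worst = scores.min()  # the kind of split a worst-case table would report

print(f"mean +/- std over 100 splits: {mean:.1f} +/- {std:.1f}")
print(f"worst single split:           {worst:.1f}")
```

Even with a modest per-split standard deviation, the minimum over 100 splits falls well below the mean, which is exactly why a single-split number can be misleading.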

Cjwd5y90 commented 1 year ago

This issue has indeed troubled me for a long time. Thanks for your patient reply! Wish you a good day!