jctops / understanding-oversquashing


reproduction of results #3

Open Hugo-Attali opened 2 years ago

Hugo-Attali commented 2 years ago

Hello and thank you for your work.

I can't reproduce the results of the paper on the node classification task. Could you provide an example that reproduces the results, please? Thanks

AdarshMJ commented 1 year ago

Hi @Hugo-Attali, were you able to reproduce the results? @jctops I am able to run the code with the hyperparameters given in your paper, averaging over 10 random seeds, but I am unable to reproduce the results for Cora: I get around 79.69% as opposed to the 82% reported in the paper. I was wondering if you could share your experiment suite, e.g. how many epochs you train for and how many runs you average over. Thank you!

qubit0 commented 1 year ago

@AdarshMJ were you able to reproduce the result for any of the WebKB datasets (Texas, Cornell, Wisconsin)? These datasets have low homophily, so this algorithm should work better on them. However, in my hands, I couldn't even get close to any of the reported numbers. It's very frustrating not being able to reproduce the results, and on top of that the authors don't seem to care about it at all. They could easily upload a Jupyter notebook with a clear example (one that exactly reproduces their results), but I don't know why they hesitate to do so.

This might help you: when you run SDRF, set use_lcc = True. If use_lcc = False, the graph doesn't change that much (at least that's what I noticed for Citeseer). The idea is to use only the largest connected component. I don't like this, but it is what the "Diffusion Improves Graph Learning" paper did, and these authors decided to do the same. A rough sketch of the idea is below.
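
In case it helps, this is roughly what restricting a PyG dataset to its largest connected component looks like; it is only a generic sketch using torch_geometric/scipy utilities, not the repo's own use_lcc code path:

import numpy as np
import torch
from scipy.sparse.csgraph import connected_components
from torch_geometric.utils import to_scipy_sparse_matrix, subgraph

def restrict_to_lcc(data):
    # label every node with its connected component
    adj = to_scipy_sparse_matrix(data.edge_index, num_nodes=data.num_nodes)
    _, labels = connected_components(adj, directed=False)
    # boolean mask of the nodes in the largest component
    keep = torch.from_numpy(labels == np.bincount(labels).argmax())
    # drop all other nodes and relabel the remaining edges
    edge_index, _ = subgraph(keep, data.edge_index, relabel_nodes=True,
                             num_nodes=data.num_nodes)
    data = data.clone()
    data.x, data.y, data.edge_index = data.x[keep], data.y[keep], edge_index
    return data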

AdarshMJ commented 1 year ago

@qubit0 I wasn't able to reproduce the results shown in the paper. I think that if we naively plug in the hyperparameters from the paper, we won't be able to reproduce the results, so I did my own tuning.

I experimented with the following hyperparameters for the GNN training: hidden dimensions of 32, 64 and 128; LR of 0.01 and 0.001; weight decay of 5e-4; and dropout in the range 0.2130296 to 0.5130296. I used the SDRF hyperparameters given in the paper.

After applying SDRF to the data, I divide the dataset as follows:

from torch_geometric.transforms import RandomNodeSplit

# 10 random splits: 20% validation, 20% test, the rest used for training
transform = RandomNodeSplit(split="train_rest", num_splits=10, num_val=0.2, num_test=0.2)
data = transform(data)
print(data)

This changes the number of samples used for validation and testing. I tune my GNN hyperparameters on the validation samples and finally evaluate on the test samples. For each split I average over 5 runs, and I use 10 such random splits (roughly as in the sketch below). This is the highest accuracy I am getting for Citeseer:

Training with hidden_channels = 128, LR = 0.01, dropout = 0.2130296
Final accuracy over all splits: 0.7665 ± 0.0039
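
For concreteness, the loop over splits and runs looks roughly like this; train_and_eval is just a placeholder here for the actual GCN training/evaluation routine, which I'm not showing:

import numpy as np

split_accs = []
for split in range(10):                       # the 10 masks created by RandomNodeSplit
    train_mask = data.train_mask[:, split]
    val_mask = data.val_mask[:, split]
    test_mask = data.test_mask[:, split]
    # train_and_eval: placeholder for training a GCN and returning its test accuracy
    runs = [train_and_eval(data, train_mask, val_mask, test_mask) for _ in range(5)]
    split_accs.append(sum(runs) / len(runs))  # mean over the 5 runs of this split

split_accs = np.array(split_accs)
print(f"Final accuracy over all splits: {split_accs.mean():.4f} ± {split_accs.std():.4f}")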

For the WebKB datasets I am getting the following results:

Cornell - 0.5850 ± 0.0252
Wisconsin - 0.5446 ± 0.0282
Texas - 0.6702 ± 0.0302

which is quite different from what they report in the paper. I don't know how they get ~70% accuracy for the Texas dataset.

You can also measure the spectral gap before and after applying SDRF to verify whether it's actually modifying the graph; I think that's a good sanity check (see the sketch below). Also, thank you for the LCC recommendation, I will take note of it :)
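
Something like this should work for that check; it is a rough sketch using networkx that takes the second-smallest eigenvalue of the normalized Laplacian on the largest connected component (fine for graphs of this size, too slow for large ones):

import networkx as nx
import numpy as np
from torch_geometric.utils import to_networkx

def spectral_gap(data):
    # build an undirected networkx graph from the PyG data object
    G = to_networkx(data, to_undirected=True)
    # restrict to the largest connected component so only one eigenvalue is zero
    G = G.subgraph(max(nx.connected_components(G), key=len))
    # dense eigendecomposition of the normalized Laplacian (fine for small graphs)
    L = nx.normalized_laplacian_matrix(G).toarray()
    return np.linalg.eigvalsh(L)[1]   # second-smallest eigenvalue = spectral gap

# e.g. compare spectral_gap(data) before and after SDRF rewiring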

jakubbober commented 1 year ago

Hi @Hugo-Attali @AdarshMJ @qubit0, I tried to reproduce the results for a long time and managed to get very close. The code is here: https://github.com/jakubbober/discrete-curvature-rewiring, and the results are written up here: https://arxiv.org/abs/2207.08026. Some files in the repository are irrelevant for reproducing the results, such as the ph/ directory, and cheeger_bounds.py, compute_adj_powers.py and compute_cheeger.py in the experiment/ directory. The most important files are save_models.py and test_performance.py in the experiment/ directory. Note that I also test some simpler, classical types of discrete curvature. Hope this helps!

AdarshMJ commented 1 year ago

Hi @jakubbober, thank you for the update. In the report you provided above, your accuracy for the Texas dataset (highlighted in red) is around 65% and 68% for two runs. I get something similar, around 67%. This is still not close to the roughly 70% they report in the paper.

jakub-bober commented 1 year ago

@AdarshMJ I think the extra few percent might have come from taking the best run out of multiple runs, as this dataset is very sensitive to rewiring (each rewiring instance might have a significant effect on the accuracy, because the dataset is small).

qubit0 commented 1 year ago

@AdarshMJ In my case, for SDRF+undirected I got:

Cornell -  46.48 ± 1.22
Wisconsin - 52.63 ± 1.24
Texas - 56.81 ± 1.91

I did my test a little differently: for every mask, I ran the model 10 times and computed the mean of those 10 values. This gave me 10 means, and I then reported the mean and the standard deviation of those 10 means (see the short sketch below). Your accuracies seem much closer to the paper than mine (except for Wisconsin), albeit under different splits.
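
In code, that aggregation is just the following (a minimal sketch; accs is a stand-in for the actual 10 × 10 array of test accuracies, masks × runs):

import numpy as np

accs = np.random.rand(10, 10)        # placeholder: rows = masks, columns = runs
per_mask_means = accs.mean(axis=1)   # one mean accuracy per mask
print(per_mask_means.mean(), per_mask_means.std())   # reported mean ± std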

@jakubbober Thanks for sharing your results. Even in your case, the Wisconsin and Texas results are off, right? These datasets have low homophily and the algorithm should help on them, but that doesn't seem to be the case. Also, the problem with the WebKB datasets is that they are very finicky because of the very uneven node label distribution, which means that in some train/val/test splits the training set lacks certain node labels.