snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

MAG240M: The difference in accuracy of rgnn.py depending on the batch size when evaluating #201

Closed island02 closed 3 years ago

island02 commented 3 years ago

In rgnn.py, I reproduced the accuracy described in README.md (70.48%) for the validation data.

```
Testing: 100%|███████████████████████████████████████████████████████████████████| 8685/8685 [10:28<00:00, 13.82it/s]
--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_acc': 0.7048651576042175}
```

However, I noticed that if I change the hard-coded batch_size on line 433 to speed up evaluation, the accuracy changes. For example, with batch_size=1024, the accuracy decreased.

```
DATALOADER:0 TEST RESULTS
{'test_acc': 0.6956417560577393}
```

Conversely, when I set batch_size=1, the accuracy increased.

```
DATALOADER:0 TEST RESULTS
{'test_acc': 0.7176515460014343}
```

I wondered whether the remainder left over after splitting the data into batches was being truncated, but even if roughly 1,000 examples were dropped, that would not explain a 1% accuracy change over more than 130,000 validation examples.

In addition, when I output y_pred using save_test_submission() in mag240m.py, the predictions were almost identical regardless of batch size.

Why does changing datamodule.batch_size affect the accuracy? Is the accuracy computed by sampling only a portion of the validation data, so that it varies with batch size?

weihua916 commented 3 years ago

Hi! This is caused by a bug in pytorch-lightning's accuracy computation: see https://github.com/snap-stanford/ogb/discussions/141#discussioncomment-584011 and https://github.com/PyTorchLightning/pytorch-lightning/issues/6889. I believe batch_size=1 gives you the most accurate result, but you may need to re-implement the evaluation code until the bug in pytorch-lightning is fixed.
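To illustrate the effect (a minimal sketch with made-up numbers, not the actual rgnn.py code): if per-batch accuracies are averaged with equal weight, a small remainder batch at the end of the epoch gets the same weight as a full batch and can shift the reported accuracy. With batch_size=1 every batch has the same size, so the unweighted mean coincides with the true accuracy.

```python
# Hypothetical per-batch results: two full batches and one small remainder batch.
correct = [900, 900, 5]     # correct predictions per batch (assumed values)
sizes   = [1000, 1000, 10]  # batch sizes; the last batch is the remainder

# True accuracy: total correct over total examples.
true_acc = sum(correct) / sum(sizes)           # 1805 / 2010 ≈ 0.8980

# Buggy accuracy: unweighted mean of per-batch accuracies, which
# over-weights the tiny remainder batch (0.5 accuracy here).
batch_accs = [c / s for c, s in zip(correct, sizes)]
buggy_acc = sum(batch_accs) / len(batch_accs)  # (0.9 + 0.9 + 0.5) / 3 ≈ 0.7667

print(true_acc, buggy_acc)
```

When all batches have equal size (e.g. batch_size=1), the two quantities agree, which matches the observation that batch_size=1 gives the most accurate result.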

island02 commented 3 years ago

Thanks for the answer, and sorry — this was already covered in that discussion. I had also tried setting datamodule.sizes to a very large value, as suggested there, to rule out sampling uncertainty. However, I could not reproduce the 71.7% accuracy that trainer.test reports with batch_size=1 when I recomputed the accuracy from the output file. Since the discrepancy comes from a bug in pytorch-lightning, I am going to trust my own calculation.
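For reference, recomputing the accuracy from saved predictions sidesteps the framework's metric aggregation entirely. A minimal sketch (the helper name and array inputs are illustrative, not the actual save_test_submission() output format):

```python
import numpy as np

def accuracy(y_pred, y_true):
    """Exact accuracy over the full prediction array, with no
    per-batch averaging involved."""
    y_pred = np.asarray(y_pred)
    y_true = np.asarray(y_true)
    return float((y_pred == y_true).mean())

# Example with toy labels: 3 of 4 predictions match.
print(accuracy([1, 2, 3, 4], [1, 2, 0, 4]))  # 0.75
```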

weihua916 commented 3 years ago

Yes, I believe the correct code should give you 71.7%. Let us know if this is the case.

weihua916 commented 3 years ago

Actually, there seems to be an easy workaround: https://github.com/PyTorchLightning/pytorch-lightning/issues/6889#issuecomment-830234986
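The general idea behind this kind of workaround is to accumulate raw correct/total counts across batches and compute the accuracy once per epoch, instead of logging a per-batch accuracy and letting the framework average it. A framework-agnostic sketch (class and method names are illustrative, not the actual example-script code):

```python
class AccuracyAccumulator:
    """Accumulates correct/total counts so the final accuracy is
    exact regardless of how the data was batched."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, preds, labels):
        # Called once per batch with predicted and true labels.
        self.correct += sum(int(p == y) for p, y in zip(preds, labels))
        self.total += len(labels)

    def compute(self):
        # Called once at epoch end.
        return self.correct / self.total

# Usage: uneven batch sizes no longer distort the result.
acc = AccuracyAccumulator()
acc.update([1, 2, 3], [1, 2, 0])  # full batch: 2/3 correct
acc.update([5], [5])              # remainder batch: 1/1 correct
print(acc.compute())              # 3/4 = 0.75
```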

rusty1s commented 3 years ago

I will take care of adding the workaround to our examples.

rusty1s commented 3 years ago

This is now fixed in the example scripts. We will update the validation accuracy soon.

weihua916 commented 3 years ago

The validation accuracy and test accuracy have been updated with the new code: https://github.com/snap-stanford/ogb/blob/master/examples/lsc/mag240m/README.md