Unable to reproduce results

peihaowang commented 3 years ago

I'm trying to reproduce the reported results on the OGB and ZINC datasets, but I have not been able to match the reported performance.

I first ran the provided script hiv.sh to train a Graphormer on the MolHiv dataset without pre-training. The final AUC was 73.10%. Then I followed the instructions and hyper-parameter settings in the paper to do pre-training. I pre-trained on PCQM4M for 20 epochs (until the loss converged) and fine-tuned the model on MolHiv for 8 epochs (as specified in the script). The best result turned out to be 76.25%.

Despite some improvement, the final AUC is not as high as reported in the paper. I also tried to reproduce the result on ZINC via the example script, but the best MAE is 0.1576, which is worse (higher) than the 0.122 reported in the paper.

I'm wondering what I might be missing that leads to this poor performance. Could you share more reproduction details? My Python environment is listed below:

pytorch==1.9.0
pytorch-geometric==1.7.2
pytorch-scatter==2.0.8
pytorch-sparse==0.6.11
pytorch-lightning==1.3.0
ogb==1.3.1
cudatoolkit==11.1
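
For anyone comparing environments, a minimal sketch of how the installed versions could be collected and diffed against a requirements list. The pip distribution names below are an assumption (the list above uses conda-style names such as pytorch), and `collect_versions` and `diff_versions` are hypothetical helpers, not part of the Graphormer repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed pip distribution names for the packages listed above.
PACKAGES = [
    "torch",
    "torch-geometric",
    "torch-scatter",
    "torch-sparse",
    "pytorch-lightning",
    "ogb",
]


def collect_versions(names):
    """Return {distribution name: installed version, or None if missing}."""
    report = {}
    for name in names:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


def diff_versions(installed, required):
    """Return {name: (required, installed)} for every package that differs."""
    return {
        name: (want, installed.get(name))
        for name, want in required.items()
        if installed.get(name) != want
    }


if __name__ == "__main__":
    # Print the environment in name==version form, ready to paste into an issue.
    for name, found in collect_versions(PACKAGES).items():
        print(f"{name}=={found}" if found else f"{name}: not installed")
```

Passing the output of `collect_versions` together with a dict of the versions pinned in the repository's requirements to `diff_versions` would list only the packages that differ.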

I'd really appreciate it if someone could share their reproduced results and give me some suggestions.

zhengsx commented 3 years ago

Could you provide the detailed logs and training scripts used in all your experiments? Besides, I notice several points in your description that may lead to different reproduction results. I list them here and recommend double-checking each of them in your reproduction:

The environment is not the same as the requirements. We haven't gone through all common versions of each package, but we have found that in certain cases different versions lead to different results. See here.
Have you trained for enough epochs? We don't use "until the loss converges" as the stopping criterion. E.g., 20 epochs on PCQM4M is far from enough; as stated in our paper, we train for 300 epochs on this dataset. Similarly, for ZINC, we train for 10K epochs. In addition, for downstream tasks the final performance is sensitive to the choice of checkpoint, so you can try evaluating multiple pre-trained checkpoints on the downstream tasks.
You could also refer to the reproduction process done by the community, and cross-check your own reproduction procedure against it.
Double-check the hyper-parameter configuration used in your experiments against the paper.
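
The checkpoint suggestion above could be sketched roughly as follows: fine-tune from several pre-trained checkpoints, record a validation score for each, and keep the best one. This is only an illustration; `pick_best_checkpoint` is a hypothetical helper, and the checkpoint names and scores in the usage part are made-up placeholders, not results from this thread or the paper.

```python
def pick_best_checkpoint(val_scores, higher_is_better):
    """Return the checkpoint name with the best validation score.

    val_scores: dict mapping checkpoint name -> validation metric.
    higher_is_better: True for metrics such as AUC, False for MAE.
    """
    if not val_scores:
        raise ValueError("val_scores must contain at least one checkpoint")
    chooser = max if higher_is_better else min
    return chooser(val_scores, key=val_scores.get)


if __name__ == "__main__":
    # Placeholder names and numbers, for illustration only.
    auc_by_ckpt = {"ckpt_a": 0.70, "ckpt_b": 0.75, "ckpt_c": 0.72}
    print(pick_best_checkpoint(auc_by_ckpt, higher_is_better=True))
```

Selecting on the validation split rather than the test split keeps the reported test number comparable with the paper's protocol, assuming the paper selects checkpoints the same way.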

peihaowang commented 3 years ago

Thanks to your reply, we managed to reproduce your results on the ZINC dataset, but unfortunately not on PCBA or MolHiv. I'm wondering if the checkpoint of the pre-trained model could be provided, since the pre-training might be the key factor hindering our reproduction (we strictly followed the hyper-parameters and training recipe provided in your paper and the reproduction process). BTW, I would also like to hear the authors' view on whether Graphormer can be directly applied to datasets from other domains, such as social networks, or whether it only applies to molecular data.

zhengsx commented 3 years ago

We're willing to help if you have strictly followed all the instructions and still fail to reproduce the results. Please provide the detailed logs, training scripts and Python environments used in your experiments for PCBA and MolHiv. Please make sure that you have resolved all the potential problems listed in my previous comment.

zhengsx commented 3 years ago

Closing this issue due to long inactivity. Feel free to reopen it if the problem still exists.

microsoft / Graphormer

Unable to reproduce results #20