graphdeeplearning / benchmarking-gnns

Repository for benchmarking graph neural networks
https://arxiv.org/abs/2003.00982
MIT License

Please give more details about the provenance of datasets #39

Closed: DomInvivo closed this issue 4 years ago

DomInvivo commented 4 years ago

From your paper and GitHub repository, it is very difficult to understand what we are predicting and where the datasets come from.

For ZINC, you mention "constrained solubility", but I can't find any reference to it in the ZINC dataset that your paper cites. It is not clear whether it is a computed property or a measured one, or what method is used to measure or compute it. Can you state the name of the property more clearly, and make the ZINC IDs available so they can be checked? Additionally, ZINC has 230 million molecules, but you only use 12,000. How do you select the ones to include?

For CIFAR10 and MNIST, you do not mention how the images were clustered into superpixels: what method was used, and what the average resolution of the resulting image is.

For PATTERN and CLUSTER, you do not mention which patterns we are looking to find, the average degree of the graphs, the diameter distribution, the diameter and degree of the patterns, etc.
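
To make the request concrete, statistics such as average degree and diameter could be reported per graph along the following lines. This is only a minimal sketch in plain Python, not code from this repository: it assumes each graph is available as an undirected adjacency dictionary, and `bfs_eccentricity` and `graph_stats` are hypothetical helper names chosen for illustration.

```python
from collections import deque


def bfs_eccentricity(adj, source):
    """Return the largest shortest-path distance from `source` within its component."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return max(dist.values())


def graph_stats(adj):
    """Average degree and diameter of an undirected graph.

    Assumption: `adj` maps every node to the set of its neighbors.
    For a disconnected graph, the returned diameter is the largest
    diameter among its connected components.
    """
    n = len(adj)
    avg_degree = sum(len(nbrs) for nbrs in adj.values()) / n
    diameter = max(bfs_eccentricity(adj, v) for v in adj)
    return avg_degree, diameter


# Toy example: the path graph 0 - 1 - 2 - 3
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(graph_stats(path))  # (1.5, 3)
```

Collecting these two numbers over every graph in a dataset would give the degree and diameter distributions asked about above.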

This information is useful for evaluating whether the performance of the models is satisfactory. I feel that the current description leaves us unable to understand why certain networks are better than others, and makes the benchmarking of GNNs more about "beating the benchmarks" than about increasing their discriminative abilities.

Thank you, and great work on the paper; it was really needed in the GNN community.

chaitjo commented 4 years ago

Hi @DomInvivo, thank you for your interest and for your much-needed feedback; I appreciate it! Based on your comments, I agree that we can present the datasets better and intend to update the paper in the next version. In general, we have detailed descriptions of all datasets along with statistics available in Appendix A. Have you had a look at it and does it help answer some of your queries?

Re. ZINC:

Re. CIFAR and MNIST:

Re. PATTERN and CLUSTER:

I am happy to follow up and hear your feedback.