Please give more details about the provenance of datasets

graphdeeplearning / benchmarking-gnns

Repository for benchmarking graph neural networks

MIT License

It is very difficult to tell from your paper and GitHub repository what we are predicting and where the datasets come from.

For ZINC, you mention "constrained solubility", but I cannot find any reference to it in the ZINC dataset that your paper cites. It is not clear whether it is a computed or a measured property, or which method is used to compute or measure it. Can you state the name of the property more clearly and make the ZINC IDs available so they can be checked? Additionally, ZINC has 230 million molecules, but you only use 12,000. How did you select the ones to include?

For CIFAR10 and MNIST, you do not mention how the images were clustered into superpixels, which method was used, or what the average resolution of the resulting images is.

For PATTERN and CLUSTER, you do not mention which patterns we are looking for, the average degree of the graphs, the diameter distribution, the diameter and degree of the patterns, etc.

This information is useful for evaluating whether the performance of the models is satisfactory. I feel that the current description leaves us unable to understand why certain networks are better than others, and makes the benchmarking of GNNs more about "beating the benchmarks" than about increasing their discriminative abilities.

Thank you, and great work on the paper; it was really needed in the GNN community.

Hi @DomInvivo, thank you for your interest and for your much-needed feedback; I appreciate it! Based on your comments, I agree that we can present the datasets better and intend to update the paper in the next version. In general, we have detailed descriptions of all datasets along with statistics available in Appendix A. Have you had a look at it and does it help answer some of your queries?

Re. ZINC:

The constrained solubility property was selected based on pioneering work in molecule generation and our team's previous work. We will do our best to add more details in the updated paper.
We chose only 12K random samples because we initially wanted to run our benchmarks quickly and efficiently. Indeed, it would be a good idea to make the full ZINC dataset available.
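To make a random subset like this checkable, one option is to draw it with a fixed seed and publish the selected identifiers. The sketch below is only an illustration of that idea, not the procedure used for the benchmark: `sample_ids`, the seed value, and the placeholder `MOL…` identifiers are all hypothetical.

```python
import random


def sample_ids(all_ids, k, seed=0):
    """Draw k identifiers uniformly at random without replacement.

    A fixed seed makes the draw repeatable, so the returned list can be
    published and re-derived by anyone holding the same pool of IDs.
    """
    rng = random.Random(seed)
    pool = sorted(all_ids)  # sort first so the result does not depend on input order
    return sorted(rng.sample(pool, k))


# Placeholder identifiers (hypothetical, not real ZINC IDs).
pool = [f"MOL{i:06d}" for i in range(100_000)]
subset = sample_ids(pool, k=12_000, seed=0)
print(len(subset), subset[:3])
```

Sorting the pool before sampling is a deliberate choice here: it keeps the selection identical even if the source file lists the molecules in a different order.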

Re. CIFAR and MNIST:

I believe we do mention our preprocessing technique in Appendix A. We actually follow this codebase. Let me know if more information is needed here.

Re. PATTERN and CLUSTER:

I believe we have addressed some of your queries in Appendix A. I agree that it would be a good idea to provide information on important graph properties such as the diameter and degree.
In a nutshell, we can control the difficulty of these synthetic tasks via the Stochastic Block Model (SBM), which allows us to control intra- and inter-community connectivity in the generative process. All graphs and sub-graphs (in PATTERN) are generated via SBMs, for which we give the exact configurations in the Appendix. Node features are assigned randomly.
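As a rough illustration of how an SBM exposes these two knobs, and of how the statistics requested above (average degree, diameter) could be computed on a generated graph, here is a minimal standard-library sketch. The block sizes and the probabilities `p_intra` and `q_inter` are made-up values for illustration, not the configurations from the Appendix, and the function names are hypothetical.

```python
import random
from collections import deque


def sample_sbm(block_sizes, p_intra, q_inter, seed=0):
    """Sample an undirected SBM graph.

    Nodes in the same block are connected with probability p_intra,
    nodes in different blocks with probability q_inter.
    Returns (adjacency sets, block label per node).
    """
    rng = random.Random(seed)
    labels = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(labels)
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            prob = p_intra if labels[i] == labels[j] else q_inter
            if rng.random() < prob:
                adj[i].add(j)
                adj[j].add(i)
    return adj, labels


def average_degree(adj):
    """Mean number of neighbours per node."""
    return sum(len(nb) for nb in adj) / len(adj)


def diameter(adj):
    """Longest shortest path, via BFS from every node; None if disconnected."""
    best = 0
    for src in range(len(adj)):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) < len(adj):
            return None
        best = max(best, max(dist.values()))
    return best


# Illustrative values only: three communities of 20 nodes each.
adj, labels = sample_sbm([20, 20, 20], p_intra=0.5, q_inter=0.1, seed=0)
print("average degree:", average_degree(adj), "diameter:", diameter(adj))
```

Raising `q_inter` towards `p_intra` blurs the community structure, which is one way the difficulty of such a task can be tuned.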

I am happy to follow up and hear your feedback.

Please give more details about the provenance of datasets #39