microsoft / tf-gnn-samples

TensorFlow implementations of Graph Neural Networks
MIT License
910 stars 229 forks source link

QM9 Mulliken partial charges #19

Open gasteigerjo opened 3 years ago

gasteigerjo commented 3 years ago

Thank you for making the code public and helping with the target value scaling last time!

Looking at the source code I've now discovered that you are using the QM9 atom features as-is. However, your version of the dataset includes the Mulliken partial charges, as described in the original paper: https://www.nature.com/articles/sdata201422/

It is rather easy to spot since it is the only float-valued node feature (number 7). This feature should not be available to the model. You can only obtain these charges via QM-based calculations based on the molecular geometries. This is therefore an output, not an input.

In my opinion, this feature should be excluded from the dataset, as done e.g. in the Gilmer MPNN paper. Otherwise you have information leakage. This seems to be an accident -- or did I overlook something? What are your thoughts?

mmjb commented 3 years ago

This is indeed seems like an accident - I have to admit that I never delved as deeply into this dataset as you have done, but instead simply used this because it was conveniently lying around. To be honest, the correct thing to do here is to step away from this dataset (which is not super-representative if you're looking at computational chemistry, but keeps getting cited for that anyway...).

That being said, it's a bit unclear to me what to do in the codebase here; I feel there's an argument to keep the code as-is to match the published paper, but maybe add a note about this issue?

gasteigerjo commented 3 years ago

QM9 is indeed a bit strange, since equilibrium geometries are rather hard to obtain and not unique. But I think it's a good dataset nevertheless, even if there are certainly many other great Chemistry datasets out there.

Yes, I see your point. I think a good thing to do would be to (a) add a note in the readme and (b) raise a warning if people load the QM9 dataset as-is. You could even introduce a flag which causes the loader to exclude feature 7.