snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

On What constitutes external data in PCQM4M-LSC #118

Closed mearcstapa-gqz closed 3 years ago

mearcstapa-gqz commented 3 years ago

The "Use of External Data" section of the rules says: "For each dataset, models need to be developed only using the provided data." The PCQM4M overview says: "Moreover, predicting the quantum chemical property only from 2D molecular graphs without their 3D equilibrium structures is also practically favorable. This is because obtaining 3D equilibrium structures requires DFT-based geometry optimization, which is expensive on its own."

So, is it allowed to use 3D coordinates during model development? For example, use MolToXYZ from rdkit after this line. https://github.com/snap-stanford/ogb/blob/c118ba18f0ca13cc8fdf03c6ead13f64316fbe0b/ogb/utils/mol.py#L13 Moreover, is it allowed to use commonly used computational chemistry packages other than rdkit, such as pyscf, to get input features?

mearcstapa-gqz commented 3 years ago

The same question applies to other rdkit features: Gasteiger charges, Crippen descriptors, and others.
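
For reference, both feature families named here can be computed from the 2D molecular graph alone, with no 3D structure. The following is a minimal sketch and not part of OGB's feature pipeline: the function name `rdkit_extra_features` and the dictionary keys are hypothetical, and RDKit is imported inside the function so the snippet loads even where RDKit is not installed.

```python
def rdkit_extra_features(smiles: str) -> dict:
    """Compute Gasteiger partial charges and Crippen descriptors for a SMILES string.

    Hypothetical helper for illustration; the name and the returned keys are
    assumptions, not part of OGB.
    """
    # Imported lazily so RDKit is only required when the function is called.
    from rdkit import Chem
    from rdkit.Chem import AllChem, Crippen

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")

    # Stores one partial charge per atom in the "_GasteigerCharge" property.
    AllChem.ComputeGasteigerCharges(mol)
    charges = [atom.GetDoubleProp("_GasteigerCharge") for atom in mol.GetAtoms()]

    return {
        "gasteiger_charges": charges,        # one value per heavy atom
        "crippen_logp": Crippen.MolLogP(mol),  # Wildman-Crippen logP estimate
        "crippen_mr": Crippen.MolMR(mol),      # Wildman-Crippen molar refractivity
    }
```

Calling `rdkit_extra_features("CCO")` returns three per-atom charges (hydrogens are implicit here) plus two molecule-level scalars.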

weihua916 commented 3 years ago

Thank you for raising the issue. We will discuss this internally and get back to you!

weihua916 commented 3 years ago

Hi! We have released a new set of rules for the PCQM4M-LSC dataset, which should address your questions here. In short, you can use those packages as long as test-time inference can be performed within the limited computational budget. The 3D information can also be used as long as it can be obtained within the budget.

Hope this helps, and please let us know if you have any questions or need further clarification!

mearcstapa-gqz commented 3 years ago

A correction to my earlier statement ("For example, use MolToXYZ from rdkit after this line."): a Mol just generated via MolFromSmiles has no conformation, so one first has to generate one, e.g. with EmbedMultipleConfs, MMFFOptimizeMolecule, CalcEnergy, etc., before MolToXYZBlock can return coordinates.
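
The steps described in this correction can be sketched roughly as follows. This is a hedged illustration, not the code used by OGB or by any challenge entry: the function name `smiles_to_xyz_block` and its default arguments are hypothetical, and it uses `MMFFOptimizeMoleculeConfs` (which returns an energy per conformer) in place of separate `MMFFOptimizeMolecule` and `CalcEnergy` calls.

```python
def smiles_to_xyz_block(smiles: str, num_confs: int = 3, seed: int = 0) -> str:
    """Return an XYZ block for the lowest-energy MMFF conformer of a SMILES string.

    Hypothetical helper for illustration; name and defaults are assumptions.
    """
    # Imported lazily so RDKit is only required when the function is called.
    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    mol = Chem.AddHs(mol)  # explicit hydrogens are needed for a sensible 3D geometry

    # A freshly parsed Mol has no conformers; embedding creates them.
    conf_ids = list(AllChem.EmbedMultipleConfs(mol, numConfs=num_confs, randomSeed=seed))
    if not conf_ids:
        raise RuntimeError("conformer embedding failed")

    best_id = conf_ids[0]
    if AllChem.MMFFHasAllMoleculeParams(mol):
        # One (not_converged, energy) pair per conformer, in conformer order.
        results = AllChem.MMFFOptimizeMoleculeConfs(mol)
        energies = [energy for _, energy in results]
        best_id = conf_ids[energies.index(min(energies))]

    # Only now does the Mol carry coordinates that can be written out.
    return Chem.MolToXYZBlock(mol, confId=best_id)
```

The first line of the returned block is the atom count, so `smiles_to_xyz_block("CCO")` starts with `9` once hydrogens are added. Force-field geometries like these are much cheaper than DFT-optimized ones, but they are also only approximations of the equilibrium structure.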

mearcstapa-gqz commented 3 years ago

Thanks for the clarifications! Must say the 0.1 sec per molecule budget is well placed lol.
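
One rough way to check a feature pipeline against such a budget is to time it over a sample of molecules. The sketch below assumes the 0.1 second figure quoted in this thread and assumes it is read as an average per molecule; the official rules remain the authority on the exact limit and on the hardware it refers to. The names `mean_seconds_per_item` and `within_budget` are hypothetical.

```python
import time
from typing import Any, Callable, Iterable

# Figure quoted in this thread; an assumption here, check the official rules.
BUDGET_SECONDS_PER_MOLECULE = 0.1


def mean_seconds_per_item(fn: Callable[[Any], Any], items: Iterable[Any]) -> float:
    """Average wall-clock time of fn(item) over the given items."""
    items = list(items)
    if not items:
        raise ValueError("need at least one item to time")
    start = time.perf_counter()
    for item in items:
        fn(item)
    return (time.perf_counter() - start) / len(items)


def within_budget(
    fn: Callable[[Any], Any],
    items: Iterable[Any],
    budget: float = BUDGET_SECONDS_PER_MOLECULE,
) -> bool:
    """True if fn's average time per item does not exceed the budget."""
    return mean_seconds_per_item(fn, items) <= budget
```

For example, `within_budget(featurize, sample_smiles)` with any featurization callable gives a quick signal on whether conformer generation or extra descriptors are affordable. Timings depend on the machine, so a pass on one machine is only indicative.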