snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

How are the energies of PCQM4M-LSC determined? #250

Status: Closed (DomInvivo closed this issue 2 years ago)

DomInvivo commented 2 years ago

Hello,

In the OGB-LSC paper, it is mentioned that the PCQM4M-LSC dataset comes from the 2015 PubChemQC project. This project uses a semi-empirical method called PM6 to compute the low-energy conformers of the molecules, and they compute a HOMO-LUMO gap for each of the generated conformers. Yet the paper claims that the energy comes from DFT methods. It is unclear to me how the energy was computed.

Do you take directly the energies provided by PubChemQC (which are PM6 not DFT)? Or do you take their conformers and pass them through a DFT to get the energy at that specific conformation?

The paper explaining PM6 can be found here

Thank you for your help

nakatamaho commented 2 years ago

Hello DomInvivo, thanks for your message.

> Do you take directly the energies provided by PubChemQC (which are PM6 not DFT)? Or do you take their conformers and pass them through a DFT to get the energy at that specific conformation?

The latter is correct. In PCQM4M-LSC, we take data from the JCIM 2017 paper (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00083). We performed B3LYP/6-31G* molecular geometry optimizations and then obtained the B3LYP/6-31G* HOMO-LUMO gap.

More specifically, for each molecule we generated an initial geometry from its InChI using Open Babel. This geometry was passed to GAMESS to obtain a PM3-optimized geometry, which was then refined with Hartree-Fock. Finally, we used the B3LYP method to perform the molecular geometry optimization.

Best regards, Nakata Maho
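Editorial sketch (not the PubChemQC code): the staged procedure described above can be pictured as feeding each stage's optimized geometry into the next, more expensive method, and then reading the HOMO-LUMO gap off the final orbital energies. The names `refine_geometry`, `homo_lumo_gap_ev`, and the `optimize` callback are hypothetical, the actual quantum-chemistry call is left to the caller, and the closed-shell assumption and the orbital energies in the usage lines are illustrative only.

```python
from typing import Callable, Sequence

# Hartree-to-electronvolt conversion factor (CODATA value, rounded).
HARTREE_TO_EV = 27.211386

# Stage order as described in the comment above: PM3, then Hartree-Fock, then B3LYP.
STAGES = ("PM3", "HF", "B3LYP")


def refine_geometry(initial_geometry, optimize: Callable, stages: Sequence[str] = STAGES):
    """Run each stage on the geometry produced by the previous stage.

    `optimize(method, geometry)` is a caller-supplied placeholder for whatever
    actually runs the calculation (for example, a wrapper around GAMESS).
    """
    geometry = initial_geometry
    for method in stages:
        geometry = optimize(method, geometry)
    return geometry


def homo_lumo_gap_ev(orbital_energies_hartree: Sequence[float], n_electrons: int) -> float:
    """HOMO-LUMO gap in eV, assuming a closed-shell molecule (two electrons per orbital)."""
    if n_electrons <= 0 or n_electrons % 2 != 0:
        raise ValueError("closed-shell sketch: need a positive, even electron count")
    energies = sorted(orbital_energies_hartree)
    n_occupied = n_electrons // 2
    if n_occupied >= len(energies):
        raise ValueError("no unoccupied orbital available")
    homo = energies[n_occupied - 1]  # highest occupied orbital
    lumo = energies[n_occupied]      # lowest unoccupied orbital
    return (lumo - homo) * HARTREE_TO_EV


# Usage with a dummy optimizer and made-up orbital energies (in hartree).
final_geometry = refine_geometry("initial", lambda method, geom: f"{geom}->{method}")
gap = homo_lumo_gap_ev([-0.50, -0.30, 0.10, 0.40], n_electrons=4)
print(final_geometry, f"{gap:.3f} eV")
```

The optimizer is injected as a callback so that the stage ordering can be checked without a quantum-chemistry package installed.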

DomInvivo commented 2 years ago

Thank you for confirming the method. Do you plan on releasing more information than the HOMO-LUMO gap, such as the 3D structures, vibrational frequencies, lowest-energy conformations, orbital energies, atomic charges, etc.? This kind of information could be useful for many ML projects.

weihua916 commented 2 years ago

We are going to release the 3D structures. Also, we are going to update the PCQM4M dataset; see here.

DomInvivo commented 2 years ago

Is it possible to release other properties computed with B3LYP, not just the 3D structures? For example, the infrared vibrational frequencies and intensities, the dipole moment, or the partial charges of each atom?

Any additional information provided can help build models that better understand quantum chemistry.

nakatamaho commented 2 years ago

@DomInvivo See http://pubchemqc.riken.jp/b3lyp_2017.html. You can also try this Docker image.

weihua916 commented 2 years ago

As for PCQM4M-v2, we are planning to include only the HOMO-LUMO gap and the 3D structures. We want to strike a good balance between the practical usefulness of the task and the generality of the task.

DomInvivo commented 2 years ago

I understand that the HOMO-LUMO is a more important task for a benchmark. However, I believe it is not the only useful property to learn in general. I am only asking for the other properties since they can be used to train models for other projects, without being used for the benchmark.

Basically, running DFT on 4M molecules requires a tremendous amount of computation that not everyone has at their disposal. But since you have already computed it, I am kindly asking that the full DFT results be open-sourced, perhaps in a CSV file outside of OGB.

I hope you understand my request.

nakatamaho commented 2 years ago

Hi DomInvivo,

Please see http://pubchemqc.riken.jp/ for details. You may also be interested in http://pubchemqc.riken.jp/b3lyp_2017.html.

Regards, Nakata Maho

weihua916 commented 2 years ago

I understand your request, and I suggest you go to the original source and construct your desired data yourself. For OGB-LSC, we extract the part of the data that's relevant to our task, and it would be a bit too much work for us to accommodate requests beyond OGB-LSC.

DomInvivo commented 2 years ago

Thanks @nakatamaho. I didn't realize that PubChemQC had re-computed the original 2017 dataset using B3LYP; I thought you were sharing the PM6-computed compounds. This is exactly what I was looking for!