.process() Function for QM9 Gives Incorrect Featurization

RafiBrent commented 1 year ago

🐛 Describe the bug

The .process() function of the QM9 dataset incorrectly sets several features to zero for all graphs in the dataset. Specifically, the one-hot encoding for hybridization state, the binary feature indicating whether an atom is aromatic, and the binary feature indicating whether a bond is aromatic are all uniformly zero across the dataset.

import torch
from torch_geometric.datasets import QM9

dataset = QM9('/path/to/dataset')
dataset.process()
dataset_processed = torch.load('/path/to/dataset/processed/data_v3.pt')
print(torch.max(dataset_processed[0].x[:, 6:10]))
print(torch.max(dataset_processed[0].edge_attr[:, 3]))

The above code will print the zero tensor twice, demonstrating that something has gone wrong with the .process() function.
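To look for affected features without hand-picking column ranges, one option is to scan every column for values that never leave zero. The following is only a minimal sketch: `all_zero_columns` is a hypothetical helper, and plain Python lists stand in for the `x` or `edge_attr` tensors so that it runs without PyG installed. On a real tensor, `(x == 0).all(dim=0)` should give the same per-column answer.

```python
def all_zero_columns(rows):
    """Return indices of feature columns that are zero in every row."""
    if not rows:
        return []
    num_cols = len(rows[0])
    return [
        col for col in range(num_cols)
        if all(row[col] == 0 for row in rows)
    ]


# Toy node-feature matrix (made-up values): columns 2 and 3 are never set.
x = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]
print(all_zero_columns(x))  # [2, 3]
```

A column that is zero for every node in the whole dataset is a strong hint that the feature was never computed, rather than that it is genuinely absent from the molecules.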

Furthermore, the current .process() function omits Hydrogen atoms for certain molecules, a more serious mistake that actually affects the underlying molecular graph. An example of this can be seen as follows:

lbound = dataset_processed[1]['x'][23]
ubound = dataset_processed[1]['x'][24]
print(dataset_processed[0].x[lbound:ubound, :])
print(dataset_processed[0].smiles[23])

The above code prints a feature set corresponding to only five nodes, one of which is a Hydrogen, while from the SMILES string it is clear that there should actually be eight nodes, four of which are Hydrogens.
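For anyone unfamiliar with the indexing used here: the processed file stores all node features in one concatenated matrix, and the `'x'` entry of the slices dictionary holds cumulative offsets, so molecule `i` occupies rows `offsets[i]` to `offsets[i + 1]`. A minimal sketch of that bookkeeping, with a plain list of made-up offsets standing in for the real tensor and `nodes_per_graph` as a hypothetical helper:

```python
def nodes_per_graph(boundaries):
    """Convert cumulative node offsets into a node count for each graph."""
    return [end - start for start, end in zip(boundaries, boundaries[1:])]


# Hypothetical offsets for three molecules with 5, 4 and 8 atoms.
boundaries = [0, 5, 9, 17]
counts = nodes_per_graph(boundaries)
print(counts)  # [5, 4, 8]

# Node rows of graph i live in x[boundaries[i]:boundaries[i + 1]].
i = 2
print((boundaries[i], boundaries[i + 1]))  # (9, 17)
```

Comparing these per-molecule counts against the atom count implied by each SMILES string (hydrogens included) is one way to find every graph that is missing nodes.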

After some experimentation, I have found that the above problems can be solved by performing several of the RDKit operations in the .process() function using a SMILES string rather than an output of the SDMolSupplier. This approach provides nonzero values as appropriate for the previously-incorrect features. It also appears to resolve the aforementioned error regarding incorrectly-omitted Hydrogens (although I plan to test this more thoroughly). This method does throw an exception for approximately 1% of molecules in the original QM9 dataset, due to failures of RDKit to add Hydrogens to certain SMILES strings, and is therefore an imperfect fix. However, I believe that the resulting slightly-smaller version of QM9 would be significantly more valuable for research purposes than the current version with incorrect features and some incorrect molecular graphs. I am happy to share the code for my workaround or to provide any information that would be helpful for finding a more optimal solution.
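To illustrate the general idea (this is not the actual workaround code, which is not shown in this thread), here is a rough sketch of reading aromaticity and hybridization from a molecule built from a SMILES string. It assumes RDKit is installed; `featurize_from_smiles` is a hypothetical helper, and raising `ValueError` on a parse failure is my own choice for the example.

```python
from rdkit import Chem
from rdkit.Chem.rdchem import HybridizationType


def featurize_from_smiles(smiles):
    """Return per-atom and per-bond flags for a molecule parsed from SMILES."""
    # MolFromSmiles sanitizes by default, which perceives aromaticity and
    # hybridization; it returns None when the string cannot be parsed.
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse {smiles!r}")
    mol = Chem.AddHs(mol)  # turn implicit hydrogens into explicit atoms

    atoms = [
        (atom.GetSymbol(), atom.GetIsAromatic(), atom.GetHybridization())
        for atom in mol.GetAtoms()
    ]
    bonds = [bond.GetIsAromatic() for bond in mol.GetBonds()]
    return atoms, bonds


atoms, bonds = featurize_from_smiles("c1ccccc1")  # benzene
print(len(atoms))  # 12 atoms: 6 carbons plus 6 hydrogens
print(sum(bonds))  # 6 aromatic ring bonds
print(all(h == HybridizationType.SP2 for s, _, h in atoms if s == "C"))
```

In a real `.process()` replacement, molecules that fail at either the parsing or the hydrogen-adding step would presumably be skipped, which is where the roughly 1% loss mentioned above would come from.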

Environment

PyG version: 2.4.0
PyTorch version: 2.1.0
OS: Linux
Python version: 3.10.12
CUDA/cuDNN version: CUDA 11.8
How you installed PyTorch and PyG (conda, pip, source): pip
Any other relevant information (e.g., version of torch-scatter): torch-cluster-1.6.3, torch-scatter-2.1.2, torch-sparse-0.6.18, torch-spline-conv-1.2.2

rusty1s commented 1 year ago

Thanks for the issue. Last time I checked, our QM9 dataset matched exactly what, e.g., DimeNet uses, so maybe it is a good idea to provide two datasets: (1) one without any engineered features that uses the full dataset (as processed by DimeNet, etc.), and (2) one with engineered features that operates on a smaller version as a result.

WDYT?

RafiBrent commented 12 months ago

Thanks for your response, and that sounds like a good solution. It would also be helpful to update the documentation for both datasets, and I can provide any information that would be useful for that purpose. Some basic stats to include are that 1403 molecules in QM9 are absent from QM9_Featurized and that (astonishingly) 25715 molecular graphs from the original QM9 were missing some number of Hydrogens.

I was going to prepare a pull request using my SMILES-based workaround, but during testing I found a much easier fix (only modifies three lines of code!). Happy to still submit a PR or just list the changes in this thread, whichever would be easier.

rusty1s commented 12 months ago

If you can submit a PR and we go from there, this would be highly appreciated :) Thanks in advance.

xnuohz commented 12 months ago

FYI, DGL also implements two QM9 datasets, which may be related to this:
https://docs.dgl.ai/generated/dgl.data.QM9Dataset.html#dgl.data.QM9Dataset
https://docs.dgl.ai/generated/dgl.data.QM9EdgeDataset.html#dgl.data.QM9EdgeDataset

RafiBrent commented 12 months ago

Thanks for mentioning this @xnuohz , I wasn't aware of the other version of the dataset. The distinction between QM9Dataset and QM9EdgeDataset appears to be a bit different from the issue I noticed, but it could potentially be valuable to include an analogue of the DGL QM9Dataset in PyG as well.

@rusty1s I just submitted the PR. One thing worth noting is that the downloadable version of the processed dataset (for users without RDKit installed) will need to be uploaded to match. I've changed the URL to 'https://data.pyg.org/datasets/qm9_v3_featurized.zip' but (as far as I'm aware) I can't directly upload a zip file to that location.

pyg-team / pytorch_geometric

.process() Function for QM9 Gives Incorrect Featurization #8370
