Incorrect HDF5 layout when a SMILES contains `/`

raimis commented 2 years ago

A SMILES of trans-butene (C/C=C/C) results into nested groups:

$ h5ls SPICE.hdf5/C
C=C                      Group

$ h5ls SPICE.hdf5/C/C=C
C.[Na+]                  Group

$ h5ls SPICE.hdf5/C/C=C/C.[Na+]
atomic_numbers           Dataset {13}
conformations            Dataset {9, 13, 3}
dft_total_energy         Dataset {9}
dft_total_gradient       Dataset {9, 13, 3}
formation_energy         Dataset {9}
smiles                   Dataset {1}
subset                   Dataset {1}

peastman commented 2 years ago

Thanks! It looks to me like there's no way to create a group with / in its name. I suggest replacing / with -. For consistency we should probably replace \ as well. Stereochemistry is optional in SMILES strings, so it still represents the same molecule. Does that sound reasonable?

jchodera commented 2 years ago

Stereochemistry is optional in SMILES strings, so it still represents the same molecule.

Unless we sample all possible stereochemistries for every molecule, the configuration space of these will be limited by the E/Z stereochemistry specified in this manner.

More broadly, why use the SMILES to name groups or objects anyway? SMILES is just one useful representation---we may also want to store other representations (most importantly, an atom index mapped SMILES so we know which atom is which. InChI and IUPAC names could also be of interest.

Then it's easy to traverse the groups and find molecules of interest via multiple different properties. The h5py API even provides a handy visit() method for making this simple.

peastman commented 2 years ago

The group names are purely for user convenience. For PubChem molecules, they're substance IDs. For dipeptides, they're amino acid sequences. In all cases, the canonical mapped SMILES representation is present in a dedicated field. See https://github.com/openmm/spice-dataset/blob/main/downloader/README.md for more details.

jchodera commented 2 years ago

How will this work if we have both stereochemistries in a dataset? Will you add unique suffixes to ensure there are no collisions?

raimis commented 2 years ago

I suggest replacing / with -. For consistency we should probably replace \ as well. Stereochemistry is optional in SMILES strings, so it still represents the same molecule. Does that sound reasonable?

In many case (including but-2-ene) tras and cis isomers can be separated experimentally. So for a practical propose, they are two different molecules.

The group names are purely for user convenience. For PubChem molecules, they're substance IDs. For dipeptides, they're amino acid sequences. In all cases, the canonical mapped SMILES representation is present in a dedicated field.

SMILES are a bit unwieldy as the group names due to the non-alphanumeric characters. Maybe, we can use hashes of SMILES instead.

peastman commented 2 years ago

The downloader script doesn't choose what the group names should be. It just uses whatever names are stored in QCArchive. Those names were chosen when the datasets were created. Whenever possible, the names are something more informative. SMILES strings were only used when we didn't have anything better, like for the DES molecules.

In many case (including but-2-ene) tras and cis isomers can be separated experimentally.

We still have the canonical SMILES string in a separate field, so no information is being lost.

raimis commented 2 years ago

If you replace \ and / with -, potentially there maybe be two different molecules with the same name. What the downloader is going to do?

Override the previous molecule
Create two molecules with the same
Crash...

peastman commented 2 years ago

C/C=C/C is currently the only group that contains a /. There are no groups that contain \, and there's no group called C-C=C-C. This won't create any conflicts.

If new data should be added in the future that does have a conflicting name, create_group() will throw an exception. That's an issue for any future dataset with a molecule name that matches an existing one, since all group names are required to be unique. It's up to us to make sure we don't reuse names in future datasets.

By the way, I'm confused: where did that molecule come from? It's not a DES370K molecule, and the only other dataset that uses SMILES strings as names is the ion pairs.

peastman commented 2 years ago

It's coming from the DES370K supplement set. It has records for the name C/C=C/C.[Na+]. But that isn't the name used for the molecule in the input files. It should be just CC=CC, not C/C=C/C. @pavankum any idea where that name came from?

pavankum commented 2 years ago

@peastman the supplement set is a resubmission of failed calculations from DES370K without mbis property calculation. The dataset is prepared by appending the molecules from the failed run inputs. That process might have created a different name with the smiles string. This is the failed molecule's smiles in DES370K original submission, and there are a few neighbors with that substructure. Am I understanding it right that the surrounding molecules with C/C=C/C are grouped as CC=CC in original submission but not in the supplement?

peastman commented 2 years ago

Compare it to the original input file. The names have gotten changed, and every instance of CC=CC has turned into C/C=C/C. We can work around it in the downloader, but it would be good to figure out why it happened. The molecule names in the input files are meant to be used directly, not interpreted as SMILES strings, because in a lot of cases they aren't SMILES strings.

pavankum commented 2 years ago

Ahh, okay. Thank you for raising that issue, I will check qcsubmit's hdf5 processing and why the name is not propagated throughout.

jchodera commented 2 years ago

Compare it to the original input file https://github.com/openmm/spice-dataset/blob/main/des370k/des370K.hdf5. The names have gotten changed, and every instance of CC=CC has turned into C/C=C/C. We can work around it in the downloader, but it would be good to figure out why it happened.

If we retrieved the SMILES via the OpenFF Molecule.to_smiles() method, we only use canonical isomeric SMILES, so it likely took the molecular identity from the stereochemistry if the first snapshot that was processed in this case.

If we want to use a different original identifier to name things, then we probably have to preserve original molecule titles through the QCFractal submission process.

Message ID: @.***>

peastman commented 2 years ago

Actually, I think there's something more going on. Here is how various sources refer to that molecule.

In the original input file, it's "CC=CC".

In the submission, it's "C\C=C\C".

Using FractalClient to retrieve the results for the main DES370K dataset, it's "cc=cc" (lower case).

Using FractalClient to retrieve the results for the supplement, it's "C\C=C\C".

How does the name returned by FractalClient get chosen? It's different for the supplement than for the main dataset, and both of them are different from what was in the original file.

openmm / spice-dataset

Incorrect HDF5 layout when a SMILES contains `/` #30