open-reaction-database / ord-schema

Schema for the Open Reaction Database
https://open-reaction-database.org
Apache License 2.0
95 stars 27 forks source link

Option `ValidationOptions(allow_reaction_smiles_only=False)` has no effect in `validate_message` or `validate_datasets` #691

Closed alexarnimueller closed 1 year ago

alexarnimueller commented 1 year ago

Hi all

First: thanks for all your efforts on the ord-schema, it's well developed and very powerful! 💪🏼 🙌🏼

However, I recently ran into an issue when dealing with catalyst SMILES, that cannot be converted into valid molecules using the RDKit. I see that there is the option to disable valid reaction SMILES checks during the validation using ValidationOptions(allow_reaction_smiles_only=False) and passing it to validate_message or validate_datasets. However, this has no effect on any of my tested reactions or datasets.

To reproduce, here's a simple example for a dataset with one reaction:

reaction = message_helpers.load_message(filename, dataset_pb2.Dataset)
options = validations.ValidationOptions(validate_ids=False, require_provenance=False, allow_reaction_smiles_only=False)
validations.validate_message(reaction, options=options)
ValidationError: Dataset.reactions[0].inputs["12148-71-9"].components[0].identifiers[0]: RDKit 2023.03.2 could not validate SMILES identifier [O-]1(C)[Ir+]234([O-](C)[Ir+]1567[CH]=8CC[CH]7=[CH]6CC[CH]85)[CH]=9CC[CH]4=[CH]3CC[CH]92

It also does not work for validate_dataset:

validations.validate_datasets({"_TEST": reaction}, options=options)
_TEST: Reaction.inputs["12148-71-9"].components[0].identifiers[0]: RDKit 2023.03.2 could not validate SMILES identifier [O-]1(C)[Ir+]234([O-](C)[Ir+]1567[CH]=8CC[CH]7=[CH]6CC[CH]85)[CH]=9CC[CH]4=[CH]3CC[CH]92

I would expect that checking SMILES strings using RDKit is disabled when passing the option allow_reaction_smiles_only=False. I know that in this case I could actually get a valid SMILES for the Iridium catalyst, but its mainly for illustration purposes. As a workaround, I can just omit the SMILES for the catalysts, but I'd like to have them there if possible. Or would this become an issue to ultimately load them into the ORD?

Thanks in advance for the support

Versions

ord-schema         0.3.37
protobuf           3.19.6
rdkit              2023.3.2
connorcoley commented 1 year ago

Thanks, Alex. This is indeed a default part of validate_compound_identifier and we issue a ValidationError for a SMILES, InChI, or MolBlock that RDKit cannot load "properly".

This does seem like it is potentially too strict given that other programs may have more permissive valence rules, particularly for these kinds of complexes where SMILES is a somewhat awkward choice of representation (but a common one).

I would propose to @skearnes that we consider the following:

WDYT?

alexarnimueller commented 1 year ago

Thanks Connor, sounds like a good compromise 👍🏼

connorcoley commented 1 year ago

This should be all set now!

alexarnimueller commented 1 year ago

Sweet, thanks a lot!