sustainable-processes / pura

Clean chemical data quickly
MIT License

How to resolve organometallics #9

Open marcosfelt opened 1 year ago

marcosfelt commented 1 year ago

From @rvanputt:

Resolution of complex ligands and metal complexes was more difficult. There is extensive diversity in naming here, so this should indeed be more difficult. I tried systematic names, CAS#, and trade names such as SL-J001-1, but unfortunately this didn’t yet work for most structures. I suspect including CAS and ChemSpider should improve this, as they have entries for many of these chemicals. (Could you help with that?)

I think the key here is (1) enabling more services and (2) relaxing the agreement algorithm in some cases to handle when only one service can find a compound (see #7).

Taking SL-J001-1 as an example, only PubChem could resolve this name out of the currently available services (PubChem, CIR, CAS, ChemSpider, OPSIN). Even when I looked up SL-J001-1 by its CAS number on Common Chemistry, which should be the definitive source for CAS numbers, nothing was returned.

In terms of more services, here are some ideas:

@rvanputt, could you sample some of your difficult organometallics and see if they are available on Sigma or Solvias? If so, I'll look into writing a service for one of those.

rvanputt commented 1 year ago

Regarding SL-J001-1: I get a partial match in ChemSpider (they have 'Josiphos SL-J001-1' in their record), and a good match in CAS SciFinder with substance search. I'm surprised Common Chemistry doesn't know this ligand. Is it possible SciFinder has access to a different CAS database?

Addition of Sigma-Aldrich and Solvias catalogues would definitely help, but it might not be a silver bullet. Their products should already be in at least one of the databases. That makes me wonder if it is more a matter of finding them. This could, e.g., be done by allowing partial name matches for trade names or by using look-up tables for slightly different spellings ((R,R)-iPr-DuPhos, (R,R)-i-Pr-DuPhos, etc.). Again, this information is available in SciFinder, see below. The question is how to get to it. Or do you think that won't be feasible?

image

marcosfelt commented 1 year ago

I like the idea of doing different lookup names. Maybe we could have an option in the resolution algorithm to automatically replace certain "phrases" in the name with alternative spellings if the lookup fails.
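A minimal sketch of that fallback idea, assuming a small hand-written substitution table. The table contents and the function name `name_variants` are hypothetical and not part of pura; the point is only to show how alternative spellings could be enumerated and then tried one by one when the first lookup fails.

```python
import re
from typing import Dict, List

# Hypothetical substitution table: each regex is tried with every listed spelling.
ALTERNATIVE_SPELLINGS: Dict[str, List[str]] = {
    r"i-?Pr": ["iPr", "i-Pr"],
    r"DuPhos": ["DuPhos", "DUPHOS"],
}


def name_variants(name: str) -> List[str]:
    """Return the original name followed by variants with alternative spellings."""
    variants = [name]
    for pattern, replacements in ALTERNATIVE_SPELLINGS.items():
        new_variants: List[str] = []
        for variant in variants:
            for replacement in replacements:
                candidate = re.sub(pattern, replacement, variant, flags=re.IGNORECASE)
                if candidate not in variants and candidate not in new_variants:
                    new_variants.append(candidate)
        variants.extend(new_variants)
    return variants


variants = name_variants("(R,R)-iPr-DuPhos")
```

Each variant would then be passed to the resolver in turn, stopping at the first one that reaches the required agreement.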

An additional idea is to have a lookup table of pre-catalysts since there are a smaller number of those. I wrote something that did this for a project with all Ruthenium catalysts last year. I just used regex:

# Regex patterns for ruthenium pre-catalyst names, each mapped to a SMILES for the metal fragment.
regex_lookup = {
    r"^(Chloro)([\{\}\[\]\+\-\(\),'a-zA-Z\dηα]+)ruthenium\(II\) chloride": "Cl[Ru]Cl",
    r"^(Chloro)([\{\}\[\]\+\-\(\),'a-zA-Z\dηα]+)\(p-cymene\)ruthenium\(II\) chloride": "Cl[Ru]Cl.CC(C)c1ccc(C)cc1",
    r"^(Dichloro)([\[\]\+\-\(\),'a-zA-Z\dηα]+)ruthenium\(II\)": "Cl[Ru]Cl",
    r"^(Dichloro)([\[\]\+\-\(\),'a-zA-Z\dηα]+)(?:\(p-cymene\))ruthenium\(II\)": "Cl[Ru]Cl.CC(C)c1ccc(C)cc1",
    r"(Diacetato)([\{\}\[\]\+\-\(\),'a-zA-Z\dηα]+)ruthenium\(II\)": "[O-]CC(=O)[Ru]CC(=O)[O-]",
    r"(Chloro)([\[\]\+\-\(\),'a-zA-Z\d]+)\(mesitylene\)ruthenium\(II\)": "Cc1cc(cc(c1)C)C.Cl[Ru]",
    r"(Chloro)([\[\]\+\-\(\),'a-zA-Z\d]+)\(pyridine\)ruthenium\(II\)": "c1ccncc1.Cl[Ru]Cl",
    r"(Chloro)([\[\]\+\-\(\),'a-zA-Z\d]+)ruthenium\(II\)": "Cl[Ru]",
    r"([\[\]\+\-\(\),'a-zA-Z\d]+)\(p-cymene\)ruthenium\(II\) chloride": "[Ru]Cl",
    r"^(Chloro)([\{\}\[\]\+\-\(\),'a-zA-Z\dηα]+)\(p-cymene\)ruthen(?:ium\(II\))?": "Cl[Ru].CC(C)c1ccc(C)cc1",
    r"^(Chloro)([\{\}\[\]\+\-\(\),'a-zA-Z\dηα]+)ruthenium\(II\)": "Cl[Ru]",
}
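For illustration, here is one way such a table could be applied. This is a sketch, not the code from that project: it copies two of the patterns (with the character class written as a-zA-Z), and the names `EXAMPLE_LOOKUP` and `lookup_precatalyst` are made up for this example. It assumes the first matching pattern wins, so the more specific (p-cymene) pattern is listed first.

```python
import re
from typing import Dict, Optional

# Two patterns taken from the table above; the more specific one comes first,
# because the ligand group can otherwise absorb "(p-cymene)".
EXAMPLE_LOOKUP: Dict[str, str] = {
    r"^(Chloro)([\{\}\[\]\+\-\(\),'a-zA-Z\dηα]+)\(p-cymene\)ruthenium\(II\) chloride": "Cl[Ru]Cl.CC(C)c1ccc(C)cc1",
    r"^(Chloro)([\{\}\[\]\+\-\(\),'a-zA-Z\dηα]+)ruthenium\(II\) chloride": "Cl[Ru]Cl",
}


def lookup_precatalyst(name: str, table: Dict[str, str] = EXAMPLE_LOOKUP) -> Optional[str]:
    """Return the SMILES of the first pattern that matches, or None if nothing matches."""
    for pattern, smiles in table.items():
        if re.search(pattern, name):
            return smiles
    return None


fragment = lookup_precatalyst(
    "Chloro[(S)-2,2'-bis(diphenylphosphino)-1,1'-binaphthyl](p-cymene)ruthenium(II) chloride"
)
```

Because dictionaries preserve insertion order, the ordering of the entries doubles as the matching priority in this sketch.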

Going even further, it may be possible to train an ML model to do this automatically (because writing regex strings takes a lot of effort, even if you know what you're doing). I wonder if we could come up with a dataset of a bunch of pre-catalysts and ligands and then just create a dataset from the enumeration of them with different spellings? We could then basically just fine-tune this model on that data.

rvanputt commented 1 year ago

I think both the look-up and regex direction (when coupled to ML) might be viable options for a majority of common cases and organic fragments. For metal precursors and chiral ligands I'm not fully sure, though. Again, this is because naming of metal precursors is hopelessly unsystematic (same problem as the chiral ligands). This means there will be a lot of different names, and getting sufficient agreement will be difficult. For example, consider Rh(NBD)2BF4. A quick search on SciFinder, PubChem, and ChemSpider (by CAS#) returned a lot of known aliases:

36620-11-8
Rh(nbd)2BF4
Rhodium(1+), bis[(2,3,5,6-η)-bicyclo[2.2.1]hepta-2,5-diene]-, tetrafluoroborate(1-) (1:1) (ACI)
Bicyclo[2.2.1]hepta-2,5-diene, rhodium complex (ZCI)
Borate(1-), tetrafluoro-, bis[(2,3,5,6-η)-bicyclo[2.2.1]hepta-2,5-diene]rhodium(1+) (ZCI)
Rhodium(1+), bis[(2,3,5,6-η)-bicyclo[2.2.1]hepta-2,5-diene]-, tetrafluoroborate(1-) (9CI)
(NBD)2RhBF4
Bis(bicyclo[2.2.1]hepta-2,5-diene)rhodium tetrafluoroborate
Bis(norbornadiene)(tetrafluoroborato)rhodium
Bis(norbornadiene)rhodium tetrafluoroborate
Bis(norbornadiene)rhodium(1+) tetrafluoroborate
Bis(norbornadiene)rhodium(I) tetrafluoroborate
[Rh(NBD)2]BF4
[rh(norbornadiene)2]bf4
Chiralyst P374
bis[??-(2,5-norbornadiene)]rhodium(i) tetrafluoroborate
Bis[η-(2,5-norbornadiene)]rhodium(I) Tetrafluoroborate
[Bis[eta-(2,5-norbornadiene)]rhodium(I) Tetrafluoroborate]
MFCD00671775

This made me think about how to leverage the services that are already available. I see that they probably contain what we're looking for, but it is a matter of using the right query to get to it.

In order to do that, what do you think about using an iterative approach? Think of it as a two-stage rocket. First, a given query is given to all services. Chances are it will be found in at least one of these (a ranked output would probably also be helpful here). This service will probably also have a record of its CAS number or other alias. This CAS# or alias can then be used to query the other services for a second round in order to get sufficient agreement.
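The two-stage idea can be sketched with toy, in-memory services. Everything here is hypothetical (the `Record` class, the `Service` callable interface, `two_stage_resolve`, and the placeholder SMILES); it does not use pura's actual classes and only illustrates the control flow of querying by name first and then retrying with the CAS number.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Record:
    """What a toy service returns: a structure plus an optional CAS number."""
    smiles: str
    cas: Optional[str] = None


# Hypothetical service interface: a callable mapping a query string to a Record or None.
Service = Callable[[str], Optional[Record]]


def make_service(table: Dict[str, Record]) -> Service:
    """Build an in-memory stand-in for a web service."""
    return lambda query: table.get(query)


def two_stage_resolve(query: str, services: List[Service], agreement: int = 2) -> Optional[str]:
    # Stage 1: send the original query to every service.
    results = [service(query) for service in services]
    # Collect CAS numbers reported by the services that found something.
    cas_numbers = [r.cas for r in results if r is not None and r.cas]
    # Stage 2: retry the services that missed, using the CAS numbers.
    for i, service in enumerate(services):
        if results[i] is None:
            for cas in cas_numbers:
                results[i] = service(cas)
                if results[i] is not None:
                    break
    # Count identical SMILES across services (a real version would canonicalize first).
    counts = Counter(r.smiles for r in results if r is not None)
    if not counts:
        return None
    smiles, votes = counts.most_common(1)[0]
    return smiles if votes >= agreement else None


# Toy data: service_a knows the name, service_b only knows the CAS number.
service_a = make_service({"Rh(nbd)2BF4": Record("PLACEHOLDER_SMILES", cas="36620-11-8")})
service_b = make_service({"36620-11-8": Record("PLACEHOLDER_SMILES")})
result = two_stage_resolve("Rh(nbd)2BF4", [service_a, service_b])
```

With only the name, a single service answers; after the second round both agree, so the agreement threshold of two is met.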

Advantage of this approach is that it relies on the code you've written so far. Of course, it also comes with a risk of false positives, but if we can prioritise CAS# for the second round, I think this should be fairly minimal.

What do you think?

marcosfelt commented 1 year ago

That's brilliant! I think I could have an argument like "backup_identifier_type" (not sure about the name yet) that would be used as a fallback like you've described. I'll try to implement that today!


rvanputt commented 1 year ago

Cool! Curious how well it will work and what the hit rate is going to be. Let me know if you want more things to test beyond the pptx. The examples in there are already difficult ones, so if the new approach can do those, I'm hopeful it will be able to handle the rest.

marcosfelt commented 1 year ago

Okay perfect!


marcosfelt commented 1 year ago

@rvanputt Could you take a look at this PR and see if it's what you were thinking: https://github.com/sustainable-processes/pura/pull/14

rvanputt commented 1 year ago

Looks good! I'm curious to see how well it works!

marcosfelt commented 1 year ago

I just pushed a new version to PyPI with these changes, so you can install that (instead of directly from source):

pip install -U pura

To use the new feature, add a backup_identifier_types keyword argument to resolve_identifiers:

from pura.services import PubChem, CIR
from pura.resolvers import resolve_identifiers
from pura.compound import CompoundIdentifierType

resolved = resolve_identifiers(
    ["Josiphos SL-J001-1"],
    input_identifer_type=CompoundIdentifierType.NAME,
    output_identifier_type=CompoundIdentifierType.SMILES,
    backup_identifier_types=[CompoundIdentifierType.INCHI_KEY],
    services=[
        PubChem(),
        CIR(),
    ],
    agreement=2,
)
print(resolved)

rvanputt commented 1 year ago

I have just tested the new code with a small set of chiral ligands and metal precursors, using PubChem, CIR, and ChemSpider. It is now able to resolve more than before - very good! I've also seen some behaviour that I don't yet fully understand that I think is interesting.

First of all, the test set and code:

from pura.services import PubChem, CIR, ChemSpider
from pura.resolvers import resolve_identifiers
from pura.compound import CompoundIdentifierType

resolved = resolve_identifiers(
    [
    "(R,R)-Dipamp",
    "(R,R)-Et-DUPHOS",
    "(R)-Phanephos",
    "(R,R)-Et-BPE",
    "Taniaphos SL-T001-1",
    "(R,R)-BenzP*",
    "(R)-DM-BINAP",
    "(R)-DTBM-SEGPHOS",
    "Rh(nbd)2BF4",
    "(p-Cymene)ruthenium(II) chloride dimer",
    ],
    input_identifer_type=CompoundIdentifierType.NAME,
    output_identifier_type=CompoundIdentifierType.SMILES,
    backup_identifier_types=[CompoundIdentifierType.INCHI_KEY],
    services=[
        PubChem(),
        CIR(),
        ChemSpider(token="token"),
    ],
    agreement=2,
)

(R,R)-Dipamp, (R)-Phanephos, (R,R)-Et-BPE, Taniaphos SL-T001-1, (R)-DM-BINAP, (R)-DTBM-SEGPHOS, and (p-Cymene)ruthenium(II) chloride dimer resolve well. I'm not sure whether this is because I included ChemSpider or because the back-up resolution works well; I did not test that. But in either case, this is a good result.

The other entries somehow were more difficult, despite these exact queries being present in at least one of the databases. For example, DuPhos and (R,R)-BenzP* are very common ligands with unambiguous structures or names. Perhaps unsurprisingly, SL-T001-1 alone did not work, but the inclusion of Taniaphos helped. This is not ideal (no partial matches then?), but can be addressed in a pre-cleaning step because Solvias' encoding is straightforward (SL-J00X for JosiPhos, SL-T00X for TaniaPhos, W00X for WalPhos, M00X for Mandyphos, etc.). I have not tested how it handles the BIPHEPs (SL-A10X-X) and ChenPhos ligands.
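A pre-cleaning step along those lines might look like the sketch below. The family letters come from the comment above; the exact code shape (an `SL-` prefix, one letter, three digits, a numeric suffix) and the names `SOLVIAS_FAMILIES` and `expand_solvias_code` are assumptions made for this example, and the mapping is not exhaustive.

```python
import re
from typing import Dict

# Family letters as described above; illustrative, not a complete Solvias catalogue.
SOLVIAS_FAMILIES: Dict[str, str] = {
    "J": "Josiphos",
    "T": "Taniaphos",
    "W": "Walphos",
    "M": "Mandyphos",
}

# Assumed code shape: SL-<letter><three digits>-<digits>, e.g. SL-J001-1.
SOLVIAS_CODE = re.compile(r"^SL-([A-Z])(\d{3})-(\d+)$")


def expand_solvias_code(query: str) -> str:
    """Prefix a bare Solvias code with its ligand family name; leave other queries untouched."""
    stripped = query.strip()
    match = SOLVIAS_CODE.match(stripped)
    if match is None:
        return query
    family = SOLVIAS_FAMILIES.get(match.group(1))
    if family is None:
        return query
    return f"{family} {stripped}"


expanded = expand_solvias_code("SL-T001-1")
```

Queries that already carry the family name, or that use a family letter missing from the table, pass through unchanged.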

Frustratingly, Rh(NBD)2BF4 still is not resolved. I'm not sure what the problem is. It is present in PubChem with exactly that naming. I get an error that the SMILES string is not electrically neutral, but then it just gives up.

I noticed two more things. First, execution of the queue stops when the current query cannot be resolved. This means that the code above won't run until all bad entries are commented out. Not mission critical, but a bit annoying. Makes me want to write a loop to call the code one-by-one... :)

Finally, I played around a bit with using different output- and backup identifier types. Here I am a bit puzzled. Using INCHI and INCHI_KEY works well most of the time. (I suspect not all services have this information for all examples above, even though I haven't checked.) But when using CAS_NUMBER for either identifier, unfortunately I get errors. Last output:

File "/.../pura/services/pubchem.py", line 165, in get_properties
    properties = ",".join([PROPERTY_MAP.get(p, p) for p in properties])
TypeError: sequence item 0: expected str instance, NoneType found

When I remove PubChem from the services and try it with the other two, the error changes

ValueError: [<CompoundIdentifierType.CAS_NUMBER: 7>] are invalid output identifier type(s) for ChemSpider

Does that make any sense to you?

All in all, I think this is a leap towards automatically resolving these more difficult compounds! I'm curious to hear what you think about these results. And fingers crossed you know how to work with CAS#!

marcosfelt commented 1 year ago

Thanks for the detailed feedback!

The other entries somehow were more difficult, despite these exact queries being present in at least one of the databases. For example, DuPhos and (R,R)-BenzP* are very common ligands with unambiguous structures or names. Perhaps unsurprisingly, SL-T001-1 alone did not work, but the inclusion of Taniaphos helped. This is not ideal (no partial matches then?), but can be addressed in a pre-cleaning step because Solvias' encoding is straightforward (SL-J00X for JosiPhos, SL-T00X for TaniaPhos, W00X for WalPhos, M00X for Mandyphos, etc.). I have not tested how it handles the BIPHEPs (SL-A10X-X) and ChenPhos ligands.

I think this is an issue of there not being exact matches (e.g., DuPhos vs (S,S)-I-Pr-DUPHOS). I'll look into the partial names question in this issue: #18 .

Frustratingly, Rh(NBD)2BF4 still is not resolved. I'm not sure what the problem is. It is present in PubChem with exactly that naming. I get an error that the SMILES string is not electrically neutral, but then it just gives up.

Hopefully, the electrically neutral bit is a warning. I checked and Rh(NBD)2BF4 resolves only on PubChem. I want to go back and check to see if I can handle it by fixing some of the other bugs below.

I noticed two more things. First, execution of the queue stops when the current query cannot be resolved. This means that the code above won't run until all bad entries are commented out. Not mission critical, but a bit annoying. Makes me want to write a loop to call the code one-by-one... :)

I failed to mention this earlier: add the keyword argument silent=True. This will allow the algorithm to continue running even if there is an error.

Finally, I played around a bit with using different output- and backup identifier types. Here I am a bit puzzled. Using INCHI and INCHI_KEY works well most of the time. (I suspect not all services have this information for all examples above, even though I haven't checked.) But when using CAS_NUMBER for either identifier, unfortunately I get errors. Last output:

File "/.../pura/services/pubchem.py", line 165, in get_properties
    properties = ",".join([PROPERTY_MAP.get(p, p) for p in properties])
TypeError: sequence item 0: expected str instance, NoneType found

This is a bug which I'll fix (#17)

When I remove PubChem from the services and try it with the other two, the error changes

ValueError: [<CompoundIdentifierType.CAS_NUMBER: 7>] are invalid output identifier type(s) for ChemSpider

Does that make any sense to you?

This error is expected since some services are not compatible with certain CompoundIdentifierType. If you set silent=True, then the error should be skipped over. I'll make sure to mention the use of silent=True in the documentation when I get to it.

rvanputt commented 1 year ago

I failed to mention this earlier: add the keyword argument silent=True. This will allow the algorithm to continue running even if there is an error.

This is nice! Thanks a lot. Let me know if there is anything I can do to test. If all goes well, I might try some of our real set next time.

marcosfelt commented 1 year ago

I just published 0.2.1 to PyPI, which includes a partial name search feature for PubChem and fixes to the bugs mentioned above. So a complete example would look like:

from pura.services import PubChem, CIR, CAS, ChemSpider
from pura.resolvers import resolve_identifiers
from pura.compound import CompoundIdentifierType

resolved = resolve_identifiers(
    ["Josiphos SL-J001-1", "Rh(NBD)2BF4", "DuPhos"],
    input_identifer_type=CompoundIdentifierType.NAME,
    output_identifier_type=CompoundIdentifierType.SMILES,
    backup_identifier_types=[
        CompoundIdentifierType.INCHI_KEY,
        CompoundIdentifierType.CAS_NUMBER,
    ],
    services=[PubChem(autocomplete=True), CIR(), CAS(), ChemSpider()],
    agreement=1,
    silent=True,
)
print(resolved)

To install

pip install -U pura

rvanputt commented 1 year ago

I created a test set and tried to resolve them w/ 0.2.2. This set includes both previous failures and new queries in different formats. Code is below.

from pura.services import PubChem, CIR, ChemSpider
from pura.resolvers import resolve_identifiers
from pura.compound import CompoundIdentifierType

resolved = resolve_identifiers(    
    [
    "(R,R)-Dipamp",
    "(R,R)-Et-DUPHOS",
    "(R)-Phanephos",
    "(R,R)-Et-BPE",
    "Taniaphos SL-T001-1",
    "(R,R)-BenzP*",
    "(R)-DM-BINAP",
    "(R)-DTBM-SEGPHOS",
    "Rh(nbd)2BF4",
    "(p-Cymene)ruthenium(II) chloride dimer",
    "137219-86-4",
    "(2S)-1-[(1S)-1-(Dicyclohexylphosphino)ethyl]-2-(diphenylphosphino)ferrocene",
    "SL-J002-1",
    "(S)-[6,6'-Dimethoxy(1,1'-biphenyl)-2,2'-diyl]bis{bis[3,5-bis(1,1-dimethylethyl)-4-methoxyphenyl]phosphine}",
    "(R)-(+)-(6,6'-Dimethoxybiphenyl-2,2'-diyl)bis(diphenylphosphine)",
    "(S)-1-{(SP)-2-[2-(Diphenylphosphino)phenyl]ferrocenyl}ethylbis[3,5-bis-(trifluoromethyl)phenyl]phosphine",
    "1-[(RP)-2-(Dicyclohexylphosphino)ferrocenyl]ethyldi-tert-butylphosphine",
    "(R,R)-i-Pr-DUPHOS",
    "(1S,1'S,2R,2'R)-2,2'-Di-tert-butyl-2,3,2',3'-tetrahydro-1H,1'H(1,1')biisophosphindolyl",
    "Mandyphos 4-1",
    "(R)-BIDIME",
    "(S)-(+)-4,12-Bis[di(3,5-xylyl)phosphino]-[2.2]-paracyclophane",
    "[Ir(cod)Cl]2",
    "[Rh(NBD)2]BF4",
    "{(R)-1-[(Sp)-2-(Dicyclohexylphosphino)ferrocenyl]ethyldi-tert-butylphosphine}[2-(2'-amino-1,1'-biphenyl)]palladium(II) methanesulfonate",
    "RuCl(p-cymene)[(S,S)-Ts-DPEN]",
    "Chiralyst Rh1228",
    "Diacetato[(R)-2,2'-bis[di(3,5-xylyl)phosphino]-1,1'-binaphtyl]ruthenium(II)",
    "Chloro[(S)-2,2'-bis(diphenylphosphino)-1,1'-binaphthyl](p-cymene)ruthenium(II) chloride",
    "(R)-AntPhos",
    "(S,S)-BABIBOP",
    "(2R,2'R,3R,3'R)-MeO-BIBOP",
    "(S)-2-(1-Diphenylphosphino-2-methylpropan-2-yl)-4-isopropyl-4,5-dihydrooxazole",
    "1152313-76-2",
    "Ru(Me-allyl)2(COD)",
    "Pd(OAc)2",
    "(R)-ProPhos",
    "(R)-p-OH-BINAP",
    "(R)-DiFluorPhos",
    "(S)-MonoPhosTM", # Trade marked brand name
    "C58H70N2O10P2",
    ],
    input_identifer_type=CompoundIdentifierType.NAME,
    output_identifier_type=CompoundIdentifierType.SMILES,
    backup_identifier_types=[
        CompoundIdentifierType.INCHI_KEY,
        CompoundIdentifierType.CAS_NUMBER,
    ],
    services=
    [
        PubChem(autocomplete=True),
        CIR(),
        ChemSpider(token="token"),
    ],
    agreement=2,
    silent=True,
)

There are some interesting things going on here:

  1. First of all, Pd(OAc)2 misbehaves:

    .../pura/pura_test_v2.py
    Batch:   0%|                                                                                                                                          | 0/1 [00:00<?, ?it/s/.../pura/compound.py:101: UserWarning: Warning: SMILES of a mixture, rather than a pure compound, was found. [00:00<?, ?it/s]
    warnings.warn(
    [12:46:21] SMILES Parse Error: syntax error while parsing: [Pd](|OC(C)=O)|OC(C)=O
    [12:46:21] SMILES Parse Error: Failed parsing SMILES '[Pd](|OC(C)=O)|OC(C)=O' for input: '[Pd](|OC(C)=O)|OC(C)=O'
    Batch:   0%|                                                                                                                                          | 0/1 [00:01<?, ?it/s]
    Traceback (most recent call last):
    File ".../pura/pura_test_v2.py", line 7, in <module>
    resolved = resolve_identifiers(    
    File ".../pura/resolvers.py", line 491, in resolve_identifiers
    return resolver.resolve(
    File ".../pura/resolvers.py", line 237, in resolve
    return loop.run_until_complete(
    File "...//asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
    File ".../pura/resolvers.py", line 295, in _resolve
    resolved_identifiers.extend([await f for f in batch_bar])
    File ".../pura/resolvers.py", line 295, in <listcomp>
    resolved_identifiers.extend([await f for f in batch_bar])
    File ".../asyncio/tasks.py", line 611, in _wait_for_one
    return f.result()  # May raise f.exception().
    File ".../pura/resolvers.py", line 343, in _resolve_one_compound
    standardize_identifier(identifier)
    File ".../pura/compound.py", line 137, in standardize_identifier
    for a in mol.GetAtoms():
    AttributeError: 'NoneType' object has no attribute 'GetAtoms'

  2. Resolving the remaining queries took about an hour (!). It did 1-38 in 19 seconds, 1-39 in 41:46 min, and 1-40 in 01:07 h. Is this reproducible on your end?

  3. The order of compounds in Pura's output is different from my query. Is this on purpose? Doesn't matter so much, but can be a bit annoying because it creates an extra step of re-indexing everything.

  4. The addition of silent=True works well - thanks!

  5. Would it be possible to generate output without the additional text? Of course this is easy to remove, but again it adds an additional step. I mean this: COc1ccccc1P(CCP(c1ccccc1)c1ccccc1OC)c1ccccc1 instead of this: [CompoundIdentifier(identifier_type=<CompoundIdentifierType.SMILES: 2>, value='COc1ccccc1P(CCP(c1ccccc1)c1ccccc1OC)c1ccccc1', details=None)].
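Until the output format changes, the plain strings can be pulled out in one line. The sketch below uses stand-in classes that mirror the fields visible in the printed output (`identifier_type`, `value`, `details`); it assumes pura's real `CompoundIdentifier` exposes `value` as an attribute in the same way, which the repr suggests but which is not verified here. `plain_values` is a hypothetical helper name.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class CompoundIdentifierType(Enum):
    """Stand-in enum; the numeric values match the ones shown in the thread's output."""
    SMILES = 2
    CAS_NUMBER = 7


@dataclass
class CompoundIdentifier:
    """Stand-in with the same fields as the printed repr."""
    identifier_type: CompoundIdentifierType
    value: str
    details: Optional[str] = None


def plain_values(identifiers: List[CompoundIdentifier]) -> List[str]:
    """Keep only the identifier strings, dropping type and details."""
    return [identifier.value for identifier in identifiers]


resolved_identifiers = [
    CompoundIdentifier(
        CompoundIdentifierType.SMILES,
        "COc1ccccc1P(CCP(c1ccccc1)c1ccccc1OC)c1ccccc1",
    )
]
smiles_only = plain_values(resolved_identifiers)
```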

Then, on to the results. Of 41 queries, 14 queries were perfect (non-chiral compound, axial stereochemistry, or stereochemistry available in output):

(p-Cymene)ruthenium(II) chloride dimer
137219-86-4
(R)-(+)-(6,6'-Dimethoxybiphenyl-2,2'-diyl)bis(diphenylphosphine)
(S)-[6,6'-Dimethoxy(1,1'-biphenyl)-2,2'-diyl]bis{bis[3,5-bis(1,1-dimethylethyl)-4-methoxyphenyl]phosphine}
(R)-Phanephos
(S)-(+)-4,12-Bis[di(3,5-xylyl)phosphino]-[2.2]-paracyclophane
(R)-DM-BINAP
(R)-DTBM-SEGPHOS
(R)-DiFluorPhos
(S)-2-(1-Diphenylphosphino-2-methylpropan-2-yl)-4-isopropyl-4,5-dihydrooxazole
Taniaphos SL-T001-1
(R,R)-Et-BPE
(R,R)-i-Pr-DUPHOS
(R)-ProPhos

1 query gave output, but the SMILES was not great (but not Pura's fault):

Ru(Me-allyl)2(COD)

1 query did not run:

Pd(OAc)2

4 queries returned the correct structure, but lost stereochemical information. Three of four contained stereogenic phosphorus atoms.

(R)-AntPhos
(R,R)-Dipamp
(R,R)-BenzP*
1152313-76-2

2 queries produced output that was wrong, despite agreement being set to two.

(1S,1'S,2R,2'R)-2,2'-Di-tert-butyl-2,3,2',3'-tetrahydro-1H,1'H(1,1')biisophosphindolyl
(S)-MonoPhosTM

The first is an alternative name of (S,S,R,R)-DuanPhos, for which Pura found CP(C(C)(C)C)C(C)(C)C. The second one is (S)-MonoPhos, which contains a trade mark identifier. This is quite common when scraping catalogs, etc. Somehow, it resolves to CN(C)p1oc2ccc3ccccc3c2c2c(ccc3ccccc32)o1. This is close, but the backbone is incorrectly saturated. (I see that (S)-MonoPhos itself resolves to the same SMILES, so this might be a win for Pura after all?)
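Since trademark markers are common in scraped catalogue names, one option is to strip them before resolution. This is a heuristic sketch with a hypothetical helper name (`strip_trademark`): it removes a trailing ™ or ® symbol, or a literal "TM" that directly follows a lowercase letter, and deliberately leaves "(R)" alone because that is a stereodescriptor in this context.

```python
import re

# Trailing trademark markers: the symbols U+2122 (TM sign) and U+00AE (registered sign),
# or the letters "TM" glued to a lowercase letter, as in "MonoPhosTM".
TRADEMARK_SUFFIX = re.compile(r"(?:\s*[\u2122\u00ae]|(?<=[a-z])TM)$")


def strip_trademark(name: str) -> str:
    """Remove a trailing trademark marker from a compound name, if present."""
    return TRADEMARK_SUFFIX.sub("", name.strip())


cleaned = strip_trademark("(S)-MonoPhosTM")
```

For (S)-MonoPhos the result was reported to be the same with and without the marker, so this step would matter mainly for names where the extra characters prevent any match.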

The remaining 19 queries did not produce output:

Chloro[(S)-2,2'-bis(diphenylphosphino)-1,1'-binaphthyl](p-cymene)ruthenium(II) chloride
Chiralyst Rh1228
SL-J002-1
(S,S)-BABIBOP
[Ir(cod)Cl]2
Mandyphos 4-1
{(R)-1-[(Sp)-2-(Dicyclohexylphosphino)ferrocenyl]ethyldi-tert-butylphosphine}[2-(2'-amino-1,1'-biphenyl)]palladium(II) methanesulfonate
(2S)-1-[(1S)-1-(Dicyclohexylphosphino)ethyl]-2-(diphenylphosphino)ferrocene
[Rh(NBD)2]BF4
(S)-1-{(SP)-2-[2-(Diphenylphosphino)phenyl]ferrocenyl}ethylbis[3,5-bis-(trifluoromethyl)phenyl]phosphine
(R)-BIDIME
(R,R)-Et-DUPHOS
(2R,2'R,3R,3'R)-MeO-BIBOP
1-[(RP)-2-(Dicyclohexylphosphino)ferrocenyl]ethyldi-tert-butylphosphine
RuCl(p-cymene)[(S,S)-Ts-DPEN]
Rh(nbd)2BF4
Diacetato[(R)-2,2'-bis[di(3,5-xylyl)phosphino]-1,1'-binaphtyl]ruthenium(II)
(R)-p-OH-BINAP
C58H70N2O10P2

This category contains a lot of metal complexes that appear to be difficult. I included two spellings of Rh(NBD)2BF4, but neither was resolved. Interestingly, (R,R)-i-Pr-DUPHOS was ok, but (R,R)-Et-DUPHOS was not. For this category I don't see an obvious reason why it doesn't work. Of course, some queries such as C58H70N2O10P2 were doomed from the start (although this is quite frequent in my real-world data sets). But also minor derivatization of a known ligand ((R)-p-OH-BINAP) apparently did not work.
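Mixed inputs like these could be sorted into rough categories before resolution, so that CAS numbers and bare formulas are not sent through as names. The sketch below is a heuristic with hypothetical function names: CAS numbers are recognized by their format and check digit (the last digit equals the weighted sum of the preceding digits, counted from the right, modulo 10), and formulas by a simple element-count pattern that does not validate element symbols.

```python
import re

# CAS Registry Number format: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")
# Heuristic formula pattern: two or more element-like tokens and at least one digit.
FORMULA_PATTERN = re.compile(r"^(?=.*\d)(?:[A-Z][a-z]?\d*){2,}$")


def is_cas_number(query: str) -> bool:
    """Check the CAS format and verify the check digit."""
    match = CAS_PATTERN.match(query.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    checksum = sum(weight * int(digit) for weight, digit in enumerate(reversed(digits), start=1))
    return checksum % 10 == int(match.group(3))


def classify_query(query: str) -> str:
    """Return 'cas_number', 'formula', or 'name' for a raw query string."""
    stripped = query.strip()
    if is_cas_number(stripped):
        return "cas_number"
    if FORMULA_PATTERN.match(stripped):
        return "formula"
    return "name"


kinds = {q: classify_query(q) for q in ["137219-86-4", "C58H70N2O10P2", "(R)-p-OH-BINAP"]}
```

Formula-only queries could then be flagged for manual curation, since a molecular formula alone does not identify a single structure.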

So, in conclusion: currently it has success about half of the time. (Quite the improvement from v0.1, right?) Metal complexes and slightly non-standard spellings remain difficult. I suspect that adding more services might help. What do you think?

marcosfelt commented 1 year ago

  1. First of all, Pd(OAc)2 misbehaves:

Fixing this in #24

  2. Resolving the remaining queries took about an hour (!). It did 1-38 in 19 seconds, 1-39 in 41:46 min, and 1-40 in 01:07 h. Is this reproducible on your end?

  3. The order of compounds in Pura's output is different from my query. Is this on purpose? Doesn't matter so much, but can be a bit annoying because it creates an extra step of re-indexing everything.

The out of order thing is not obvious; it even tripped me up at first! Since I am using asynchronous calls to make things faster (by enabling multiple calls in parallel), services can return out of order. That's why I return both the input identifier and the resolved identifiers.

  5. Would it be possible to generate output without the additional text? Of course this is easy to remove, but again it adds an additional step. I mean this: COc1ccccc1P(CCP(c1ccccc1)c1ccccc1OC)c1ccccc1 instead of this: [CompoundIdentifier(identifier_type=<CompoundIdentifierType.SMILES: 2>, value='COc1ccccc1P(CCP(c1ccccc1)c1ccccc1OC)c1ccccc1', details=None)].

I am actually for this change. resolve_identifiers is supposed to be a convenience function, and the most common use case would be just putting in a list of names and wanting back SMILES, Inchi, etc. I'll go ahead and make this change. #25

This category contains a lot of metal complexes that appear to be difficult. I included two spellings of Rh(NBD)2BF4, but neither was resolved. Interestingly, (R,R)-i-Pr-DUPHOS was ok, but (R,R)-Et-DUPHOS was not. For this category I don't see an obvious reason why it doesn't work. Of course, some queries such as C58H70N2O10P2 were doomed from the start (although this is quite frequent in my real-world data sets). But also minor derivatization of a known ligand ((R)-p-OH-BINAP) apparently did not work.

So, in conclusion: currently it has success about half of the time. (Quite the improvement from v0.1, right?) Metal complexes and slightly non-standard spellings remain difficult. I suspect that adding more services might help. What do you think?

I'm beginning to think that we're probably topping out what's possible with API-based services. You mentioned scraping catalogues: how crazy would it be to try to assemble a dataset that compiles all the organometallic catalogues? I feel like these would cover the majority, right?

rvanputt commented 1 year ago

3. The order of compounds in Pura's output is different from my query. Is this on purpose? Doesn't matter so much, but can be a bit annoying because it creates an extra step of re-indexing everything.

The out of order thing is not obvious; it even tripped me up at first! Since I am using asynchronous calls to make things faster (by enabling multiple calls in parallel), services can return out of order. That's why I return both the input identifier and the resolved identifiers.

Ah, I had forgotten about the asynchronous calls. Again, it isn't a huge issue if both input and output are provided. Guess one would simply have to join the original data and Pura's output using the input as a key.
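A join on the input could be as small as the sketch below. It assumes the output is a list of (input, resolved identifiers) pairs, as described above; the helper name `reorder_like_queries` and the toy values are hypothetical.

```python
from typing import Dict, List, Tuple


def reorder_like_queries(
    queries: List[str], results: List[Tuple[str, List[str]]]
) -> List[Tuple[str, List[str]]]:
    """Return results in the order of the original queries; unresolved queries get an empty list."""
    by_input: Dict[str, List[str]] = {input_name: resolved for input_name, resolved in results}
    return [(query, by_input.get(query, [])) for query in queries]


queries = ["(R)-Phanephos", "Pd(OAc)2", "(R,R)-Et-BPE"]
# Toy output arriving out of order, with one query missing entirely.
results = [("(R,R)-Et-BPE", ["SMILES_B"]), ("(R)-Phanephos", ["SMILES_A"])]
ordered = reorder_like_queries(queries, results)
```

Keying on the input string also makes it obvious which queries returned nothing, since they end up with an empty list.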

5. Would it be possible to generate output without the additional text? Of course this is easy to remove, but again it adds an additional step. I mean this: COc1ccccc1P(CCP(c1ccccc1)c1ccccc1OC)c1ccccc1 instead of this: [CompoundIdentifier(identifier_type=<CompoundIdentifierType.SMILES: 2>, value='COc1ccccc1P(CCP(c1ccccc1)c1ccccc1OC)c1ccccc1', details=None)].

I am actually for this change. resolve_identifiers is supposed to be a convenience function, and the most common use case would be just putting in a list of names and wanting back SMILES, Inchi, etc. I'll go ahead and make this change. #25

I think you're right about that being the most common use case. Would be nice!

This category contains a lot of metal complexes that appear to be difficult. I included two spellings of Rh(NBD)2BF4, but neither was resolved. Interestingly, (R,R)-i-Pr-DUPHOS was ok, but (R,R)-Et-DUPHOS was not. For this category I don't see an obvious reason why it doesn't work. Of course, some queries such as C58H70N2O10P2 were doomed from the start (although this is quite frequent in my real-world data sets). But also minor derivatization of a known ligand ((R)-p-OH-BINAP) apparently did not work. So, in conclusion: currently it has success about half of the time. (Quite the improvement from v0.1, right?) Metal complexes and slightly non-standard spellings remain difficult. I suspect that adding more services might help. What do you think?

I'm beginning to think that we're probably topping out what's possible with API-based services. You mentioned scraping catalogues: how crazy would it be to try to assemble a dataset that compiles all the organometallic catalogues? I feel like these would cover the majority, right?

  • Strem
  • Solvias
  • Sigma

Hmm, you might indeed be right that this is about the limit. Scraping catalogues would probably work for common materials, although not all of them will have SMILES and InChI available. There might be a lot of curation still to be done. I'd also add Umicore for metal complexes and potentially abcr for ligands.

Would the idea be to have an internal 'Pura' library for resolution?

marcosfelt commented 1 year ago

On the catalogs, I'm thinking I will either include a library inside pura or create an API just for organometallics. I've been able to scrape most of the CAS numbers off of Strem this evening, so I'm pretty hopeful about this direction.

rvanputt commented 1 year ago

Fingers crossed this will work! If it does, it might be nice to do the same for chiral ligands?