Open marcosfelt opened 1 year ago
Regarding SL-J001-1: I get a partial match in ChemSpider (they have 'Josiphos SL-J001-1' in their record), and a good match in CAS SciFinder with substance search. I'm surprised Common Chemistry doesn't know this ligand. Is it possible SciFinder has access to a different CAS database?
Addition of Sigma-Aldrich and Solvias catalogues would definitely help, but it might not be a silver bullet. Their products should already be in at least one of the databases. That makes me wonder if it is more a matter of finding them. This could, e.g., be done by allowing partial name matches for trade names or by using look-up tables for slightly different spellings ((R,R)-iPr-DuPhos, (R,R)-i-Pr-DuPhos, etc.). Again, this information is available in SciFinder, see below. The question is how to get to it. Or do you think that won't be feasible?
I like the idea of doing different lookup names. Maybe we could have an option in the resolution algorithm to automatically replace certain "phrases" in the name with alternative spellings if the lookup fails.
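A minimal sketch of that fallback, assuming a hand-written table of interchangeable spellings. `SPELLING_GROUPS` and `spelling_variants` are hypothetical names used for illustration, not part of pura, and the two groups are just the spellings that appear in this thread:

```python
# Hypothetical groups of interchangeable spellings (illustrative only).
SPELLING_GROUPS = [
    ("iPr", "i-Pr"),
    ("DuPhos", "DUPHOS"),
]


def spelling_variants(name):
    """Return the original name followed by every respelled variant."""
    variants = [name]
    for group in SPELLING_GROUPS:
        # Iterate over a snapshot so variants added for this group
        # are not re-expanded within the same group.
        for existing in list(variants):
            for current in group:
                if current not in existing:
                    continue
                for replacement in group:
                    candidate = existing.replace(current, replacement)
                    if candidate not in variants:
                        variants.append(candidate)
    return variants


variants = spelling_variants("(R,R)-iPr-DuPhos")
```

Each variant could then be tried in turn once the original name fails to resolve. Plain substring replacement is fragile when one spelling is contained in another (for example "t-Bu" inside "tert-Bu"), so a real table would need more careful matching.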
An additional idea is to have a lookup table of pre-catalysts since there are a smaller number of those. I wrote something that did this for a project with all Ruthenium catalysts last year. I just used regex:
regex_lookup = {
    r"^(Chloro)([\{\}\[\]\+\-\(\),'a-zA-Z\dηα]+)ruthenium\(II\) chloride": "Cl[Ru]Cl",
    r"^(Chloro)([\{\}\[\]\+\-\(\),'a-zA-Z\dηα]+)\(p-cymene\)ruthenium\(II\) chloride": "Cl[Ru]Cl.CC(C)c1ccc(C)cc1",
    r"^(Dichloro)([\[\]\+\-\(\),'a-zA-Z\dηα]+)ruthenium\(II\)": "Cl[Ru]Cl",
    r"^(Dichloro)([\[\]\+\-\(\),'a-zA-Z\dηα]+)(?:\(p-cymene\))ruthenium\(II\)": "Cl[Ru]Cl.CC(C)c1ccc(C)cc1",
    r"(Diacetato)([\{\}\[\]\+\-\(\),'a-zA-Z\dηα]+)ruthenium\(II\)": "[O-]CC(=O)[Ru]CC(=O)[O-]",
    r"(Chloro)([\[\]\+\-\(\),'a-zA-Z\d]+)\(mesitylene\)ruthenium\(II\)": "Cc1cc(cc(c1)C)C.Cl[Ru]",
    r"(Chloro)([\[\]\+\-\(\),'a-zA-Z\d]+)\(pyridine\)ruthenium\(II\)": "c1ccncc1.Cl[Ru]Cl",
    r"(Chloro)([\[\]\+\-\(\),'a-zA-Z\d]+)ruthenium\(II\)": "Cl[Ru]",
    r"([\[\]\+\-\(\),'a-zA-Z\d]+)\(p-cymene\)ruthenium\(II\) chloride": "[Ru]Cl",
    r"^(Chloro)([\{\}\[\]\+\-\(\),'a-zA-Z\dηα]+)\(p-cymene\)ruthen(?:ium\(II\))?": "Cl[Ru].CC(C)c1ccc(C)cc1",
    r"^(Chloro)([\{\}\[\]\+\-\(\),'a-zA-Z\dηα]+)ruthenium\(II\)": "Cl[Ru]",
}
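How such a table gets applied is not shown here, so the following is only a sketch under one assumption: patterns are tried in order and the first match wins. `PATTERNS` and `metal_fragment` are hypothetical names, and only two patterns in the style of the table are included:

```python
import re

# Character class for the ligand part of the name, in the style of the table.
LIGAND = r"[\{\}\[\]\+\-\(\),'a-zA-Z\dηα]+"

# More specific patterns come first: the general pattern would also match
# p-cymene names, because "(p-cymene)" fits inside the ligand character class.
PATTERNS = [
    (
        re.compile(r"^Chloro" + LIGAND + r"\(p-cymene\)ruthenium\(II\) chloride"),
        "Cl[Ru]Cl.CC(C)c1ccc(C)cc1",
    ),
    (
        re.compile(r"^Chloro" + LIGAND + r"ruthenium\(II\) chloride"),
        "Cl[Ru]Cl",
    ),
]


def metal_fragment(name):
    """Return the SMILES fragment of the first matching pattern, or None."""
    for pattern, smiles in PATTERNS:
        if pattern.search(name):
            return smiles
    return None


name = "Chloro[(S)-2,2'-bis(diphenylphosphino)-1,1'-binaphthyl](p-cymene)ruthenium(II) chloride"
fragment = metal_fragment(name)
```

With first-match-wins, the order of the entries matters, which is one more thing to maintain by hand in a regex table.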
Going even further, it may be possible to train a ML model to do this automatically (because writing regex strings takes a lot of effort, even if you know what you're doing). I wonder if we could come up with a dataset of a bunch of pre-catalysts and ligands and then just create a dataset from the enumeration of them with different spellings? We could then basically just fine-tune this model on that data.
I think both the look-up and regex direction (when coupled to ML) might be viable options for a majority of common cases and organic fragments. For metal precursors and chiral ligands I'm not fully sure, though. Again, this is because naming of metal precursors is hopelessly unsystematic (same problem as the chiral ligands). This means there will be a lot of different names, and getting sufficient agreement will be difficult. Let's consider Rh(NBD)2BF4 as an example. A quick search on SciFinder, PubChem, and ChemSpider (by CAS#) returned a lot of known aliases:
36620-11-8
Rh(nbd)2BF4
Rhodium(1+), bis[(2,3,5,6-η)-bicyclo[2.2.1]hepta-2,5-diene]-, tetrafluoroborate(1-) (1:1) (ACI)
Bicyclo[2.2.1]hepta-2,5-diene, rhodium complex (ZCI)
Borate(1-), tetrafluoro-, bis[(2,3,5,6-η)-bicyclo[2.2.1]hepta-2,5-diene]rhodium(1+) (ZCI)
Rhodium(1+), bis[(2,3,5,6-η)-bicyclo[2.2.1]hepta-2,5-diene]-, tetrafluoroborate(1-) (9CI)
(NBD)2RhBF4
Bis(bicyclo[2.2.1]hepta-2,5-diene)rhodium tetrafluoroborate
Bis(norbornadiene)(tetrafluoroborato)rhodium
Bis(norbornadiene)rhodium tetrafluoroborate
Bis(norbornadiene)rhodium(1+) tetrafluoroborate
Bis(norbornadiene)rhodium(I) tetrafluoroborate
[Rh(NBD)2]BF4
[rh(norbornadiene)2]bf4
Chiralyst P374
bis[??-(2,5-norbornadiene)]rhodium(i) tetrafluoroborate
Bis[η-(2,5-norbornadiene)]rhodium(I) Tetrafluoroborate
[Bis[eta-(2,5-norbornadiene)]rhodium(I) Tetrafluoroborate]
MFCD00671775
This made me think about how to leverage the services that are already available. I see that they probably contain what we're looking for, but it is a matter of using the right query to get to it.
In order to do that, what do you think about using an iterative approach? Think of it as a two-stage rocket. First, a given query is given to all services. Chances are it will be found in at least one of these (a ranked output would probably also be helpful here). This service will probably also have a record of its CAS number or other alias. This CAS# or alias can then be used to query the other services for a second round in order to get sufficient agreement.
Advantage of this approach is that it relies on the code you've written so far. Of course, it also comes with a risk of false positives, but if we can prioritise CAS# for the second round, I think this should be fairly minimal.
What do you think?
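The two-stage idea can be sketched without touching any real service. Everything below is an assumption made for illustration: each "service" is a plain function returning a dict with `smiles` and `cas` keys (or None), `SMILES_A` is a placeholder string rather than a real structure, and none of the names come from pura:

```python
from collections import Counter


def agreed_smiles(records, agreement):
    """Return the SMILES reported by at least `agreement` records, else None."""
    votes = Counter(r["smiles"] for r in records if r and r.get("smiles"))
    if not votes:
        return None
    smiles, count = votes.most_common(1)[0]
    return smiles if count >= agreement else None


def two_stage_resolve(query, services, agreement=2):
    # Stage 1: send the original query to every service.
    first_round = [lookup(query) for lookup in services]
    result = agreed_smiles(first_round, agreement)
    if result is not None:
        return result
    # Stage 2: reuse any CAS number found in stage 1 as the query for
    # the services that returned nothing in stage 1.
    cas_numbers = {r["cas"] for r in first_round if r and r.get("cas")}
    for cas in sorted(cas_numbers):
        merged = [first or lookup(cas) for first, lookup in zip(first_round, services)]
        result = agreed_smiles(merged, agreement)
        if result is not None:
            return result
    return None


def make_service(table):
    """Build a stand-in service from a lookup table (hypothetical helper)."""
    return lambda identifier: table.get(identifier)


record = {"smiles": "SMILES_A", "cas": "36620-11-8"}
knows_name = make_service({"Rh(nbd)2BF4": record})
knows_cas = make_service({"36620-11-8": record})

resolved = two_stage_resolve("Rh(nbd)2BF4", [knows_name, knows_cas])
```

Each service contributes at most one vote, so a service that answers in both stages cannot reach agreement with itself.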
That's brilliant! I think I could have an argument like "backup_identifier_type" (not sure about the name yet) that would be used as a fallback like you've described. I'll try to implement that today!
Cool! Curious how well it will work and what the hit rate is going to be. Let me know if you want more things to test beyond the pptx. The examples in there are already difficult ones, so if the new approach can do those, I'm hopeful it will be able to handle the rest.
Okay perfect!
@rvanputt Could you take a look at this PR and see if it's what you were thinking: https://github.com/sustainable-processes/pura/pull/14
Looks good! I'm curious to see how well it works!
I just pushed a new version to pypi with these changes, so you can install that (instead of directly from source):
pip install -U pura
To use the new feature, add a backup_identifier_types keyword argument to resolve_identifiers:
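from pura.services import PubChem, CIR
from pura.resolvers import resolve_identifiers
from pura.compound import CompoundIdentifierType
resolved = resolve_identifiers(
    ["Josiphos SL-J001-1"],
    input_identifer_type=CompoundIdentifierType.NAME,
    output_identifier_type=CompoundIdentifierType.SMILES,
    # Identifier types to fall back on when the name alone does not reach agreement
    backup_identifier_types=[CompoundIdentifierType.INCHI_KEY],
    services=[
        PubChem(),
        CIR(),
    ],
    agreement=2,
)
print(resolved)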
I have just tested the new code with a small set of chiral ligands and metal precursors, using PubChem, CIR, and ChemSpider. It is now able to resolve more than before - very good! I've also seen some behaviour that I don't yet fully understand that I think is interesting.
First of all, the test set and code:
from pura.services import PubChem, CIR, ChemSpider
from pura.resolvers import resolve_identifiers
from pura.compound import CompoundIdentifierType
resolved = resolve_identifiers(
    [
        "(R,R)-Dipamp",
        "(R,R)-Et-DUPHOS",
        "(R)-Phanephos",
        "(R,R)-Et-BPE",
        "Taniaphos SL-T001-1",
        "(R,R)-BenzP*",
        "(R)-DM-BINAP",
        "(R)-DTBM-SEGPHOS",
        "Rh(nbd)2BF4",
        "(p-Cymene)ruthenium(II) chloride dimer",
    ],
    input_identifer_type=CompoundIdentifierType.NAME,
    output_identifier_type=CompoundIdentifierType.SMILES,
    backup_identifier_types=[CompoundIdentifierType.INCHI_KEY],
    services=[
        PubChem(),
        CIR(),
        ChemSpider(token="token"),
    ],
    agreement=2,
)
(R,R)-Dipamp, (R)-Phanephos, (R,R)-Et-BPE, Taniaphos SL-T001-1, (R)-DM-BINAP, (R)-DTBM-SEGPHOS, and (p-Cymene)ruthenium(II) chloride dimer resolve well. I'm not sure whether this is because I included ChemSpider or because the back-up resolution works well; I did not test that. Either way, this is a good result.
The other entries were somehow more difficult, despite these exact queries being present in at least one of the databases. For example, DuPhos and (R,R)-BenzP* are very common ligands with unambiguous structures or names. Perhaps unsurprisingly, SL-T001-1 alone did not work, but the inclusion of Taniaphos helped. This is not ideal (no partial matches then?), but can be addressed in a pre-cleaning step because Solvias' encoding is straightforward (SL-J00X for JosiPhos, SL-T00X for TaniaPhos, W00X for WalPhos, M00X for Mandyphos, etc.). I have not tested how it handles the BIPHEPs (SL-A10X-X) and ChenPhos ligands.
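Such a pre-cleaning step could look roughly like the sketch below. It assumes that all four families use the `SL-<letter><three digits>-<digit>` form seen in SL-J001-1 and SL-T001-1 (the W and M codes are written without the SL- prefix above, so that part is a guess), and `expand_solvias_code` is a hypothetical helper, not a pura function:

```python
import re

# Family letters taken from the codes listed above; BIPHEP and ChenPhos
# codes are deliberately left out because their format was not tested.
SOLVIAS_FAMILIES = {
    "J": "Josiphos",
    "T": "Taniaphos",
    "W": "Walphos",
    "M": "Mandyphos",
}
# Assumed code format: SL-<family letter><three digits>-<one digit>.
SOLVIAS_CODE = re.compile(r"^SL-([JTWM])\d{3}-\d$")


def expand_solvias_code(query):
    """Prefix a bare Solvias code with its family name; leave other queries alone."""
    cleaned = query.strip()
    match = SOLVIAS_CODE.match(cleaned)
    if match is None:
        return query
    return f"{SOLVIAS_FAMILIES[match.group(1)]} {cleaned}"


expanded = expand_solvias_code("SL-J002-1")
```

Whether the expanded names actually resolve would still have to be checked per service; the sketch only normalises the query.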
Frustratingly, Rh(NBD)2BF4 still is not resolved. I'm not sure what the problem is. It is present in PubChem with exactly that naming. I get an error that the SMILES string is not electrically neutral, but then it just gives up.
I noticed two more things. First, execution of the queue stops when the current query cannot be resolved. This means that the code above won't run until all bad entries are commented out. Not mission critical, but a bit annoying. Makes me want to write a loop to call the code one-by-one... :)
Finally, I played around a bit with using different output- and backup identifier types. Here I am a bit puzzled. Using INCHI and INCHI_KEY works well most of the time. (I suspect not all services have this information for all examples above, even though I haven't checked.) But when using CAS_NUMBER for either identifier, unfortunately I get errors. Last output:
File "/.../pura/services/pubchem.py", line 165, in get_properties
properties = ",".join([PROPERTY_MAP.get(p, p) for p in properties])
TypeError: sequence item 0: expected str instance, NoneType found
When I remove PubChem from the services and try it with the other two, the error changes
ValueError: [<CompoundIdentifierType.CAS_NUMBER: 7>] are invalid output identifier type(s) for ChemSpider
Does that make any sense to you?
All in all, I think this is a leap towards automatically resolving these more difficult compounds! I'm curious to hear what you think about these results. And fingers crossed you know how to work with CAS#!
Thanks for the detailed feedback!
> The other entries were somehow more difficult, despite these exact queries being present in at least one of the databases. For example, DuPhos and (R,R)-BenzP* are very common ligands with unambiguous structures or names. Perhaps unsurprisingly, SL-T001-1 alone did not work, but the inclusion of Taniaphos helped. This is not ideal (no partial matches then?), but can be addressed in a pre-cleaning step because Solvias' encoding is straightforward (SL-J00X for JosiPhos, SL-T00X for TaniaPhos, W00X for WalPhos, M00X for Mandyphos, etc.). I have not tested how it handles the BIPHEPs (SL-A10X-X) and ChenPhos ligands.
I think this is an issue of there not being exact matches (e.g., DuPhos vs (S,S)-I-Pr-DUPHOS). I'll look into the partial names question in this issue: #18.
> Frustratingly, Rh(NBD)2BF4 still is not resolved. I'm not sure what the problem is. It is present in PubChem with exactly that naming. I get an error that the SMILES string is not electrically neutral, but then it just gives up.
Hopefully, the electrically neutral bit is a warning. I checked and Rh(NBD)2BF4 resolves only on PubChem. I want to go back and check to see if I can handle it by fixing some of the other bugs below.
> I noticed two more things. First, execution of the queue stops when the current query cannot be resolved. This means that the code above won't run until all bad entries are commented out. Not mission critical, but a bit annoying. Makes me want to write a loop to call the code one-by-one... :)
I failed to mention this earlier: add the keyword argument silent=True. This will allow the algorithm to continue running even if there is an error.
> Finally, I played around a bit with using different output- and backup identifier types. Here I am a bit puzzled. Using INCHI and INCHI_KEY works well most of the time. (I suspect not all services have this information for all examples above, even though I haven't checked.) But when using CAS_NUMBER for either identifier, unfortunately I get errors. Last output:
> File "/.../pura/services/pubchem.py", line 165, in get_properties
> properties = ",".join([PROPERTY_MAP.get(p, p) for p in properties])
> TypeError: sequence item 0: expected str instance, NoneType found
This is a bug which I'll fix (#17).
> When I remove PubChem from the services and try it with the other two, the error changes
> ValueError: [<CompoundIdentifierType.CAS_NUMBER: 7>] are invalid output identifier type(s) for ChemSpider
> Does that make any sense to you?
This error is expected since some services are not compatible with certain CompoundIdentifierType values. If you set silent=True, then the error should be skipped over. I'll make sure to mention the use of silent=True in the documentation when I get to it.
> I failed to mention this earlier: add the keyword argument silent=True. This will allow the algorithm to continue running even if there is an error.
This is nice! Thanks a lot. Let me know if there is anything I can do to test. If all goes well, I might try some of our real set next time.
I just published 0.2.1 to pypi which includes a partial name search feature for PubChem and fixes to the bugs mentioned above. So a complete example would look like:
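from pura.services import PubChem, CIR, CAS, ChemSpider
from pura.resolvers import resolve_identifiers
from pura.compound import CompoundIdentifierType
resolved = resolve_identifiers(
    ["Josiphos SL-J001-1", "Rh(NBD)2BF4", "DuPhos"],
    input_identifer_type=CompoundIdentifierType.NAME,
    output_identifier_type=CompoundIdentifierType.SMILES,
    backup_identifier_types=[
        CompoundIdentifierType.INCHI_KEY,
        CompoundIdentifierType.CAS_NUMBER,
    ],
    # autocomplete=True enables the partial name search on PubChem
    services=[PubChem(autocomplete=True), CIR(), CAS(), ChemSpider()],
    agreement=1,
    # silent=True keeps the run going when a single query fails
    silent=True,
)
print(resolved)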
To install
pip install -U pura
I created a test set and tried to resolve them w/ 0.2.2. This set includes both previous failures and new queries in different formats. Code is below.
from pura.services import PubChem, CIR, ChemSpider
from pura.resolvers import resolve_identifiers
from pura.compound import CompoundIdentifierType
resolved = resolve_identifiers(
    [
        "(R,R)-Dipamp",
        "(R,R)-Et-DUPHOS",
        "(R)-Phanephos",
        "(R,R)-Et-BPE",
        "Taniaphos SL-T001-1",
        "(R,R)-BenzP*",
        "(R)-DM-BINAP",
        "(R)-DTBM-SEGPHOS",
        "Rh(nbd)2BF4",
        "(p-Cymene)ruthenium(II) chloride dimer",
        "137219-86-4",
        "(2S)-1-[(1S)-1-(Dicyclohexylphosphino)ethyl]-2-(diphenylphosphino)ferrocene",
        "SL-J002-1",
        "(S)-[6,6'-Dimethoxy(1,1'-biphenyl)-2,2'-diyl]bis{bis[3,5-bis(1,1-dimethylethyl)-4-methoxyphenyl]phosphine}",
        "(R)-(+)-(6,6'-Dimethoxybiphenyl-2,2'-diyl)bis(diphenylphosphine)",
        "(S)-1-{(SP)-2-[2-(Diphenylphosphino)phenyl]ferrocenyl}ethylbis[3,5-bis-(trifluoromethyl)phenyl]phosphine",
        "1-[(RP)-2-(Dicyclohexylphosphino)ferrocenyl]ethyldi-tert-butylphosphine",
        "(R,R)-i-Pr-DUPHOS",
        "(1S,1'S,2R,2'R)-2,2'-Di-tert-butyl-2,3,2',3'-tetrahydro-1H,1'H(1,1')biisophosphindolyl",
        "Mandyphos 4-1",
        "(R)-BIDIME",
        "(S)-(+)-4,12-Bis[di(3,5-xylyl)phosphino]-[2.2]-paracyclophane",
        "[Ir(cod)Cl]2",
        "[Rh(NBD)2]BF4",
        "{(R)-1-[(Sp)-2-(Dicyclohexylphosphino)ferrocenyl]ethyldi-tert-butylphosphine}[2-(2'-amino-1,1'-biphenyl)]palladium(II) methanesulfonate",
        "RuCl(p-cymene)[(S,S)-Ts-DPEN]",
        "Chiralyst Rh1228",
        "Diacetato[(R)-2,2'-bis[di(3,5-xylyl)phosphino]-1,1'-binaphtyl]ruthenium(II)",
        "Chloro[(S)-2,2'-bis(diphenylphosphino)-1,1'-binaphthyl](p-cymene)ruthenium(II) chloride",
        "(R)-AntPhos",
        "(S,S)-BABIBOP",
        "(2R,2'R,3R,3'R)-MeO-BIBOP",
        "(S)-2-(1-Diphenylphosphino-2-methylpropan-2-yl)-4-isopropyl-4,5-dihydrooxazole",
        "1152313-76-2",
        "Ru(Me-allyl)2(COD)",
        "Pd(OAc)2",
        "(R)-ProPhos",
        "(R)-p-OH-BINAP",
        "(R)-DiFluorPhos",
        "(S)-MonoPhosTM",  # Trademarked brand name
        "C58H70N2O10P2",
    ],
    input_identifer_type=CompoundIdentifierType.NAME,
    output_identifier_type=CompoundIdentifierType.SMILES,
    backup_identifier_types=[
        CompoundIdentifierType.INCHI_KEY,
        CompoundIdentifierType.CAS_NUMBER,
    ],
    services=[
        PubChem(autocomplete=True),
        CIR(),
        ChemSpider(token="token"),
    ],
    agreement=2,
    silent=True,
)
There are some interesting things going on here:
First of all, Pd(OAc)2 misbehaves:
.../pura/pura_test_v2.py
Batch: 0%| | 0/1 [00:00<?, ?it/s/.../pura/compound.py:101: UserWarning: Warning: SMILES of a mixture, rather than a pure compound, was found. [00:00<?, ?it/s]
warnings.warn(
[12:46:21] SMILES Parse Error: syntax error while parsing: [Pd](|OC(C)=O)|OC(C)=O
[12:46:21] SMILES Parse Error: Failed parsing SMILES '[Pd](|OC(C)=O)|OC(C)=O' for input: '[Pd](|OC(C)=O)|OC(C)=O'
Batch: 0%| | 0/1 [00:01<?, ?it/s]
Traceback (most recent call last):
File ".../pura/pura_test_v2.py", line 7, in <module>
resolved = resolve_identifiers(
File ".../pura/resolvers.py", line 491, in resolve_identifiers
return resolver.resolve(
File ".../pura/resolvers.py", line 237, in resolve
return loop.run_until_complete(
File "...//asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File ".../pura/resolvers.py", line 295, in _resolve
resolved_identifiers.extend([await f for f in batch_bar])
File ".../pura/resolvers.py", line 295, in <listcomp>
resolved_identifiers.extend([await f for f in batch_bar])
File ".../asyncio/tasks.py", line 611, in _wait_for_one
return f.result() # May raise f.exception().
File ".../pura/resolvers.py", line 343, in _resolve_one_compound
standardize_identifier(identifier)
File ".../pura/compound.py", line 137, in standardize_identifier
for a in mol.GetAtoms():
AttributeError: 'NoneType' object has no attribute 'GetAtoms'
Resolving the remaining queries took about an hour (!). It did 1-38 in 19 seconds, 1-39 in 41:46 min, and 1-40 in 01:07 h. Is this reproducible on your end?
The order of compounds in Pura's output is different from my query. Is this on purpose? Doesn't matter so much, but can be a bit annoying because it creates an extra step of re-indexing everything.
The addition of silent=True works well - thanks!
Would it be possible to generate output without the additional text? Of course this is easy to remove, but again it adds an additional step. I mean this: COc1ccccc1P(CCP(c1ccccc1)c1ccccc1OC)c1ccccc1
instead of this: **[CompoundIdentifier(identifier_type=<CompoundIdentifierType.SMILES: 2>, value='**COc1ccccc1P(CCP(c1ccccc1)c1ccccc1OC)c1ccccc1**', details=None)]**.
Then, on to the results. Of 41 queries, 14 queries were perfect (non-chiral compound, axial stereochemistry, or stereochemistry available in output):
(p-Cymene)ruthenium(II) chloride dimer
137219-86-4
(R)-(+)-(6,6'-Dimethoxybiphenyl-2,2'-diyl)bis(diphenylphosphine)
(S)-[6,6'-Dimethoxy(1,1'-biphenyl)-2,2'-diyl]bis{bis[3,5-bis(1,1-dimethylethyl)-4-methoxyphenyl]phosphine}
(R)-Phanephos
(S)-(+)-4,12-Bis[di(3,5-xylyl)phosphino]-[2.2]-paracyclophane
(R)-DM-BINAP
(R)-DTBM-SEGPHOS
(R)-DiFluorPhos
(S)-2-(1-Diphenylphosphino-2-methylpropan-2-yl)-4-isopropyl-4,5-dihydrooxazole
Taniaphos SL-T001-1
(R,R)-Et-BPE
(R,R)-i-Pr-DUPHOS
(R)-ProPhos
1 query gave output, but the SMILES was not great (but not Pura's fault):
Ru(Me-allyl)2(COD)
1 query did not run:
Pd(OAc)2
4 queries returned the correct structure, but lost stereochemical information. Three of four contained stereogenic phosphorus atoms.
(R)-AntPhos
(R,R)-Dipamp
(R,R)-BenzP*
1152313-76-2
2 queries produced output that was wrong, despite agreement being set to two.
(1S,1'S,2R,2'R)-2,2'-Di-tert-butyl-2,3,2',3'-tetrahydro-1H,1'H(1,1')biisophosphindolyl
(S)-MonoPhosTM
The first is an alternative name of (S,S,R,R)-DuanPhos, for which Pura found CP(C(C)(C)C)C(C)(C)C. The second one is (S)-MonoPhos, which contains a trademark identifier. This is quite common when scraping catalogs, etc. Somehow, it resolves to CN(C)p1oc2ccc3ccccc3c2c2c(ccc3ccccc32)o1. This is close, but the backbone is incorrectly saturated. (I see that (S)-MonoPhos itself resolves to the same SMILES, so this might be a win for Pura after all?)
The remaining 19 queries did not produce output:
Chloro[(S)-2,2'-bis(diphenylphosphino)-1,1'-binaphthyl](p-cymene)ruthenium(II) chloride
Chiralyst Rh1228
SL-J002-1
(S,S)-BABIBOP
[Ir(cod)Cl]2
Mandyphos 4-1
{(R)-1-[(Sp)-2-(Dicyclohexylphosphino)ferrocenyl]ethyldi-tert-butylphosphine}[2-(2'-amino-1,1'-biphenyl)]palladium(II) methanesulfonate
(2S)-1-[(1S)-1-(Dicyclohexylphosphino)ethyl]-2-(diphenylphosphino)ferrocene
[Rh(NBD)2]BF4
(S)-1-{(SP)-2-[2-(Diphenylphosphino)phenyl]ferrocenyl}ethylbis[3,5-bis-(trifluoromethyl)phenyl]phosphine
(R)-BIDIME
(R,R)-Et-DUPHOS
(2R,2'R,3R,3'R)-MeO-BIBOP
1-[(RP)-2-(Dicyclohexylphosphino)ferrocenyl]ethyldi-tert-butylphosphine
RuCl(p-cymene)[(S,S)-Ts-DPEN]
Rh(nbd)2BF4
Diacetato[(R)-2,2'-bis[di(3,5-xylyl)phosphino]-1,1'-binaphtyl]ruthenium(II)
(R)-p-OH-BINAP
C58H70N2O10P2
This category contains a lot of metal complexes that appear to be difficult. I included two spellings of Rh(NBD)2BF4, but neither was resolved. Interestingly, (R,R)-i-Pr-DUPHOS was ok, but (R,R)-Et-DUPHOS was not. For this category I don't see an obvious reason why it doesn't work. Of course, some queries such as C58H70N2O10P2 were doomed from the start (although this is quite frequent in my real-world data sets). But also minor derivatization of a known ligand ((R)-p-OH-BINAP) apparently did not work.
So, in conclusion: currently it has success about half of the time. (Quite the improvement from v0.1, right?) Metal complexes and slightly non-standard spellings remain difficult. I suspect that adding more services might help. What do you think?
> First of all, Pd(OAc)2 misbehaves:
Fixing this in #24
> Resolving the remaining queries took about an hour (!). It did 1-38 in 19 seconds, 1-39 in 41:46 min, and 1-40 in 01:07 h. Is this reproducible on your end?
> The order of compounds in Pura's output is different from my query. Is this on purpose? Doesn't matter so much, but can be a bit annoying because it creates an extra step of re-indexing everything.
The out of order thing is not obvious; it even tripped me up at first! Since I am using asynchronous calls to make things faster (by enabling multiple calls in parallel), services can return out of order. That's why I return both the input identifier and the resolved identifiers.
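The reordering can be reproduced in isolation with the standard library. This is a minimal sketch, assuming nothing about pura's internals beyond what the traceback above shows (results being awaited in completion order); `fake_lookup` is a hypothetical stand-in for a service call:

```python
import asyncio


async def fake_lookup(name, delay):
    """Pretend to query a service; return the input together with its result."""
    await asyncio.sleep(delay)
    return name, f"result for {name}"


async def main():
    queries = [("slow query", 0.2), ("fast query", 0.01)]
    # as_completed yields results in the order the calls finish.
    completed = [
        await future
        for future in asyncio.as_completed([fake_lookup(n, d) for n, d in queries])
    ]
    # gather returns results in the order the calls were submitted.
    ordered = await asyncio.gather(*(fake_lookup(n, d) for n, d in queries))
    return completed, list(ordered)


completed, ordered = asyncio.run(main())
```

Because each result carries its input name, the caller can match results back to queries regardless of the order in which they arrive.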
> Would it be possible to generate output without the additional text? Of course this is easy to remove, but again it adds an additional step. I mean this: COc1ccccc1P(CCP(c1ccccc1)c1ccccc1OC)c1ccccc1 instead of this: **[CompoundIdentifier(identifier_type=<CompoundIdentifierType.SMILES: 2>, value='**COc1ccccc1P(CCP(c1ccccc1)c1ccccc1OC)c1ccccc1**', details=None)]**.
I am actually for this change. resolve_identifiers is supposed to be a convenience function, and the most common use case would be just putting in a list of names and wanting back SMILES, InChI, etc. I'll go ahead and make this change. #25
> This category contains a lot of metal complexes that appear to be difficult. I included two spellings of Rh(NBD)2BF4, but neither was resolved. Interestingly, (R,R)-i-Pr-DUPHOS was ok, but (R,R)-Et-DUPHOS was not. For this category I don't see an obvious reason why it doesn't work. Of course, some queries such as C58H70N2O10P2 were doomed from the start (although this is quite frequent in my real-world data sets). But also minor derivatization of a known ligand ((R)-p-OH-BINAP) apparently did not work.
> So, in conclusion: currently it has success about half of the time. (Quite the improvement from v0.1, right?) Metal complexes and slightly non-standard spellings remain difficult. I suspect that adding more services might help. What do you think?
I'm beginning to think that we're probably topping out what's possible with API-based services. You mentioned scraping catalogues: how crazy would it be to try to assemble a dataset that compiles all the organometallic catalogues? I feel like these would cover the majority, right?
> > The order of compounds in Pura's output is different from my query. Is this on purpose? Doesn't matter so much, but can be a bit annoying because it creates an extra step of re-indexing everything.
> The out of order thing is not obvious; it even tripped me up at first! Since I am using asynchronous calls to make things faster (by enabling multiple calls in parallel), services can return out of order. That's why I return both the input identifier and the resolved identifiers.
Ah, I had forgotten about the asynchronous calls. Again, it isn't a huge issue if both input and output are provided. Guess one would simply have to join the original data and Pura's output using the input as a key.
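That join is a few lines of plain Python. The sketch below assumes the output is a list of (input, resolved values) pairs, which is inferred from the description above rather than checked against pura; `restore_query_order` is a hypothetical helper and the SMILES strings are placeholders:

```python
def restore_query_order(queries, results):
    """Return one (query, resolved values) pair per query, in query order.

    Queries without a result get an empty list instead of being dropped.
    """
    by_input = dict(results)
    return [(query, by_input.get(query, [])) for query in queries]


queries = ["(R,R)-Dipamp", "Pd(OAc)2", "(R)-Phanephos"]
# Assumed output shape: pairs in arbitrary order, unresolved queries missing.
results = [
    ("(R)-Phanephos", ["PLACEHOLDER_SMILES_B"]),
    ("(R,R)-Dipamp", ["PLACEHOLDER_SMILES_A"]),
]

ordered = restore_query_order(queries, results)
```

Keeping unresolved queries in the list with an empty result makes it easy to line the output up against the original data row by row.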
> > Would it be possible to generate output without the additional text? Of course this is easy to remove, but again it adds an additional step. I mean this: COc1ccccc1P(CCP(c1ccccc1)c1ccccc1OC)c1ccccc1 instead of this: **[CompoundIdentifier(identifier_type=<CompoundIdentifierType.SMILES: 2>, value='**COc1ccccc1P(CCP(c1ccccc1)c1ccccc1OC)c1ccccc1**', details=None)]**.
> I am actually for this change. resolve_identifiers is supposed to be a convenience function, and the most common use case would be just putting in a list of names and wanting back SMILES, InChI, etc. I'll go ahead and make this change. #25
I think you're right about that being the most common use case. Would be nice!
> > This category contains a lot of metal complexes that appear to be difficult. I included two spellings of Rh(NBD)2BF4, but neither was resolved. Interestingly, (R,R)-i-Pr-DUPHOS was ok, but (R,R)-Et-DUPHOS was not. For this category I don't see an obvious reason why it doesn't work. Of course, some queries such as C58H70N2O10P2 were doomed from the start (although this is quite frequent in my real-world data sets). But also minor derivatization of a known ligand ((R)-p-OH-BINAP) apparently did not work.
> > So, in conclusion: currently it has success about half of the time. (Quite the improvement from v0.1, right?) Metal complexes and slightly non-standard spellings remain difficult. I suspect that adding more services might help. What do you think?
> I'm beginning to think that we're probably topping out what's possible with API-based services. You mentioned scraping catalogues: how crazy would it be to try to assemble a dataset that compiles all the organometallic catalogues? I feel like these would cover the majority, right?
> - Strem
> - Solvias
> - Sigma
Hmm, you might indeed be right that this is about the limit. Scraping catalogues would probably work for common materials, although not all of them will have SMILES and InChI available. There might be a lot of curation still to be done. I'd also add Umicore for metal complexes and potentially abcr for ligands.
Would the idea be to have an internal 'Pura' library for resolution?
On the catalogs, I'm thinking I will either include a library inside pura or create an API just for organometallics. I've been able to scrape most of the CAS numbers off of Strem this evening, so I'm pretty hopeful about this direction.
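Scraped CAS numbers can be sanity-checked before they go into a library, because the last digit of a CAS Registry Number is a checksum over the preceding digits. A minimal sketch; `is_valid_cas` is a hypothetical helper, not part of pura:

```python
import re

# A CAS number has 2-7 digits, then 2 digits, then a single check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(text):
    """Check the format and the check digit of a CAS Registry Number."""
    match = CAS_PATTERN.match(text.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one;
    # the check digit is the weighted sum modulo 10.
    checksum = sum(
        int(digit) * weight
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return checksum % 10 == int(match.group(3))


valid = is_valid_cas("36620-11-8")
```

The same check could also be used to spot CAS numbers in a mixed list of queries, such as the 137219-86-4 and 1152313-76-2 entries in the test set above, so that they are sent as CAS numbers rather than names.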
Fingers crossed this will work! If it does, it might be nice to do the same for chiral ligands?
From @rvanputt:
I think the key here is (1) enabling more services and (2) relaxing the agreement algorithm in some cases to handle when only one service can find a compound (see #7).
Taking SL-J001-1 as an example, only PubChem could resolve this name out of the currently available services (PubChem, CIR, CAS, ChemSpider, OPSIN). Even when I looked up SL-J001-1 by its CAS number on Common Chemistry, which should be the definitive source for CAS numbers, nothing returned.
In terms of more services, here are some ideas:
@rvanputt, could you sample some of your difficult organometallics and see if they are available on Sigma or Solvias? If so, I'll look into writing a service for one of those.