rxn4chemistry / rxn_yields

Code complementing our manuscript on the prediction of chemical reaction yields (https://iopscience.iop.org/article/10.1088/2632-2153/abc81d) and data augmentation strategies (https://doi.org/10.26434/chemrxiv.13286741).
https://rxn4chemistry.github.io/rxn_yields/
MIT License

Problem with reproducing Suzuki-Miyaura results #7

Closed Nanotekton closed 3 years ago

Nanotekton commented 3 years ago

Hi! I've successfully trained the Buchwald model from scratch with your training/evaluation scripts. However, for the Suzuki reaction I'm getting a negative R2; the model does not seem to learn at all. Could you confirm that the provided hyperparameters are correct? (The data should be OK, as I was able to get good results with the saved models downloaded from this repo.)

As a side note: what is the meaning of '~' in the SMILES representation of Pd complexes? The Buchwald dataset suggests something like a coordination bond; in the Suzuki dataset, however, it looks more like a special separator. I think it was described in some paper, but I couldn't find it.

Nanotekton commented 3 years ago

I guess I found a solution, though it's counterintuitive. In training_scripts/launch_suzuki_miyaura_training.py, parameter "evaluate_during_training" in model_args (line 87) should be set to True.
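A minimal sketch of that change, for readers following along. Only the key "evaluate_during_training" and the value True come from this thread; treating model_args as a plain Python dict is an assumption, and every other entry of the real script is omitted here rather than guessed.

```python
# Sketch only: the actual script defines many more entries in model_args.
# Representing model_args as a plain dict is an assumption; the key name
# and the value True are the fix described in the comment above.
model_args = {
    "evaluate_during_training": True,  # previously not set to True
}
```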

However, my question about the meaning of the tilde is still open.

pschwllr commented 3 years ago

I agree, it's counterintuitive but I'm glad you found a solution.

In the Buchwald dataset, the aim was to describe the original catalyst structure (Ahneman et al., Science 360, 186–190 (2018)):

[image]

The tricky part is the nitrogen atom (RDKit will raise an exception if the explicit valence is 4).

[image]

I've seen that the canonical version of that SMILES generated by RDKit results in:

[image]

Hence, this choice was not ideal. Now, I would probably go for another catalyst representation like: O=S(=O)(O[Pd-]1[NH2+]C2C=CC=CC=2C2C=CC=CC1=2)C(F)(F)F

[image]

Anyhow, I would not expect this to significantly change the performance of the models.

For the Suzuki dataset, the ~ in CC(=O)O~CC(=O)O~[Pd] is used as a fragment group bond to keep the "Pd(OAc)2" compound together in the reaction string. We introduced this fragment group bond in "Predicting retrosynthetic pathways using transformer-based models and a hyper-graph exploration strategy" and discuss it in the SI of that article.
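The effect of the fragment group bond can be illustrated with plain string handling. This is a rough sketch, not code from the repository: the helper names split_compounds and expand_fragments are hypothetical, and the second precursor (bromobenzene) is made up for the example. Only the string CC(=O)O~CC(=O)O~[Pd] comes from the dataset.

```python
from typing import List

FRAGMENT_BOND = "~"  # joins the fragments of a single compound


def split_compounds(side: str) -> List[str]:
    """Split one side of a reaction string into compounds on '.'.

    Fragments joined by '~' contain no '.', so they stay in one piece.
    """
    return side.split(".")


def expand_fragments(compound: str) -> List[str]:
    """Break a single compound into its '~'-joined fragments."""
    return compound.split(FRAGMENT_BOND)


# Pd(OAc)2 written with fragment bonds, plus a hypothetical second precursor.
precursors = "CC(=O)O~CC(=O)O~[Pd].Brc1ccccc1"

compounds = split_compounds(precursors)
catalyst_fragments = expand_fragments(compounds[0])

# For tools that expect the conventional '.' separator between fragments.
catalyst_dotted = compounds[0].replace(FRAGMENT_BOND, ".")
```

Splitting on "." alone would scatter the three Pd(OAc)2 fragments among the other precursors; with "~" the compound survives as a single token and can be expanded only when needed.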

Nanotekton commented 3 years ago

Thanks, now I know everything I wanted. Closing issue.