jensengroup / RegioML

RegioML predicts the regioselectivity of electrophilic aromatic substitution reactions using machine learning.
MIT License

Questions about using RegioML #1

Open ririjeong opened 2 years ago

ririjeong commented 2 years ago

Hi, I have some questions about RegioML

  1. When no probability is shown for a site in the molecule, does that mean the LGBM model cannot predict one? If so, why not?

  2. What do the black circles mean?

Thanks

NicolaiRee commented 2 years ago

Hi ririjeong,

Thank you for your interest in RegioML!

To answer your questions:

1) In the visual output, we chose to highlight only atoms with scores above 5%. However, if you visit, for example, the Google Colab Notebook for RegioML (https://t.co/49hfVKuklb?amp=1), we do print all the predicted probabilities for the EAS sites.

2) The black circles are used in our paper to highlight the experimental reaction sites obtained from the Reaxys data.

Best wishes, Nicolai

ririjeong commented 2 years ago

Thank you for answering,

I tried the Colab code, but there are still some sites with no predicted probabilities. The molecule in the attached image has 16 sites but only 12 probabilities. I guess this is because identical atoms are removed. If so, is there a difference between removing identical atoms and not removing them?

mol

I also have another question regarding the model. The shape of the descriptor differs from molecule to molecule. How, then, does the LGBM model handle inputs of varying shape when making predictions?

Sincerely, ririjeong

NicolaiRee commented 2 years ago

Hi again,

In the output lists only unique EAS sites are shown, so identical sites do not appear in the list. In the depiction, however, all EAS sites are taken into account: an EAS site with a score above 5% is highlighted together with its identical atoms. This is done in the DescriptorCreator/molecule_svg.py file:

highlight_predicted, atom_scores = find_identical_atoms_with_scores(mol, highlight_predicted, atom_scores)

There is no difference between removing and keeping identical sites, as the model predicts from atomic descriptors.
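To make the relation between the unique-site list and the depiction concrete, here is a minimal sketch of copying each unique site's score onto its identical atoms. It is not the repository's find_identical_atoms_with_scores; the function name expand_scores_to_identical_atoms, the symmetry_ranks input (one integer per atom, equal for identical atoms) and all numbers are hypothetical and chosen only for illustration.

```python
def expand_scores_to_identical_atoms(symmetry_ranks, atom_indices, scores):
    """Copy the score of each unique site onto every atom sharing its rank.

    Hypothetical stand-in for illustration, not RegioML's implementation.
    symmetry_ranks[i] is an integer that is equal for identical atoms.
    """
    score_by_rank = {symmetry_ranks[i]: s for i, s in zip(atom_indices, scores)}
    expanded = [
        (i, score_by_rank[rank])
        for i, rank in enumerate(symmetry_ranks)
        if rank in score_by_rank
    ]
    indices = [i for i, _ in expanded]
    values = [s for _, s in expanded]
    return indices, values


# Invented example: atoms 1 and 2 are identical, as are atoms 3 and 4.
symmetry_ranks = [0, 1, 1, 2, 2, 3]
unique_sites = [1, 3]        # one representative per group of identical sites
unique_scores = [0.80, 0.03]

all_sites, all_scores = expand_scores_to_identical_atoms(
    symmetry_ranks, unique_sites, unique_scores
)

# Apply the 5% display threshold mentioned in the thread.
highlighted = [i for i, s in zip(all_sites, all_scores) if s > 0.05]
print(all_sites, all_scores, highlighted)
```

In this sketch the expanded list covers four atoms while only two scores were predicted, which mirrors the 16 sites versus 12 probabilities observation above.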

The atomic descriptor always has the same size (485 dimensions) no matter which molecule you are exploring. This is because the atomic descriptor is built by sorting the atomic CM5 charges according to the Cahn–Ingold–Prelog (CIP) rules. You can think of this as a convolution of the atomic charges around the atom of interest. Please have a look at Fig. 1 in our paper and note that we stop the sorting at the 5th shell.
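One way to see why the length does not depend on the molecule: if the central atom gets up to four neighbour slots and every later atom up to three onward slots, five shells give 1 + 4 + 12 + 36 + 108 + 324 = 485 slots. That branching pattern is an assumption that happens to reproduce the number quoted above; Fig. 1 in the paper is the authoritative description. The toy sketch below is not RegioML's code. It sorts neighbours by charge as a crude stand-in for CIP ranking, pads empty slots with zeros, and uses invented charges.

```python
# Assumed slot layout: 1 centre + 4 * 3**k slots in shell k+1, for 5 shells.
EXPECTED_LENGTH = 1 + sum(4 * 3**k for k in range(5))


def toy_shell_descriptor(adjacency, charges, center, n_shells=5, pad=0.0):
    """Fixed-length charge vector around `center` (toy illustration only)."""
    descriptor = [charges[center]]
    frontier = [(center, None)]  # (atom, parent); None marks an empty slot
    for shell in range(n_shells):
        slots = 4 if shell == 0 else 3  # assumed branching per atom
        next_frontier = []
        for entry in frontier:
            if entry is None:
                # An empty slot only has empty slots behind it.
                next_frontier.extend([None] * slots)
                continue
            atom, parent = entry
            neighbours = [n for n in adjacency[atom] if n != parent]
            # Stand-in for CIP ordering: sort by charge, highest first.
            neighbours.sort(key=lambda n: charges[n], reverse=True)
            neighbours = neighbours[:slots]
            next_frontier.extend((n, atom) for n in neighbours)
            next_frontier.extend([None] * (slots - len(neighbours)))
        descriptor.extend(
            pad if e is None else charges[e[0]] for e in next_frontier
        )
        frontier = next_frontier
    return descriptor


# Two toy graphs of different size (invented charges).
methane_like = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
methane_charges = [-0.3, 0.075, 0.075, 0.075, 0.075]
chain = {0: [1], 1: [0, 2], 2: [1]}
chain_charges = [-0.2, -0.1, -0.2]

d_methane = toy_shell_descriptor(methane_like, methane_charges, center=0)
d_chain = toy_shell_descriptor(chain, chain_charges, center=1)
print(EXPECTED_LENGTH, len(d_methane), len(d_chain))
```

Both toy graphs yield vectors of the same length, so a model with a fixed input width can consume either one.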

Best wishes, Nicolai

ririjeong commented 2 years ago

Thank you,

I tried to use RegioML, but there are still some problems I could not solve. I wanted to print the probabilities of all EAS sites and found this code.

image

I removed it and displayed the result.

The picture on the left is the result before removing the code, and the picture on the right is the result after removing it.

question photo

I tried to figure out the reason and extracted the descriptors after removing the code. I found that the descriptors were different even though the atomic sites were the same.

What is the reason for the changed result? I cannot figure out what is wrong.

NicolaiRee commented 1 year ago

Hi again,

If you wish to output the probabilities of all possible EAS sites, I recommend importing the following in the regioML.py file:

from DescriptorCreator.find_atoms import find_identical_atoms_with_scores

and then adding:

atom_indices, pred_proba = find_identical_atoms_with_scores(predictor.rdkit_mol, atom_indices, list(pred_proba))

Screenshot 2022-11-06 at 20 11 26 Screenshot 2022-11-06 at 20 09 08

RegioML is tested in this way and the performance you obtain should be identical to what we report in our paper.

However, I have investigated the issue a bit further and found the reason. RegioML relies on a single conformer embedding followed by a fast semiempirical quantum mechanical (SQM) calculation to obtain the CM5 atomic charges. The charges are then sorted into the input descriptors, which are used by the machine learning model to get a classification score. Here is a figure showing the calculated atomic charges for the particular molecule you are investigating:

Screenshot 2022-11-06 at 20 24 58

As you can see, the calculated atomic charges are not completely identical for atoms with otherwise identical ranking. These small deviations result in slightly different input descriptors, which in turn give different classification scores. In fact, we could use this finding in a future version by training on all EAS sites rather than only the unique ones. This would make the machine learning model more robust to these small deviations.
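A quick way to check this on your own molecule is to group the calculated charges by symmetry class and look at the spread within each group. The sketch below assumes you already have one symmetry rank per atom (for instance from RDKit's Chem.CanonicalRankAtoms(mol, breakTies=False)) and a matching list of charges; the helper name charge_spread_by_rank and the numbers are hypothetical, not taken from RegioML or from the figure above.

```python
from collections import defaultdict


def charge_spread_by_rank(symmetry_ranks, charges):
    """Return max - min of the charges within each group of identical atoms.

    Hypothetical helper for illustration. Groups with a single atom are
    skipped because they cannot show a deviation.
    """
    groups = defaultdict(list)
    for rank, charge in zip(symmetry_ranks, charges):
        groups[rank].append(charge)
    return {
        rank: max(values) - min(values)
        for rank, values in groups.items()
        if len(values) > 1
    }


# Invented numbers: atoms 1 and 2 are topologically identical, yet a single
# conformer gives them slightly different charges.
ranks = [0, 1, 1, 2]
charges = [-0.100, -0.052, -0.049, 0.200]

spread = charge_spread_by_rank(ranks, charges)
print(spread)
```

A non-zero spread for a group means the descriptors of those atoms differ slightly, so their classification scores can differ as well.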

Once again thank you for your interest in RegioML!

Best wishes, Nicolai