molML / MoleculeACE

A tool for evaluating the predictive performance on activity cliff compounds of machine learning models
MIT License

Inconsistency between values of the data in CHEMBL2147_Ki #12

Closed: a19s closed this issue 1 year ago

a19s commented 1 year ago

Hey, I really appreciate your work - thank you very much for sharing the code and the data.

I found an inconsistency that I couldn't wrap my head around, and would like to ask you to clarify directly:

When looking at the data here: https://github.com/molML/MoleculeACE/blob/main/MoleculeACE/Data/benchmark_data/CHEMBL2147_Ki.csv

the file has a column called "exp_mean [nM]", and a "y" column which should be the -log10(exp_mean), according to visual inspection and to what you wrote in the paper: "The mean Ki or EC50 value for each molecule was computed and subsequently converted into pEC50/pKi values (as the negative logarithm of molar concentrations)"

However, there is an issue: SMILES with the same "exp_mean" value (e.g. 100 nM) have "y" values that are either positive or negative (e.g. 2 or -2 in the example below), and I haven't found any way to make sense of this!

smiles exp_mean [nM] y
Cc1cncc(-c2cc3c(-c4cccc(N5CCNCC5)n4)n[nH]c3cn2)n1 100 2
Cc1ccc(F)c(-c2nc(C(=O)Nc3cnn(C)c3N3CCCC@@HCC3)c(N)s2)c1F 100 2
Cn1ncc(NC(=O)c2nc(-c3ccccc3F)sc2N)c1N1CCC@HCC(F)(F)C1 100 2
Nc1sc(-c2c(F)cccc2F)nc1C(=O)Nc1cnn(C2CC2)c1N1CCC@HCC(F)(F)C1 100 2
C=C(C)c1ccc(-c2n[nH]c3cnc(-c4cccnc4)cc23)nc1N1CCCC@HC1 100 2
C#Cc1ccc(-c2n[nH]c3cnc(-c4cccnc4)cc23)nc1N1CCCC@HC1 100 2
Cn1ncc(NC(=O)c2nc(-c3ccc(C(F)(F)F)cc3F)sc2N)c1[C@@H]1CCC@@HC@@HCO1 100 2
CO[C@H]1COC@Hsc3N)cnn2C)CC[C@H]1N 100 2
Cn1ncc(NC(=O)c2csc(-c3c(F)cc(C4(F)COC4)cc3F)n2)c1[C@@H]1CCC@@HC@HCO1 100 2
Nc1sc(-c2c(F)cccc2F)nc1C(=O)Nc1cnccc1N1CCCC@HC1 100 2
CN1CCC(N(C)c2ccc3nnc(-c4cccc(C(F)(F)F)c4)n3n2)CC1 100 -2
Cn1c2ccccc2c2c3c(c4c5ccccc5n(CCC#N)c4c21)CNC3=O 100 -2
c1ccc(CNc2cc(-c3c[nH]c4ncccc34)ncn2)cc1 100 -2
CSc1ccc2nc3c(c(Cl)c2c1)CCNC3=O 100 -2
Cc1n[nH]c2ccc(-c3cncc(OCC(N)Cc4ccccc4)c3)cc12 100 -2
O=c1[nH]c2sc3c(c2c2nc(-c4ccccc4)nn12)CCCC3 100 -2
O=C1NC(=O)C(c2c[nH]c3ccccc23)=C1c1nc(N2CCNCC2)nc2ccccc12 100 -2
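
For reference, here is a minimal sketch of the arithmetic behind the rows above. It only compares what a molar-scale pKi would be for 100 nM with the two "y" values seen in the file; it does not claim to reproduce the repository's conversion code, and the helper name `nm_to_p` is hypothetical.

```python
import math


def nm_to_p(value_nm: float) -> float:
    """Negative log10 of a concentration given in nM, on the molar scale.

    1 nM = 1e-9 M, so -log10(value_nm * 1e-9) = 9 - log10(value_nm).
    Hypothetical helper, not taken from MoleculeACE.
    """
    return 9.0 - math.log10(value_nm)


exp_mean_nm = 100.0  # the "exp_mean [nM]" value shared by all rows above

p_molar = nm_to_p(exp_mean_nm)         # pKi if the value is converted to molar first
log_raw = math.log10(exp_mean_nm)      # log10 of the raw nM number
neg_log_raw = -math.log10(exp_mean_nm) # -log10 of the raw nM number

print(p_molar, log_raw, neg_log_raw)
```

For 100 nM the molar-scale value is 7, while log10 and -log10 of the raw nM number give 2 and -2, which are the two "y" values in the table. That is only an arithmetic observation and does not explain why the sign differs between rows.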

Could you please clarify the origin of this inconsistency?

Thank you!

githubXin123 commented 1 year ago

@a19s I checked the dataset and, apart from the few data points you mentioned, there are also some other data points with the same issue: they have identical 'exp_mean' values but different 'y' values.
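
A check like this can be sketched with the standard library alone. The snippet below groups rows by their "exp_mean [nM]" value and reports every group that has more than one distinct "y". It assumes a comma-separated file with the column names quoted in the issue; the function name `find_inconsistent_groups` and the three-row inline sample (rows copied from the example above) are for illustration only.

```python
import csv
import io
from collections import defaultdict


def find_inconsistent_groups(rows, key="exp_mean [nM]", target="y"):
    """Map each exp_mean value to its distinct y values, keeping only
    the groups where more than one distinct y occurs.

    Grouping uses exact float equality, which is an assumption that
    suits values repeated verbatim in the file.
    """
    groups = defaultdict(set)
    for row in rows:
        groups[float(row[key])].add(float(row[target]))
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}


# Inline sample standing in for the CSV; the real file layout is assumed.
SAMPLE = """smiles,exp_mean [nM],y
Cc1cncc(-c2cc3c(-c4cccc(N5CCNCC5)n4)n[nH]c3cn2)n1,100,2
CSc1ccc2nc3c(c(Cl)c2c1)CCNC3=O,100,-2
c1ccc(CNc2cc(-c3c[nH]c4ncccc34)ncn2)cc1,100,-2
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
inconsistent = find_inconsistent_groups(rows)
print(inconsistent)  # {100.0: [-2.0, 2.0]}
```

To run it on a downloaded copy of the benchmark file, replace `io.StringIO(SAMPLE)` with an opened file handle, e.g. `open(path, newline="")`, where `path` points to the local CSV.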

a19s commented 1 year ago

Yup @githubXin123, this was just an example that was easy to reproduce, but there are more values with the same issue.

@derekvantilborg any idea why this is the case?

derekvantilborg commented 1 year ago

Hi, thanks for pointing this out. It gave me a proper scare. This is a post-mortem of what happened with the data:

Luckily, for all model training, evaluation, etc., I only use the -log10 values from the 'y' column. This means that the results of the study should stay the same.

I will update the CSVs with their correctly transformed 'exp_mean' values and fix this bug in the code.

a19s commented 1 year ago

Thank you very much @derekvantilborg for following this up - I am also very happy to hear about your findings :) Keep up the great work!

githubXin123 commented 1 year ago

@derekvantilborg Thank you for your response. I would also appreciate it if you could carefully double-check the SMILES strings corresponding to these data.

derekvantilborg commented 1 year ago

@githubXin123 I'm on it

shenwanxiang commented 1 year ago

@derekvantilborg Hi, Tilborg, the raw data does not seem to have been fixed yet: https://github.com/molML/MoleculeACE/blob/main/MoleculeACE/Data/benchmark_data/raw/

derekvantilborg commented 1 year ago

Hi all. I'm aware that the data has not been fixed yet. I'm working on a revision with the corrected code, data, and results. Recomputing the results takes a while, so I expect to update the repo sometime next week.

derekvantilborg commented 1 year ago

Thank you all for being so patient. I released a new version (V3) of the benchmark with corrected code, data, and results. We also submitted a correction to the paper. Luckily, the findings from the corrected results match the findings in the original paper. I'm very sorry for the inconvenience this bug may have caused some of you.

cheers, Derek

a19s commented 1 year ago

Thank you very much Derek!

In reply to Derek van Tilborg's comment of 29 Sep 2023, 08:14: https://github.com/molML/MoleculeACE/issues/12#issuecomment-1740415108