Inconsistency between values of the data in CHEMBL2147_Ki

a19s commented 1 year ago

Hey, I really appreciate your work - thank you very much for sharing the code and the data.

I found an inconsistency that I couldn't wrap my head around, and would like to ask you to clarify directly:

When looking at the data here: https://github.com/molML/MoleculeACE/blob/main/MoleculeACE/Data/benchmark_data/CHEMBL2147_Ki.csv

the file has a column called "exp_mean [nM]", and a "y" column which should be the -log10(exp_mean), according to visual inspection and to what you wrote in the paper: "The mean Ki or EC50 value for each molecule was computed and subsequently converted into pEC50/pKi values (as the negative logarithm of molar concentrations)"

However, there is an issue: SMILES with the same value of "exp_mean" (e.g. 100 nM) have "y" values that are either positive or negative (e.g. 2 or -2 in the example below), and I haven't found any way to make sense of this!

smiles	exp_mean [nM]	y
Cc1cncc(-c2cc3c(-c4cccc(N5CCNCC5)n4)n[nH]c3cn2)n1	100	2
Cc1ccc(F)c(-c2nc(C(=O)Nc3cnn(C)c3N3CCCC@@HCC3)c(N)s2)c1F	100	2
Cn1ncc(NC(=O)c2nc(-c3ccccc3F)sc2N)c1N1CCC@HCC(F)(F)C1	100	2
Nc1sc(-c2c(F)cccc2F)nc1C(=O)Nc1cnn(C2CC2)c1N1CCC@HCC(F)(F)C1	100	2
C=C(C)c1ccc(-c2n[nH]c3cnc(-c4cccnc4)cc23)nc1N1CCCC@HC1	100	2
C#Cc1ccc(-c2n[nH]c3cnc(-c4cccnc4)cc23)nc1N1CCCC@HC1	100	2
Cn1ncc(NC(=O)c2nc(-c3ccc(C(F)(F)F)cc3F)sc2N)c1[C@@H]1CCC@@H C@@HCO1	100	2
CO[C@H]1COC@Hsc3N)cnn2C)CC[C@H]1N	100	2
Cn1ncc(NC(=O)c2csc(-c3c(F)cc(C4(F)COC4)cc3F)n2)c1[C@@H]1CCC@@H C@HCO1	100	2
Nc1sc(-c2c(F)cccc2F)nc1C(=O)Nc1cnccc1N1CCCC@HC1	100	2
CN1CCC(N(C)c2ccc3nnc(-c4cccc(C(F)(F)F)c4)n3n2)CC1	100	-2
Cn1c2ccccc2c2c3c(c4c5ccccc5n(CCC#N)c4c21)CNC3=O	100	-2
c1ccc(CNc2cc(-c3c[nH]c4ncccc34)ncn2)cc1	100	-2
CSc1ccc2nc3c(c(Cl)c2c1)CCNC3=O	100	-2
Cc1n[nH]c2ccc(-c3cncc(OCC(N)Cc4ccccc4)c3)cc12	100	-2
O=c1[nH]c2sc3c(c2c2nc(-c4ccccc4)nn12)CCCC3	100	-2
O=C1NC(=O)C(c2c[nH]c3ccccc23)=C1c1nc(N2CCNCC2)nc2ccccc12	100	-2
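A check along the following lines flags such rows. It is only a minimal sketch: it assumes that "y" is meant to equal -log10 of the "exp_mean [nM]" value, as the numbers above suggest, and the function name `find_inconsistent_rows` and the in-memory row format are hypothetical, not part of MoleculeACE.

```python
import math

def find_inconsistent_rows(rows, tol=1e-6):
    """Return the rows whose y differs from -log10(exp_mean in nM).

    `rows` is an iterable of (smiles, exp_mean_nm, y) tuples; this layout
    is an assumption made for the example, not the repository's format.
    """
    bad = []
    for smiles, exp_mean_nm, y in rows:
        expected_y = -math.log10(exp_mean_nm)
        if abs(expected_y - y) > tol:
            bad.append((smiles, exp_mean_nm, y, expected_y))
    return bad

# Two rows taken from the table above: same exp_mean, opposite sign of y.
rows = [
    ("Cc1cncc(-c2cc3c(-c4cccc(N5CCNCC5)n4)n[nH]c3cn2)n1", 100.0, 2.0),
    ("CN1CCC(N(C)c2ccc3nnc(-c4cccc(C(F)(F)F)c4)n3n2)CC1", 100.0, -2.0),
]

inconsistent = find_inconsistent_rows(rows)
for smiles, exp_mean_nm, y, expected_y in inconsistent:
    print(f"{smiles}: exp_mean={exp_mean_nm} nM, y={y}, expected y={expected_y}")
```

Under that assumption only the row with y = 2 is reported, because -log10(100) is -2.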

Could you please clarify what is the origin of this inconsistency?

Thank you!

githubXin123 commented 1 year ago

@a19s I checked the dataset and, apart from the few data points you mentioned, there are also some other data points with the same issue: they have identical 'exp_mean' values but different 'y' values.

a19s commented 1 year ago

> @a19s I checked the dataset and, apart from the few data points you mentioned, there are also some other data points with the same issue: they have identical 'exp_mean' values but different 'y' values.

Yup @githubXin123, this was just an example that was easy to reproduce, but there are more values with the same issue.
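To list every affected value rather than a single example, the rows can be grouped by "exp_mean" and any group with more than one distinct "y" reported. This is a rough sketch in plain Python; the helper name `conflicting_groups` and the tuple layout are made up for illustration, and exact float comparison is assumed to be good enough for values read from the CSV.

```python
from collections import defaultdict

def conflicting_groups(rows):
    """Map each exp_mean value to its distinct y values, keeping only conflicts.

    `rows` is an iterable of (smiles, exp_mean_nm, y) tuples (hypothetical layout).
    """
    ys_by_exp_mean = defaultdict(set)
    for _smiles, exp_mean_nm, y in rows:
        ys_by_exp_mean[exp_mean_nm].add(y)
    # A group is a conflict when one exp_mean maps to several y values.
    return {k: sorted(v) for k, v in ys_by_exp_mean.items() if len(v) > 1}

# Three rows from the table in the first comment.
rows = [
    ("Cc1cncc(-c2cc3c(-c4cccc(N5CCNCC5)n4)n[nH]c3cn2)n1", 100.0, 2.0),
    ("CN1CCC(N(C)c2ccc3nnc(-c4cccc(C(F)(F)F)c4)n3n2)CC1", 100.0, -2.0),
    ("c1ccc(CNc2cc(-c3c[nH]c4ncccc34)ncn2)cc1", 100.0, -2.0),
]

conflicts = conflicting_groups(rows)
print(conflicts)  # {100.0: [-2.0, 2.0]}
```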

@derekvantilborg any idea of why this is the case?

derekvantilborg commented 1 year ago

Hi, thanks for pointing this out. It gave me a proper scare. This is a post-mortem of what happened with the data:

1. I read the raw, unprocessed dataframe with data scraped from ChEMBL in data_prep.py.
2. I convert the nM values to -log10 by doing: bioactivity = -np.log10(bioactivity). This is fine and causes no issues.
3. I perform some data processing, cliff calculation, splitting, etc. Everything uses the -log10 values (which are fine).
4. I save all the processed data to a new csv. Here I convert back from -log10 to the original value using: 10 ** abs(np.array(bioactivity)). This is where it goes wrong. For some reason there is an abs() in there that should not be there. I don't know why this happened, but it did. This is why 10^(abs(-2)) and 10^(abs(2)) both yield 100 nM instead of 0.01 and 100. This means that the 'y' column is the correct value and the 'exp_mean' column is wrong.
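The effect of the stray abs() can be reproduced in a few lines. The sketch below uses plain Python instead of NumPy, and the function names are hypothetical rather than taken from data_prep.py. The "fixed" version is simply the mathematical inverse of y = -log10(x), namely x = 10^(-y), and is an assumption about the correction, not the code that was actually committed.

```python
import math

def to_neg_log10(value_nm):
    """Forward transform described above: nM value -> -log10(nM)."""
    return -math.log10(value_nm)

def back_transform_buggy(y):
    """Back-transform with the stray abs(): the sign of y is lost."""
    return 10 ** abs(y)

def back_transform_fixed(y):
    """Inverse of y = -log10(x), i.e. x = 10 ** (-y)."""
    return 10 ** (-y)

for original_nm in (0.01, 100.0):
    y = to_neg_log10(original_nm)
    print(
        f"original={original_nm} nM  y={y:+.1f}  "
        f"buggy={back_transform_buggy(y):g} nM  "
        f"fixed={back_transform_fixed(y):g} nM"
    )
```

Both inputs come back as 100 nM through the buggy path, while the inverse without abs() recovers 0.01 nM and 100 nM. That matches the observation that identical 'exp_mean' values appear next to 'y' values of opposite sign.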

Luckily, for all model training, evaluation, etc., I just use the -log10 values from the 'y' column. This means that the results of the study should stay the same.

I will update the CSVs with their correctly transformed 'exp_mean' values and fix this bug in the code.

a19s commented 1 year ago

Thank you very much @derekvantilborg for following this up - I am also very happy to hear about your findings :) Keep up the great work!

githubXin123 commented 1 year ago

@derekvantilborg Thank you for your response. I would also appreciate it if you could carefully double-check the SMILES strings corresponding to these data.

derekvantilborg commented 1 year ago

@githubXin123 I'm on it

shenwanxiang commented 1 year ago

@derekvantilborg Hi, Tilborg, the raw data does not seem to have been fixed yet: https://github.com/molML/MoleculeACE/blob/main/MoleculeACE/Data/benchmark_data/raw/

derekvantilborg commented 1 year ago

Hi all. I'm aware that the data is not fixed yet. I'm working on a revision with the corrected code, data, and results. Recomputing the results takes a while, so I expect to update the repo sometime next week.

derekvantilborg commented 1 year ago

Thank you all for being so patient. I released a new version (V3) of the benchmark with corrected code, data, and results. We also submitted a correction to the paper. Luckily the findings from the corrected results match the findings in the original paper. I'm very sorry for the inconvenience this bug may have caused some of you.

cheers, Derek

a19s commented 1 year ago

Thank you very much Derek!

On 29 Sep 2023, at 08:14, Derek van Tilborg @.***> wrote:

Thank you all for being so patient. I released a new version (V3) of the benchmark with corrected code, data, and results. We also submitted a correction to the paper. Luckily the findings from the corrected results match the findings in the original paper. I'm very sorry for the inconvenience this bug may have caused some of you.

cheers, Derek


molML / MoleculeACE

Inconsistency between values of the data in CHEMBL2147_Ki #12