topazape / LSTM_Chem

Implementation of the paper - Generative Recurrent Networks for De Novo Drug Design.
The Unlicense

Different result from dataset query #10

Open Laser-Cho opened 4 years ago

Laser-Cho commented 4 years ago

I tried to reproduce the dataset from the same source (ChEMBL22) that the authors say they used in the paper, by referring to your code, but the results are different.

I tried the following query:

SELECT DISTINCT canonical_smiles FROM compound_structures WHERE molregno IN ( SELECT DISTINCT molregno FROM activities WHERE standard_type IN ("Kd", "Ki", "Kb", "IC50", "EC50") AND standard_units = "nM" );

The result is 802,320 rows.

The authors describe a "dataset of 677,044 SMILES strings with annotated nanomolar activities (Kd/i/B, IC/EC50) from ChEMBL22".

So I used ChEMBL22, and added [standard_units = "nM"] for "nanomolar" and [standard_type IN ("Kd", "Ki", "Kb", "IC50", "EC50")] for "activities (Kd/i/B, IC/EC50)".
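The mapping described above can be sketched as a parameterized query run through Python's `sqlite3` module. This is only a sketch under stated assumptions: the table and column names come from the query in this thread, while the helper names `count_distinct_smiles` and `build_toy_db` and the tiny in-memory dataset are hypothetical and exist only so the example runs without a ChEMBL download. Against a real ChEMBL SQLite file, `sqlite3.connect(":memory:")` would be replaced by a connection to that file.

```python
import sqlite3

# Activity types taken from the query in this thread.
ACTIVITY_TYPES = ("Kd", "Ki", "Kb", "IC50", "EC50")


def count_distinct_smiles(conn, types=ACTIVITY_TYPES, units="nM"):
    """Count distinct SMILES with at least one activity of the given types and units."""
    placeholders = ", ".join("?" for _ in types)
    sql = f"""
        SELECT COUNT(DISTINCT canonical_smiles)
        FROM compound_structures
        WHERE molregno IN (
            SELECT DISTINCT molregno
            FROM activities
            WHERE standard_type IN ({placeholders})
              AND standard_units = ?
        )
    """
    # Values are bound as parameters, so no string quoting is needed in the SQL.
    return conn.execute(sql, (*types, units)).fetchone()[0]


def build_toy_db():
    """Build a hypothetical in-memory database with made-up rows (not ChEMBL data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE compound_structures (molregno INTEGER PRIMARY KEY, canonical_smiles TEXT);
        CREATE TABLE activities (molregno INTEGER, standard_type TEXT, standard_units TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO compound_structures VALUES (?, ?)",
        [(1, "CCO"), (2, "c1ccccc1"), (3, "CC(=O)O"), (4, "CCO")],
    )
    conn.executemany(
        "INSERT INTO activities VALUES (?, ?, ?)",
        [
            (1, "IC50", "nM"),
            (1, "Ki", "nM"),
            (2, "IC50", "ug.mL-1"),  # excluded: units are not nM
            (3, "Potency", "nM"),    # excluded: type is not in the list
            (4, "EC50", "nM"),       # included, but its SMILES duplicates molregno 1
        ],
    )
    return conn


toy = build_toy_db()
print(count_distinct_smiles(toy))
```

Binding the values as parameters also sidesteps the question of single versus double quotes around string literals, which different SQL engines treat differently. Note that `COUNT(DISTINCT ...)` ignores NULL values, so it can differ by one from the row count of a plain `SELECT DISTINCT` if any structure has no SMILES.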

What did I miss?

topazape commented 4 years ago

Hi, @Laser-Cho,

Sorry for the late reply.

You are right. I am aware that the number of molecules used in the paper does not match the number returned by the SQL query described in README.md. However, I don't know the correct SQL query because the paper doesn't give it. If you have any good ideas, please let me know.
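One way to explore ideas is to count rows under several candidate filters side by side. The sketch below is an illustration only: none of the candidates is known to be the paper's filter. It assumes the `standard_value` and `standard_relation` columns of the ChEMBL `activities` table; the 1000 nM cutoff, the helper names, and the toy rows are hypothetical choices made for the example.

```python
import sqlite3

# Filter from the README query discussed in this thread.
BASE = "standard_type IN ('Kd', 'Ki', 'Kb', 'IC50', 'EC50') AND standard_units = 'nM'"

# Candidate WHERE clauses to compare. The extra conditions are guesses, not the paper's method.
CANDIDATES = {
    "README query": BASE,
    "value below 1000 nM": BASE + " AND standard_value < 1000",
    "value below 1000 nM, exact relation": BASE
    + " AND standard_value < 1000 AND standard_relation = '='",
}


def count_candidates(conn, candidates=CANDIDATES):
    """Return {candidate name: number of distinct SMILES matching that filter}."""
    counts = {}
    for name, where in candidates.items():
        sql = (
            "SELECT COUNT(DISTINCT canonical_smiles) FROM compound_structures "
            f"WHERE molregno IN (SELECT molregno FROM activities WHERE {where})"
        )
        counts[name] = conn.execute(sql).fetchone()[0]
    return counts


def build_toy_activities_db():
    """Build a hypothetical in-memory database with made-up rows (not ChEMBL data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE compound_structures (molregno INTEGER PRIMARY KEY, canonical_smiles TEXT);
        CREATE TABLE activities (
            molregno INTEGER, standard_type TEXT, standard_units TEXT,
            standard_value REAL, standard_relation TEXT
        );
        """
    )
    conn.executemany(
        "INSERT INTO compound_structures VALUES (?, ?)",
        [(1, "CCO"), (2, "CCN"), (3, "CCC")],
    )
    conn.executemany(
        "INSERT INTO activities VALUES (?, ?, ?, ?, ?)",
        [
            (1, "IC50", "nM", 50.0, "="),
            (2, "Ki", "nM", 25000.0, "="),  # reported in nM, but a micromolar-range value
            (3, "EC50", "nM", 900.0, "<"),  # below the cutoff, but not an exact measurement
        ],
    )
    return conn


for name, n in count_candidates(build_toy_activities_db()).items():
    print(f"{name}: {n}")
```

Run against a real ChEMBL22 database, a comparison like this would show whether any candidate lands near the 677,044 figure. A match would still be circumstantial, since the paper does not state the query.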