Open Laser-Cho opened 4 years ago
Hi, @Laser-Cho,
Sorry for the late reply.
You are right. I am aware that the number of molecules used in the paper does not match the number returned by the SQL query described in the README.md. However, I don't know the correct SQL query, because the paper doesn't give it. If you have any good ideas, please let me know.
I tried to reproduce the results with the same dataset (ChEMBL22) that the author said was used in the paper, referring to the code you created, but the results are different.
I tried the query below.
SELECT DISTINCT canonical_smiles FROM compound_structures WHERE molregno IN ( SELECT DISTINCT molregno FROM activities WHERE standard_type IN ("Kd", "Ki", "Kb", "IC50", "EC50") AND standard_units = "nM" );
The result is 802,320 rows, but the author describes a "dataset of 677,044 SMILES strings with annotated nanomolar activities(Kd/i/B, IC/EC50) from ChEMBL22".
So I used ChEMBL22, added [standard_units = "nM"] for "nanomolar", and added [standard_type IN ("Kd", "Ki", "Kb", "IC50", "EC50")] for "activities(Kd/i/B, IC/EC50)".
What did I miss?
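One way to narrow down the gap might be to wrap the query in a small script and check how the row count changes when a candidate extra condition is added. The sketch below is only an illustration: the toy rows are made up, the helper names `count_smiles` and `make_toy_db` are hypothetical, and the extra condition `standard_value IS NOT NULL` is just a guess to test, not a filter stated in the paper. It uses single-quoted string literals, which is the portable SQL form.

```python
import sqlite3

# The query from this thread, with single-quoted string literals and an
# optional extra condition on the activities table.
QUERY = """
SELECT COUNT(DISTINCT canonical_smiles)
FROM compound_structures
WHERE molregno IN (
    SELECT molregno
    FROM activities
    WHERE standard_type IN ('Kd', 'Ki', 'Kb', 'IC50', 'EC50')
      AND standard_units = 'nM'
      {extra}
)
"""


def count_smiles(conn, extra=""):
    """Count distinct SMILES; `extra` is a trusted, hand-written SQL fragment."""
    return conn.execute(QUERY.format(extra=extra)).fetchone()[0]


def make_toy_db():
    """Build a tiny in-memory stand-in for the two ChEMBL tables (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE compound_structures (molregno INTEGER, canonical_smiles TEXT);
        CREATE TABLE activities (molregno INTEGER, standard_type TEXT,
                                 standard_units TEXT, standard_value REAL);
        INSERT INTO compound_structures VALUES
            (1, 'CCO'), (2, 'c1ccccc1'), (3, 'CC(=O)O'), (4, 'CCN');
        INSERT INTO activities VALUES
            (1, 'IC50', 'nM', 12.0),
            (1, 'Ki', 'nM', 3.5),
            (2, 'EC50', 'nM', NULL),    -- matching type and unit, but no value
            (3, 'IC50', 'uM', 1.0),     -- wrong unit
            (4, 'Potency', 'nM', 50.0); -- wrong activity type
    """)
    return conn


if __name__ == "__main__":
    conn = make_toy_db()
    print("baseline:", count_smiles(conn))
    print("with value filter:",
          count_smiles(conn, "AND standard_value IS NOT NULL"))
```

To try this against the real data, replace `make_toy_db()` with `sqlite3.connect(...)` pointing at a local ChEMBL22 SQLite file, and compare each candidate condition's count with the 677,044 reported in the paper.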