mims-harvard / TDC

Therapeutics Commons: Artificial Intelligence Foundation for Therapeutic Science
https://tdcommons.ai
MIT License
956 stars, 169 forks

[Docking Leaderboard DRD3] Reproducibility Issues #235

Closed Jonas-Verhellen closed 2 days ago

Jonas-Verhellen commented 3 months ago

Dear maintainers of the TDC project,

I'm trying to reproduce the results obtained in the DRD3 docking group benchmark for the GB-GA model. I am, however, running into a few issues.

  1. I cannot reproduce the docking values for some individual molecules.
    • Some examples, taken from the top docking scores reported in the User Group Meeting. I obtain different docking values using the same oracle (reported value first, my value second):
      • O=c1c2c(Br)cccc2ncn1Cc1cc(F)c(-c2n~c3c(C4=NNN=N4)cccc3o2)cc1F: -12.2 vs -11.4
      • O=C1OC(=O)C23CCOCC12N=NN3CC12CC(C3=NC(c4cncc5ccccc45)=NN3)(CO1)C2: -12.2 vs -12.5
      • CC(=O)C(c1cccc(C(=O)c2cccc3ccccc23)c1)c1noc(-c2c[nH]nc2C2CCCCC2)n1: 12.0 vs -10.5
  2. Unfortunately, I also cannot locate all the pickle files for the performance currently claimed in the benchmark. The GitHub repo linked to the benchmark is missing the majority of these files. I have noticed that the website does have a visualization of the molecules. Is it possible to find (or publicly release) all the molecules in SMILES format with their docking scores as they were submitted to the benchmark?

  3. It is not entirely clear to me which dataset is used to seed the algorithms. Is it Zinc 250k or guacamol_v1_all.smiles?

Kind regards, Jonas

amva13 commented 3 months ago

Dear Marinka and Maintainers of the TDC Project,

I hope this email finds you well. I am reaching out to you regarding some issues I have encountered while attempting to reproduce the results obtained in the DRD3 docking group benchmark. As I hope to utilize your benchmark as the conclusion of an upcoming paper introducing a novel and significantly more effective generative model, I am keen to resolve these issues.

More specifically, if I look at the best performing model in the benchmark (GB-GA), I am having trouble locating the files for the current performance in the benchmark. The GitHub repository linked to the benchmark appears to be missing the majority of these files. Is it possible to obtain or publicly release all the molecules in SMILES format along with their corresponding docking scores as they were submitted to the benchmark?

In addition, I have encountered discrepancies between the reported docking values and the ones I obtain for several individual molecules. For instance, these SMILES from the smiles_lstm_2_5000.txt file have markedly different reported docking scores than the ones I currently obtain from the oracle, installed according to the instructions on the TDC website (reported value first, my value second):

  • O=C(CCOc1ccccc1F)Oc1ccccc1C(=O)CCCCc1ccc(C(F)(F)F)cc1: -15 vs -9.2
  • O=C(CCOc1ccccc1)Oc1ccccc1C(=O)CCCc1ccccc1F: -15 vs -9.2
  • O=C(CCOc1ccccc1F)Oc1ccccc1C(=O)CCCOc1ccccc1C(F)(F)F: -15.0 vs -10.3
  • O=C(Nc1ccccc1F)Oc1ccccc1C(=O)CCc1ccccc1C(F)(F)F: -14.6 vs -9.0
  • O=C(CCOc1ccccc1F)Oc1ccccc1C(=O)CCCOc1ccccc1Cl: -14.5 vs -9.1
  • O=C(CCOc1ccccc1F)Oc1ccccc1C(=O)CCCOc1ccccc1F: -14.4 vs -8.9
  • O=C(CCOc1ccccc1F)Oc1ccccc1C(=O)CCCOCc1ccccc1: -14.4 vs -9.0
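
The size of these gaps can be tabulated with a short script. This is a minimal sketch that uses only the seven score pairs quoted above; the helper name `summarize` is hypothetical, and no TDC oracle is called.

```python
# (reported, obtained) docking scores quoted above, in the same order.
pairs = [
    (-15.0, -9.2),
    (-15.0, -9.2),
    (-15.0, -10.3),
    (-14.6, -9.0),
    (-14.5, -9.1),
    (-14.4, -8.9),
    (-14.4, -9.0),
]


def summarize(pairs):
    """Return (mean, min, max) of the gap `obtained - reported`."""
    gaps = [obtained - reported for reported, obtained in pairs]
    return sum(gaps) / len(gaps), min(gaps), max(gaps)


mean_gap, min_gap, max_gap = summarize(pairs)
print(f"mean gap: {mean_gap:.2f}, range: {min_gap:.1f} to {max_gap:.1f}")
```

A positive gap means the obtained score is weaker (less negative) than the reported one, which is the case for every molecule in this list.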

I am uncertain whether these discrepancies stem from specific settings, something simple I've missed, or a change in the backend. Would it be possible to please provide any clarification on this matter?

Thank you in advance.

amva13 commented 3 months ago

@Jonas-Verhellen will have a look

amva13 commented 3 months ago

@kexinhuang12345

amva13 commented 3 months ago

@futianfan @wenhao-gao are you able to help with this?

amva13 commented 2 months ago

Hi @Jonas-Verhellen, which versions of scikit-learn and TDC are you using? I'm fairly sure the cause is the same as in the issue linked below. I'm checking how to resolve it.

https://github.com/mims-harvard/TDC/issues/244

Jonas-Verhellen commented 2 months ago

Hi @amva13, thanks for looking into this! I am using scikit-learn 1.3.0 and pytdc 0.4.1 with python 3.10.12. Let me know if you need any more information.
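
For reports like this one, the version details can be collected with a short script. This is a minimal sketch using only the Python standard library; the distribution names `PyTDC` and `scikit-learn` are assumptions about how the packages were installed, and `installed_versions` is a hypothetical helper.

```python
import platform
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names are assumptions; adjust them to match your environment.
report = installed_versions(["PyTDC", "scikit-learn"])
report["python"] = platform.python_version()

for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```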

Jonas-Verhellen commented 3 weeks ago

Hi @amva13,

I'm checking in. How are things on this front? Any more clarity?

Kind regards, Jonas

amva13 commented 3 weeks ago

Hi @Jonas-Verhellen, I expect to be able to dive into this after June 21st. At the moment, there are conferences in the way. Sorry for the inconvenience!

Jonas-Verhellen commented 3 weeks ago

Hi @amva13,

No problem at all. Thanks for looking into this. Have a good time at the conferences!

Kind regards, Jonas

amva13 commented 6 days ago

Making a note here: this and many similar issues are probably due to the following alert.

1 repository in your mims-harvard organization might be affected by a security vulnerability in nltk

nltk unsafe deserialization vulnerability, High severity, nltk CVE-2024-39705

mims-harvard/TDC
  • [examples/generation/docking_generation/guacamol_tdc/guacamol_baselines/dockers/requirements.txt](https://github.com/mims-harvard/TDC/security/dependabot/799)
  • [examples/generation/docking_generation/guacamol_tdc/guacamol_baselines/requirements.txt](https://github.com/mims-harvard/TDC/security/dependabot/800)

amva13 commented 2 days ago

https://github.com/mims-harvard/TDC/issues/291 <-- solution being worked on in this ticket. follow this one.

amva13 commented 2 days ago

We have updated the oracles for jnk3, gsk3b, and drd2 and reproduced results perfectly.

Unfortunately, the oracles in this ticket depend on the Coley Group's software, and we cannot guarantee cross-compatibility at all times. In the future, we will look to update our documentation to distinguish pure TDC oracles from community oracles. Given the inherent stochasticity in the models, changes can be expected even for slight version changes.
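
One way to judge whether a score difference exceeds run-to-run noise is to score the same molecule several times and look at the spread. This is a minimal sketch under stated assumptions: `score_spread` is a hypothetical helper, and `noisy_stand_in` is a made-up scorer used only so the example runs, not a TDC or pyscreener oracle. In practice, any callable that takes a SMILES string and returns a float could be passed instead.

```python
import random
import statistics


def score_spread(score_fn, smiles, n_runs=5):
    """Score one molecule n_runs times; return (mean, sample standard deviation)."""
    scores = [score_fn(smiles) for _ in range(n_runs)]
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical stand-in for a stochastic docking oracle (assumption, not real data).
rng = random.Random(0)


def noisy_stand_in(smiles):
    return -10.0 + rng.gauss(0.0, 0.3)


mean_score, spread = score_spread(noisy_stand_in, "CCO", n_runs=10)
print(f"mean: {mean_score:.2f}, stdev: {spread:.2f}")
```

If the gap between a reported and an obtained score is many times larger than this spread, stochasticity alone is unlikely to explain it.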

Our package has been updated to 1.0.0 to indicate that identical backwards compatibility is not guaranteed. At the end of the day, despite the numbers being different, we still believe they're reliable, but whether there are serious dependency concerns can only be addressed with full confidence by the Coley group @wenhao-gao

One example, though there are many: https://github.com/coleygroup/pyscreener (which is not installed in TDC by default).

Reproducibility for TDC-maintained oracles is proved as of this update, https://github.com/mims-harvard/TDC/pull/293, in pytdc version 1.0.0.

we will be closing this and the associated tickets accordingly.

amva13 commented 2 days ago

In addition, please note that the benchmark results in the user group meetup are for a particular example and not for any given model. If you believe there are serious issues there, please provide the code and training you're using to evaluate. If you're just running the same oracle, please refer to the above.

amva13 commented 2 days ago

see https://github.com/mims-harvard/TDC/issues/245 for full description