Possible test set leak - GitHub issues

Hi,

First of all, thank you for your very interesting work on predicting molecular properties! I have already tried siamese networks for this type of model, but unfortunately without any notable success so far.

I find your solution to build a vector representing the dataset with a pretrained model very interesting, and your benchmarks speak for themselves.

However, when trying to reproduce the results (in particular: https://github.com/ph-mehdi/BioAct-Het/blob/main/(MUV)%20Association_based_strategy.ipynb), I think I found two main test set leaks in your training procedure.

First, you're using the DGLLife pretrained model GCN_attentivefp_MUV. As I understand it, it has been trained on the full MUV dataset in a supervised manner, so I am wondering if your input embeddings are generated by a model trained on the test set.

Also, when building the vector representing each assay, you're averaging Morgan FPs of the whole dataset, without taking into account future splits. I think this vector should rather be constructed after the splits, only on train Morgan FPs.
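To make the second point concrete, here is a minimal sketch of what I mean. It is not taken from the notebook: the random bit vectors are only stand-ins for real Morgan FPs, and `mean_fingerprint` and `split_indices` are hypothetical helper names I made up for illustration.

```python
import random


def mean_fingerprint(fps):
    """Element-wise mean of equal-length bit vectors."""
    if not fps:
        raise ValueError("need at least one fingerprint")
    n_bits = len(fps[0])
    return [sum(fp[i] for fp in fps) / len(fps) for i in range(n_bits)]


def split_indices(n, test_fraction, seed):
    """Shuffle indices 0..n-1 and return (train_idx, test_idx)."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = max(1, int(round(n * test_fraction)))
    return idx[n_test:], idx[:n_test]


# Toy data: random bit vectors standing in for the Morgan FPs of one assay.
rng = random.Random(0)
N_BITS = 16
fps = [[rng.randint(0, 1) for _ in range(N_BITS)] for _ in range(10)]

# Split first, then build the assay vector from the training compounds only.
train_idx, test_idx = split_indices(len(fps), test_fraction=0.2, seed=42)
assay_vector = mean_fingerprint([fps[i] for i in train_idx])

# The pattern I am worried about: averaging over every compound,
# so the test compounds contribute to the assay vector.
leaky_vector = mean_fingerprint(fps)
```

With the first variant, the test compounds cannot influence `assay_vector`, because they are never passed to the averaging step.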

Am I wrong on these points? Have you done any other tests that show that the model works well without these leaks?

Thank you in advance, Paul

First of all, thank you very much for taking the time to review this work.

Regarding your first question: what you describe is correct, and this leak may have occurred. We examined this issue in the section "Dependency Assessment of the Model to the Pre-trained GCN Models" of the article (https://pubs.acs.org/doi/10.1021/acsomega.3c05778#), and the corresponding code is here: https://github.com/CBRC-lab/BioAct-Het/blob/main/Transfer_Learning_%26_Case_Study.ipynb. In this experiment, we use the GCN-AttentiveFP model pre-trained on the SIDER database to represent the chemical compounds in the Tox21 database. We also remove the chemical structures common to SIDER and Tox21. BioAct-Het is then trained on Tox21 using the chemical representations obtained from GCN-AttentiveFP pre-trained on SIDER.

Regarding your second question: we designed three different strategies to evaluate the performance of the model. One of them is the "Compound-Based Strategy", which is fully described in the article. In this strategy, the profile is built from fingerprint vectors in such a way that the training set shares no compounds with the test set, so the profile created for the training set is completely different from the profile of the test set.
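For the first point, the overlap removal between the two databases can be pictured roughly as follows. This is only an illustrative sketch, not the code from the notebook: it assumes every compound is already reduced to one canonical identifier string (for example a canonical SMILES), the toy identifiers are made up, `remove_overlap` is a hypothetical helper name, and dropping the shared structures from the pre-training side is an assumption made for the example.

```python
def remove_overlap(pretrain_ids, downstream_ids):
    """Return the pre-training compounds that do not appear downstream.

    Both arguments are iterables of canonical identifier strings.
    The order of the pre-training compounds is preserved.
    """
    downstream = set(downstream_ids)
    return [cid for cid in pretrain_ids if cid not in downstream]


# Toy identifiers standing in for the two databases.
sider_ids = ["CCO", "CC(=O)O", "c1ccccc1", "CCN"]
tox21_ids = ["c1ccccc1", "CCO", "CCCl"]

# Compounds left for pre-training once the shared structures are dropped.
sider_filtered = remove_overlap(sider_ids, tox21_ids)
print(sider_filtered)  # ['CC(=O)O', 'CCN']
```

The same check can be run in the other direction to confirm that no downstream compound was seen during pre-training.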

Best wishes


ph-mehdi / BioAct-Het

Possible test set leak #1