Thanks for raising the issue! Yes, overall the positives are harder to predict accurately. We are predicting pIC50, which is on a log scale, so the positive values are spread out while the negative ones are concentrated. Predicting raw IC50 is even harder, since for positives it can be an extremely small number. One thing to try is target-specific training using the molecule property model (a rough sketch below), or switching to a binary classification model.
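A minimal sketch of the target-specific route, following the pattern of the DeepPurpose compound property prediction examples. The names `grm5_smiles` and `grm5_ic50_nM` are placeholders for the GRM5 ligands pulled from BindingDB, and the encoder and hyperparameter choices are only starting points, not recommendations:

```python
# Sketch: train a GRM5-specific pIC50 regressor with the molecule property model.
# Assumptions: grm5_smiles is a list of SMILES strings and grm5_ic50_nM the matching
# measured IC50 values in nM, both extracted from BindingDB for this one target.
import numpy as np
from DeepPurpose import utils
from DeepPurpose import CompoundPred as models

y = 9.0 - np.log10(np.array(grm5_ic50_nM))        # convert IC50 (nM) to pIC50

drug_encoding = 'CNN'
train, val, test = utils.data_process(X_drug=grm5_smiles, y=y,
                                      drug_encoding=drug_encoding,
                                      split_method='random',
                                      frac=[0.7, 0.1, 0.2],
                                      random_seed=1)

config = utils.generate_config(drug_encoding=drug_encoding,
                               cls_hidden_dims=[1024, 1024, 512],
                               train_epoch=50, LR=1e-3, batch_size=128)
model = models.model_initialize(**config)
model.train(train, val, test)
```

For the binary-classification alternative, you could threshold the same pIC50 values into active/inactive labels (the cutoff is up to you) and train a classifier on the same splits.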
Hello,
I have been running DTI virtual screening on a protein target using several compounds with the highest binding affinity in BindingDB (my positive reference dataset) and several compounds with the lowest binding affinity in BindingDB (my negative reference dataset). I ran all of the provided pre-trained BindingDB models to see how their predictions compare to the actual IC50 values for both reference sets, and found that the models predict much more accurately for the negative reference than for the positive reference. For example, on the GRM5 target the cnn_cnn_bindingDB model predicted the IC50 within +/-1 for 75% of the negative reference compounds but for only 10% of the positive reference compounds. I have observed the same pattern with several other protein targets and with the other pre-trained BindingDB models.
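For reference, this is roughly the check I am running (a simplified sketch; `pos_smiles`/`pos_pic50` and `neg_smiles`/`neg_pic50` stand in for my reference sets, `grm5_seq` for the GRM5 amino-acid sequence, and the pretrained-model string mirrors the model name above, so it may need adjusting to the exact identifier DeepPurpose expects):

```python
# Sketch of the comparison described above: fraction of compounds whose predicted
# affinity is within +/-1 log unit of the measured value, for each reference set.
import numpy as np
from DeepPurpose import utils
from DeepPurpose import DTI as models

model = models.model_pretrained(model='CNN_CNN_BindingDB_IC50')  # model name assumed, see note above
grm5_seq = '...'                                                 # GRM5 amino-acid sequence (placeholder)

def frac_within_one(smiles_list, true_pic50):
    """Fraction of compounds predicted within +/-1 of the measured pIC50."""
    data = utils.data_process(X_drug=smiles_list,
                              X_target=[grm5_seq] * len(smiles_list),
                              y=list(true_pic50),
                              drug_encoding='CNN', target_encoding='CNN',
                              split_method='no_split')   # if your version lacks this, split and concatenate
    preds = np.array(model.predict(data))
    return np.mean(np.abs(preds - np.array(true_pic50)) <= 1.0)

print('negative reference:', frac_within_one(neg_smiles, neg_pic50))
print('positive reference:', frac_within_one(pos_smiles, pos_pic50))
```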
Do you have any thoughts on how the way the models were trained might be causing this imbalance? Since my test data came from the same dataset the models were trained on, I expected the predictions to be fairly accurate at both ends of the IC50 range. Do you have any recommendations for changes to the parameters, or to other parts of your code, that might help?
Thank you!