samsledje / ConPLex

Adapting protein language models and contrastive learning for highly-accurate drug-target interaction prediction.
http://conplex.csail.mit.edu
MIT License

Poor results on LIT-PCBA - ideas to improve #24

Open VladVin opened 1 year ago

VladVin commented 1 year ago

Hey @samsledje! Thanks for the great work you're doing; it's a really powerful idea to apply protein-ligand similarity search to virtual screening.

Regarding your current results: I tested the pre-trained model on LIT-PCBA in both modes (ligand-ligand and ligand-protein), and it turns out that it cannot outperform standard ECFP4 fingerprints. Here is the comparison:

[Figures: benchmark_compare_auroc and benchmark_compare_ef1, comparing methods by AUROC and EF1%]

The two plots evaluate different methods (ECFP4, GROVER, Uni-Mol, and ConPLex) on two metrics: AUROC and EF1%. As can be seen, protein-ligand matching performs close to random selection (AUROC = 50.9%). Ligand-ligand matching performs almost the same as ECFP4, which I assume is because the ConPLex ligand embeddings are built from the same Morgan (ECFP4) fingerprints. I used cosine similarity for all tests (Euclidean distance performed worse).
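
For reference, here is a minimal sketch of how the two metrics can be computed from model scores (my own toy illustration, not the benchmark code itself; `scores` and `labels` are placeholders):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def ef_at(scores, labels, fraction=0.01):
    """Enrichment factor: hit rate among the top `fraction` of ranked
    ligands, divided by the overall hit rate."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    order = np.argsort(scores)[::-1]             # best-scored ligands first
    n_top = max(1, int(len(scores) * fraction))  # size of the top slice
    return labels[order[:n_top]].mean() / labels.mean()

# Toy data: six ligands, two actives.
scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.2])
labels = np.array([1, 0, 1, 0, 0, 0])
print("AUROC:", roc_auc_score(labels, scores))  # 0.875
print("EF1%:", ef_at(scores, labels))           # 3.0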

So, here is my understanding of potential improvements:

  1. DUD-E is a highly biased dataset, as has already been discussed in many papers like this and this. It shouldn't be used for testing purposes.
  2. Training on one part of DUD-E and testing on another leads to learning those built-in biases; that's why you see a good latent space for DUD-E.
  3. The embeddings contain a lot of zero values; I even encountered a completely zeroed embedding. That's because Morgan fingerprints themselves are mostly zeros (see the sketch below). To improve this, I suggest using neural-network fingerprints from a model like GROVER, since they are continuous in space.
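
To make the sparsity point concrete, here is a small RDKit sketch (aspirin is just a placeholder molecule; drug-like molecules generally behave similarly):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Aspirin as a toy example.
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)  # radius 2 ~ ECFP4
on = fp.GetNumOnBits()
print(f"{on} of {fp.GetNumBits()} bits set ({on / fp.GetNumBits():.1%} nonzero)")
```
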
samsledje commented 1 year ago

Hey @VladVin ,

Thanks for sharing this, it's really interesting, and we welcome suggestions on how to improve ConPLex!

Can you clarify what you mean by the two modes? ConPLex was designed for protein-ligand interaction prediction. It would be helpful to have more detail on how these experiments were run and how ConPLex was used. I'd also be curious to see how different ConPLex models perform on your benchmarks -- we're still actively developing the package, but I'm happy to send you binaries for models trained on DAVIS, BioSNAP, etc., as well as the one currently available for download.

I'm happy to look into fine-tuning ConPLex using the LIT-PCBA non-binders as well; I think this is a good suggestion. We chose to split DUD-E by target type to ensure that model training could generalize to several different types of targets (a group-style split; see the sketch below).
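
To illustrate what a split by target type looks like, here is a minimal scikit-learn sketch with made-up target families (not the actual DUD-E split code):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy interaction table; the real rows would be DUD-E protein-ligand pairs.
df = pd.DataFrame({
    "target":      ["ABL1", "EGFR", "ADRB2", "DRD3", "ESR1", "PPARG"],
    "target_type": ["kinase", "kinase", "GPCR", "GPCR", "nuclear", "nuclear"],
})

# Hold out one whole target type so the test set probes generalization
# to an unseen target family rather than per-family artifacts.
splitter = GroupShuffleSplit(n_splits=1, test_size=1, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["target_type"]))
print("train types:", sorted(set(df.iloc[train_idx]["target_type"])))
print("test types: ", sorted(set(df.iloc[test_idx]["target_type"])))
```
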

We've tried several different methods for initial ligand features, including some neural network-based fingerprints (see Supplementary Methods), and found that the Morgan fingerprint yielded the best performance on the binary prediction task. Generally, we've found it trickier to find a foundation model for small-molecule representations that is as broadly applicable as PLMs seem to be. However, we haven't evaluated GROVER specifically in this framework, and it may improve performance. ConPLex is designed to allow the protein and ligand input representations to be easily swapped out (sketched below), and this is definitely something we're interested in pursuing further.
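
As a rough illustration of that plug-and-play design, here is a hypothetical sketch; the `LigandFeaturizer` contract below is an assumed interface for exposition, not the actual ConPLex API:

```python
from typing import Protocol, Sequence
import numpy as np

class LigandFeaturizer(Protocol):
    """Assumed contract: a batch of SMILES strings in, an embedding matrix out."""
    def __call__(self, smiles: Sequence[str]) -> np.ndarray: ...

def embed_ligands(featurizer: LigandFeaturizer, smiles: Sequence[str]) -> np.ndarray:
    # Any featurizer honoring the contract (Morgan bits, a GROVER wrapper, ...)
    # can be swapped in without touching the rest of the pipeline.
    return featurizer(smiles)

def dummy_featurizer(smiles: Sequence[str]) -> np.ndarray:
    # Stand-in for a real featurizer, used here only to show the plug-in shape.
    return np.random.default_rng(0).normal(size=(len(smiles), 128))

print(embed_ligands(dummy_featurizer, ["CCO", "c1ccccc1"]).shape)  # (2, 128)
```
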

Thanks!

VladVin commented 1 year ago

As for the two scenarios I was testing, here is what they looked like:

  1. Protein-ligand matching. I took the ProtBertFeaturizer and featurized all the LIT-PCBA targets, concatenating the protein sequences when a target had multiple parts. Then I calculated fingerprints of the actives and inactives with the MorganFeaturizer. Next, I ran this model on both inputs and calculated the cosine distance between the protein embeddings and the molecule embeddings. Then I calculated the metrics.
  2. Ligand-ligand matching. I featurized the known ligands, actives, and inactives with the MorganFeaturizer and ran the same neural net on these inputs. Then I calculated the distances from the known ligands to the actives and inactives, and computed the metrics. (The scoring step common to both modes is sketched after this list.)
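
For concreteness, the scoring step looks roughly like this (a sketch with random placeholder embeddings standing in for the co-embedded representations produced by the model):

```python
import numpy as np

def cosine_scores(prot_embs: np.ndarray, lig_embs: np.ndarray) -> np.ndarray:
    # Row-normalize both matrices; one matrix product then yields all
    # pairwise cosine similarities at once.
    p = prot_embs / np.linalg.norm(prot_embs, axis=1, keepdims=True)
    m = lig_embs / np.linalg.norm(lig_embs, axis=1, keepdims=True)
    return p @ m.T  # shape (n_targets, n_ligands); higher = more similar

# Random placeholders standing in for the actual embeddings.
rng = np.random.default_rng(0)
print(cosine_scores(rng.normal(size=(2, 8)), rng.normal(size=(5, 8))).shape)  # (2, 5)
```
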

I don't quite get what you would want to fine-tune the model on LIT-PCBA for. For me, LIT-PCBA is a great independent benchmark with minimal biases, so it can be used for real testing of how well general models perform at virtual screening.

As for DUD-E, it contains a lot of biases, so splitting the dataset into train/test doesn't actually work: the model learns the methodology by which DUD-E was constructed, not the actual separation of actives from inactives.

rs239 commented 1 year ago

@VladVin thanks for your analysis and for telling us about the LIT-PCBA dataset! I'm a little confused, unfortunately. Is the goal here to distinguish inactives from actives, or to predict protein-drug binding? Since most of the bars in the plots correspond to drug-only representations, it seems the goal is the former. But then what does the prot2lig mode do? Thank you again for sharing this -- we're keen to improve ConPLex further!

VladVin commented 1 year ago

Yes, the goal of the benchmark is to separate actives from inactives. prot2lig is the protein-ligand matching mode, and lig2lig is the ligand-ligand matching mode. I uploaded the ConPLex benchmarking code as a GitHub Gist; check it out. If you have any questions, let me know.