labstructbioinf / pLM-BLAST

Detection of remote homology by comparison of protein language model representations
https://toolkit.tuebingen.mpg.de/tools/plmblast
MIT License

[QUESTION/SUGGESTION] Determining realistic use cases and limitations #53

Closed DaRinker closed 1 month ago

DaRinker commented 2 months ago

Hello,

I'm coming back to pLM-BLAST after leaving it for a while. I really like the approach and wanted to see what improvements had been made since October 2023. I'm still hoping to apply it to a very large data set (1000 proteomes) but need to confirm that this is realistically feasible.

I'm happy to say that the parallelization improvement has really helped. It has made GPU compute time no longer a limitation for me. (So, thank you!)

However, I'm now moving on to calculating the sizes of the embeddings, and it seems like these will get very large very quickly. My current estimate is that I will need at least 8TB for all my proteomes. Does this sound correct?

While I think I can come up with a temporary storage solution on our HPC, I am now wondering how this would affect query times and memory. How do these query resources scale with the size/number of embeddings?

Perhaps it would be useful if the documentation could break down a few use cases with realistic compute times (GPU and CPU) along with RAM requirements for processing.

staszekdh commented 2 months ago

The ECOD30 database, which contains 32k domain-sized entries, is 11GB in size. Assuming that the average number of genes in a bacterial genome is 4k, this would be >1TB for 1000 genomes. If you are working with eukaryotic genomes, your estimate will be more accurate. I am afraid that with the current database format (one file per embedding in a single directory) it would be inefficient and cumbersome to handle. (We are experimenting with the h5 format, but this is a work in progress).
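The >1TB figure can be reproduced with a back-of-envelope calculation that scales from the ECOD30 numbers quoted above. A minimal sketch, assuming an average bacterial genome of 4k genes and a rough 2-3x length factor (assumption: full-length bacterial proteins average roughly 2-3x the length of a domain-sized entry, and embedding size grows linearly with sequence length):

```python
# Back-of-envelope storage estimate, scaling from the ECOD30 figure
# (11 GB for 32k domain-sized entries).
ecod_gb = 11.0
ecod_entries = 32_000
gb_per_entry = ecod_gb / ecod_entries           # ~0.34 MB per domain-sized entry

genomes = 1_000
genes_per_genome = 4_000                        # typical bacterial genome
total_entries = genomes * genes_per_genome      # 4 million proteins

# Assumption: full-length proteins are ~2-3x longer than a domain,
# and embedding size grows linearly with sequence length.
length_factor = 2.5

estimate_tb = total_entries * gb_per_entry * length_factor / 1024
print(f"~{estimate_tb:.1f} TB")                 # on the order of a few TB
```

For eukaryotic genomes (more and longer genes), the same arithmetic pushes the total well past this, which is consistent with the 8TB estimate above.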

I assume your goal is to cluster all the ORFs from these 1000 genomes? If this is the case, then it would be best to use the --only-scan flag. It tells pLM-BLAST to do only the chunk cosine similarity scan (see the paper for details) and skip the alignment part. This is fast and the resulting similarity matrix can be used for clustering (I attach an example of clustering 3000 peptides which took ~1min). This feature is still undocumented, but we plan to add it along with examples in the coming days.
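To illustrate how a similarity matrix like the one `--only-scan` produces can drive clustering, here is a minimal self-contained sketch (assumptions: an all-vs-all similarity matrix with values in [0, 1]; the tiny matrix and threshold below are made up for illustration, and single-linkage by connected components is just one simple choice of clustering method):

```python
import numpy as np

def cluster_by_threshold(sim, thr):
    """Single-linkage clustering: connect entries whose pairwise
    similarity is >= thr, then label the connected components."""
    n = sim.shape[0]
    labels = -np.ones(n, dtype=int)
    cur = 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        stack = [i]
        labels[i] = cur
        while stack:
            j = stack.pop()
            for k in np.nonzero(sim[j] >= thr)[0]:
                if labels[k] < 0:
                    labels[k] = cur
                    stack.append(k)
        cur += 1
    return labels

# Toy 3x3 similarity matrix: entries 0 and 1 are similar, 2 is not.
sim = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
labels = cluster_by_threshold(sim, 0.5)
print(labels.tolist())  # [0, 0, 1]
```

Since the scan skips the alignment stage entirely, this kind of downstream clustering is cheap even for thousands of sequences.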

*(Attached image: clustering example for 3000 peptides.)*

In the example above, pLM-BLAST provides HHpred-level sensitivity, which brings me to another issue. You may have a lot of very similar sequences in your data set. I would suggest clustering them first with mmseqs2 or cd-hit and then using pLM-BLAST only for the representatives of the clusters.