maovshao / PLMSearch

PLMSearch enables accurate and fast homologous protein search with only sequences as input
https://dmiip.sjtu.edu.cn/PLMSearch
MIT License

replace PDB/swissprot/uniref with my specific protein set #9

Closed: lsj-666 closed this issue 1 month ago

lsj-666 commented 1 month ago

Hi developer, thank you for this nice tool! Can I replace Swiss-Prot/UniRef/PDB with my own set of proteins? I want to compare two of my own protein sets (that is, search for homologs in my large protein set).

maovshao commented 1 month ago

Thanks for the great question.

Technically, yes!

Just follow the steps in https://github.com/maovshao/PLMSearch/blob/main/pipeline.ipynb.

Your requirement can be met with the following steps:

  1. Prepare the embeddings (with ./plmsearch/embedding_generate.py) and the Pfamscan results (with ./plmsearch/pfam_generate.py) for both the query dataset and the target dataset (the 2 sets you described).

  2. Replace the settings in the PLMSearch pipeline with your own embeddings and Pfamscan results: -qpr and -tpr in ./plmsearch/main_pfam.py, and -iqe and -ite in ./plmsearch/main_similarity.py.
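As a rough illustration of step 2, the sketch below only assembles the two command lines using the flags named above. It assumes that -qpr/-tpr take the query/target Pfamscan results and -iqe/-ite take the query/target embeddings; the file paths are hypothetical placeholders, and any other arguments the scripts need (output locations and so on) are deliberately left out and should be taken from pipeline.ipynb.

```python
import shlex


def build_commands(query_pfam, target_pfam, query_emb, target_emb):
    """Return argument lists for the two PLMSearch steps.

    Only the flags mentioned in this thread are used. Assumption:
    -qpr/-tpr are the query/target Pfamscan results and -iqe/-ite are
    the query/target embeddings. Other required arguments are omitted;
    see pipeline.ipynb for the complete invocations.
    """
    pfam_cmd = [
        "python", "./plmsearch/main_pfam.py",
        "-qpr", query_pfam,
        "-tpr", target_pfam,
    ]
    similarity_cmd = [
        "python", "./plmsearch/main_similarity.py",
        "-iqe", query_emb,
        "-ite", target_emb,
    ]
    return pfam_cmd, similarity_cmd


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    commands = build_commands(
        "my_data/query_pfam_result",
        "my_data/target_pfam_result",
        "my_data/query_embedding",
        "my_data/target_embedding",
    )
    for cmd in commands:
        # Print instead of executing, so the commands can be reviewed
        # and completed with the remaining arguments first.
        print(shlex.join(cmd))
```

Printing the commands rather than running them keeps the sketch safe to try before the remaining arguments have been filled in.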

If you find this answer and PLMSearch helpful, please Star the repository.

lsj-666 commented 4 weeks ago

Thank you for your kind help! Since I have a large number of sequences, it will take my server some time to work through your suggestions. I also have another question: once I get my results table, should the pairs listed in it be understood as those the software considers homologous? Or do we need to screen or filter them further based on similarity and the other values? Thank you so much!

maovshao commented 4 weeks ago

Please refer to our paper (Fig. 6 for instance) for further instructions.

lsj-666 commented 4 weeks ago

Thank you for your help!

lsj-666 commented 3 weeks ago

It seems that the pfam_generate.py step will be very slow for a large number of protein sequences. Is there any way to speed up the process?

maovshao commented 3 weeks ago

I think it depends on the power of your CPU cores.

Additionally, since pfam_generate.py merely invokes the third-party software Pfamscan, it seems there isn’t much we can modify on our end.
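One generic workaround on a multi-core machine, not confirmed anywhere in this thread, is to split the input FASTA into several smaller files and run the Pfam step on each file in a separate process. The sketch below covers only the splitting; the helper names are hypothetical, and it assumes that pfam_generate.py can be run independently on each chunk and that the per-chunk results can be merged afterwards, which should be checked against the repository before relying on it.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_records(records, n_chunks):
    """Distribute records round-robin into at most n_chunks non-empty lists."""
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return [chunk for chunk in chunks if chunk]


def write_chunks(fasta_path, out_prefix, n_chunks):
    """Split one FASTA file into chunk files and return their paths."""
    with open(fasta_path) as handle:
        chunks = split_records(read_fasta(handle), n_chunks)
    paths = []
    for i, chunk in enumerate(chunks):
        path = f"{out_prefix}.part{i}.fasta"
        with open(path, "w") as out:
            for header, seq in chunk:
                out.write(f">{header}\n{seq}\n")
        paths.append(path)
    return paths
```

Round-robin assignment keeps the number of sequences per chunk balanced; it does not balance total sequence length, so chunks may still finish at different times.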

wangleiofficial commented 2 weeks ago

Using domains as the pre-filtering stage still depends heavily on HMMER, which is very slow. If no domain information is found, a global search may be required, which consumes far more memory than MMseqs2 and takes longer. This makes it unsuitable for large-scale sequence retrieval.

maovshao commented 2 weeks ago

Yes.

From an engineering-efficiency perspective, there are two more specific situations:

  1. The query dataset is small and the target dataset is large. In this case, the use of PLMSearch is not greatly affected, because the user only needs to run Pfamscan on the smaller query dataset; the target datasets (Swiss-Prot, UniRef50, etc.) have been precomputed.

  2. Both the query dataset and the target dataset are large. In this case, the user needs to run Pfamscan on the large query dataset. If CPU performance is limited, we recommend searching directly with the SS-predictor method, which does not require Pfam information.