Closed lsj-666 closed 1 month ago
Thanks for the great question.
Technically, yes!
Just follow the steps in https://github.com/maovshao/PLMSearch/blob/main/pipeline.ipynb.
Your requirement can be met with the following steps:
1. Prepare your embeddings (with ./plmsearch/embedding_generate.py) and Pfamscan results (with ./plmsearch/pfam_generate.py) for both the query dataset and the target dataset (the 2 sets you described).
2. Replace the settings in the PLMSearch pipeline with your own embeddings and Pfamscan results, i.e. -qpr and -tpr in ./plmsearch/main_pfam.py, and -iqe and -ite in ./plmsearch/main_similarity.py.
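As a rough illustration of step 2, the sketch below assembles the two command lines using only the flags named above. The file paths and the helper name build_plmsearch_commands are hypothetical, and any other arguments the scripts require (output paths, for example) are deliberately left out and should be taken from pipeline.ipynb.

```python
from typing import Dict, List


def build_plmsearch_commands(paths: Dict[str, str]) -> List[List[str]]:
    """Build argv lists for the two pipeline stages with custom inputs.

    Only the flags mentioned in this thread are used. Any other required
    arguments are NOT included here; check pipeline.ipynb for them.
    """
    pfam_cmd = [
        "python", "./plmsearch/main_pfam.py",
        "-qpr", paths["query_pfam"],    # Pfamscan result of the query set
        "-tpr", paths["target_pfam"],   # Pfamscan result of the target set
    ]
    similarity_cmd = [
        "python", "./plmsearch/main_similarity.py",
        "-iqe", paths["query_embedding"],   # embedding of the query set
        "-ite", paths["target_embedding"],  # embedding of the target set
    ]
    return [pfam_cmd, similarity_cmd]


if __name__ == "__main__":
    # Hypothetical file locations; replace them with your own outputs.
    my_paths = {
        "query_pfam": "my_data/query_pfam_result",
        "target_pfam": "my_data/target_pfam_result",
        "query_embedding": "my_data/query_embedding",
        "target_embedding": "my_data/target_embedding",
    }
    for cmd in build_plmsearch_commands(my_paths):
        print(" ".join(cmd))
```

The script only prints the commands, so you can compare them with the cells in the notebook before running anything.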
If you find this answer and PLMSearch helpful, please Star the repository.
Thank you for your kind help! Since I have a large number of sequences, it will take some time for my server to work through your suggestions. I would also like to ask another question: when I get my results table, can the pairs appearing in it be understood as pairs that the software considers homologous? Do we need to further screen or filter them based on similarity and the other values? Thank you so much!
Please refer to our paper (Fig. 6 for instance) for further instructions.
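If you do decide to screen the table by similarity, a minimal sketch could look like the following. It assumes, without confirmation from this thread, that each result line is tab-separated as query, target, similarity with no header row. The function name filter_pairs and the cutoff value are hypothetical; pick the actual cutoff from the paper.

```python
import csv
import io
from typing import Iterable, List, Tuple


def filter_pairs(lines: Iterable[str],
                 min_similarity: float) -> List[Tuple[str, str, float]]:
    """Keep only (query, target, similarity) rows at or above a cutoff.

    Assumed input format: tab-separated query, target, similarity,
    one pair per line, no header row.
    """
    kept = []
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        score = float(row[2])
        if score >= min_similarity:
            kept.append((row[0], row[1], score))
    return kept


if __name__ == "__main__":
    # Toy data standing in for a results file; values are made up.
    demo = io.StringIO("q1\tt1\t0.91\nq1\tt2\t0.12\nq2\tt3\t0.67\n")
    for query, target, score in filter_pairs(demo, min_similarity=0.5):
        print(query, target, score)
```

For a real file, pass an open file handle instead of the StringIO object.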
Thank you for your help!
It seems that the pfam_generate.py step will be very slow for a large number of protein sequences. Is there any way to speed up the process?
I think it depends on the power of your CPU cores.
Additionally, since pfam_generate.py merely invokes the third-party software Pfamscan, it seems there isn’t much we can modify on our end.
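One generic workaround, which is not something PLMSearch provides, is to split the input FASTA into several chunks so that separate Pfamscan runs can be started on different CPU cores. The sketch below only does the splitting. The helper names are hypothetical, and it is an assumption that the per-chunk results can simply be merged afterwards. Whether this helps also depends on how many cores Pfamscan already uses, which is not stated here.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header line including '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        elif line and header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_records(records: Iterable[Record],
                  n_chunks: int) -> List[List[Record]]:
    """Distribute records round-robin into n_chunks lists."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks: List[List[Record]] = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return chunks


if __name__ == "__main__":
    # Toy sequences; in practice, iterate over an open FASTA file.
    demo = [">p1", "MKV", "LAA", ">p2", "GGS", ">p3", "TTP"]
    for idx, chunk in enumerate(split_records(read_fasta(demo), 2)):
        print(idx, [header for header, _ in chunk])
```

Each chunk can then be written to its own FASTA file and passed to a separate pfam_generate.py run.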
The process of using domains as the pre-filtering stage still depends heavily on HMMER, which is very slow. If no domain information is found, a global search may be required, which consumes a lot of memory, far more than MMseqs2, and takes longer. This makes it unsuitable for large-scale sequence retrieval.
Yes
From the perspective of engineering efficiency, there are two more specific situations:
1. The query protein set is small and the target protein dataset is large. In this case, the use of PLMSearch will not be greatly affected, because the user only needs to run Pfamscan on the smaller query dataset, and the results for the target protein dataset (Swiss-Prot, UniRef50, etc.) have been calculated in advance.
2. The query protein set and the target protein dataset are both large. In this case, the user needs to run Pfamscan on the large query dataset. If CPU performance is limited, it is recommended to search directly with the SS-predictor method, so that Pfam information is not required.
Hi developer, thank you for this nice tool! I wonder whether I can replace Swiss-Prot/UniRef/PDB with my own specific set of proteins? I want to compare 2 of my own protein sets (that is, search for homologs in my big protein set).