Closed berkelem closed 5 years ago
Regarding the first question:
_consensus is the consensus sequence generated by taking the highest scoring amino acid in each profile position.
_seed contains the representative (= first) sequence of each alignment.
The files are not really necessary anymore, since both of these sequences are now also stored inside the (binary) profile format. We haven't added separate modules to extract these sequences from profiles yet, so we didn't remove the databases.
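To illustrate the "highest scoring amino acid in each profile position" rule, here is a minimal sketch in Python. It is not the MMseqs2 implementation and does not read the binary profile format. The toy profile (one dictionary of scores per position) and the function name consensus_from_profile are assumptions made for this example.

```python
# Toy illustration only: each profile position is a dict mapping an
# amino acid to its score. This layout is hypothetical and is not the
# MMseqs2 binary profile format.
def consensus_from_profile(profile):
    """Return the sequence built from the best-scoring residue per position."""
    # max(..., key=column.get) picks the residue with the largest score;
    # on ties it keeps the first residue encountered.
    return "".join(max(column, key=column.get) for column in profile)


toy_profile = [
    {"A": 0.7, "G": 0.2, "S": 0.1},
    {"L": 0.5, "I": 0.4, "V": 0.1},
    {"K": 0.3, "R": 0.6, "E": 0.1},
]

print(consensus_from_profile(toy_profile))  # prints "ALR"
```

Because the consensus is derived from the whole alignment, it does not have to match any single input sequence, including the representative one.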
Second question: Num corresponds to the database key (.index) of each entry. We could add an option to parse the accession from the corresponding sequence header. Would that be useful?
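As a rough sketch of what "database key" means here, the snippet below parses index-style lines and looks up one key. It assumes each .index line holds three tab-separated fields (key, offset, length); that layout, the example values, and the function name read_index are assumptions for illustration and should be checked against your own files.

```python
import io


def read_index(handle):
    """Map each database key to its (offset, length) pair."""
    entries = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumed layout: key <TAB> offset <TAB> length
        key, offset, length = line.split("\t")
        entries[int(key)] = (int(offset), int(length))
    return entries


# Made-up example content standing in for a real .index file.
example = io.StringIO("0\t0\t1520\n1\t1520\t980\n2\t2500\t2210\n")
index = read_index(example)

# The "num" in a PSSM header would be one of these keys.
print(index[1])  # prints (1520, 980)
```

With a real database you would pass an open file handle instead of the in-memory example.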
The module was requested by a user who wanted a human-readable output of our binary profile format. We currently use profile2pssm mostly as a debugging tool. What are you using it for?
Thanks for your reply.
I was using the PSSM format to better understand the content of the profile file and how it was related to the _seq and _consensus files. I couldn't see any similarity between the first entry in the PSSM file and either of the first entries in the _seq or _consensus files, hence my confusion. Perhaps an option to have some reference to the original sequence alignment would be useful for clarity.
You mentioned that everything is now stored inside the binary profile. I am working with a particularly large profile (9.1 GB). Do you have any recommendations for optimizing the mmseqs search function with this profile as a target?
You mean a whole profile database, right? How many entries are contained in that database?
You should be able to follow the same advice as for the Pfam database: https://github.com/soedinglab/MMseqs2/wiki#how-to-create-a-target-profile-database-from-pfam
-k 5 will trade off a bit of sensitivity for much smaller memory requirements. You can index the profile database for faster repeated searches.
This should scale well for a couple hundred thousand profiles. For searches against millions of profiles we will hopefully have a different solution soon. There is still some benchmarking to do before it's ready for prime time.
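The advice above (index once with -k 5, then search with the same k-mer size) could be scripted roughly as follows. This sketch only assembles the command lines and does not run them. The database names and the helper build_commands are placeholders invented for this example, and the argument order should be verified against mmseqs createindex -h and mmseqs search -h for your version.

```python
import shlex


def build_commands(query_db, target_db, result_db, tmp_dir, kmer=5, sensitivity=None):
    """Assemble createindex and search command lines that share one k-mer size."""
    # Index the target profile database once so repeated searches can reuse it.
    index_cmd = ["mmseqs", "createindex", target_db, tmp_dir, "-k", str(kmer)]
    # The search uses the same -k value as the precomputed index.
    search_cmd = ["mmseqs", "search", query_db, target_db, result_db, tmp_dir,
                  "-k", str(kmer)]
    if sensitivity is not None:
        search_cmd += ["-s", str(sensitivity)]
    return index_cmd, search_cmd


# Placeholder database names, not taken from the thread.
index_cmd, search_cmd = build_commands("queryDB", "profileDB", "resultDB", "tmp")
print(shlex.join(index_cmd))
print(shlex.join(search_cmd))
```

To execute the commands, each list can be passed to subprocess.run(cmd, check=True).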
Yes I mean a profile database. There are about 1.2 million entries in the database.
Thanks for the suggestions. I have got it working with -k 5 -s 1 after pre-computing the index and it only takes a few minutes. I'll play around with the sensitivity to see how far I can push it in a reasonable time.
-s 1 is usually not that useful. We just published a benchmark of the profile search with the MMseqs2 Webserver:
https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty1057/5280135
Around -s 6 would be the preferred sensitivity level, though its memory requirements might be too steep for 1.2M profiles. We will have a different kind of profile search soon with different trade-offs for larger databases.
Could you please provide descriptions of the output of msa2profile? I couldn't find them anywhere in the documentation.
Specifically, I have the following questions:
1) What is the difference between the profile_consensus file and profile_seq? I can see they have the same number of lines but the _consensus file is slightly larger.
2) When I convert the profile file into PSSM format, each data table has a header of the form "Query profile of sequence {num}", where the value of "num" doesn't seem to correspond with any value in the input alignments. Where does this number come from?