Descriptions for output of msa2profile

soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite

https://mmseqs.com

GNU General Public License v3.0

Descriptions for output of msa2profile #156

Closed: berkelem closed this issue 5 years ago

berkelem commented 5 years ago

Could you please provide descriptions of the output of msa2profile? I couldn't find them anywhere in the documentation.

Specifically, I have the following questions:

1) What is the difference between the profile_consensus file and profile_seq? I can see they have the same number of lines but the _consensus file is slightly larger.

2) When I convert the profile file into PSSM format, each data table has a header of the form "Query profile of sequence {num}", where the value of "num" doesn't seem to correspond with any value in the input alignments. Where does this number come from?

milot-mirdita commented 5 years ago

Regarding the first question:

_consensus is the consensus sequence, generated by taking the highest-scoring amino acid at each profile position.
_seed contains the representative (= first) sequence of each alignment.
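
To make the consensus rule concrete, here is a minimal sketch of "take the highest-scoring residue at each position". It is not MMseqs2 code: the function name, the list-of-rows layout, and the three-letter toy alphabet are assumptions made for illustration, and ties are resolved by taking the first maximum, which may differ from what MMseqs2 does.

```python
# Hypothetical helper for illustration only; not part of MMseqs2.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order, one column per residue


def consensus_from_scores(score_rows, alphabet=AMINO_ACIDS):
    """Return one residue per profile position: the one with the highest score."""
    consensus = []
    for position, row in enumerate(score_rows):
        if len(row) != len(alphabet):
            raise ValueError(
                f"position {position}: expected {len(alphabet)} scores, got {len(row)}"
            )
        # Index of the largest score; max() keeps the first index on ties.
        best = max(range(len(row)), key=row.__getitem__)
        consensus.append(alphabet[best])
    return "".join(consensus)


# Toy profile with three positions over a three-letter alphabet.
toy_scores = [
    [3, -1, 0],
    [-2, 4, 1],
    [0, 0, 2],
]
toy_consensus = consensus_from_scores(toy_scores, alphabet="ACD")
print(toy_consensus)
```

The seed sequence, by contrast, needs no computation: it is simply the first sequence of the alignment, so the two can differ wherever the seed residue is not the top-scoring one.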

The files are not really necessary anymore, since both of these sequences are now also stored inside the (binary) profile format. We didn't add separate modules to extract these sequences from profiles yet, so we didn't remove the databases.

Second question: Num corresponds to the database key (in the .index file) of each entry. We could add an option to parse the accession from the corresponding sequence header. Would that be useful? The module was requested by a user who wanted a human-readable output of our binary profile format. We currently use profile2pssm mostly as a debugging tool. What are you using it for?
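
As a rough sketch of how the number in a PSSM header could be tied back to a database entry, the snippet below extracts the key from a "Query profile of sequence {num}" line and looks it up in the contents of an .index file. It assumes the .index file is tab-separated with the columns key, offset, and length; the function names and sample values are hypothetical.

```python
import re

# Matches the PSSM table header quoted in this thread.
HEADER_PATTERN = re.compile(r"^Query profile of sequence (\d+)")


def parse_index(index_text):
    """Map database key -> (offset, length).

    Assumes each non-empty line is 'key<TAB>offset<TAB>length'.
    """
    entries = {}
    for line in index_text.splitlines():
        if not line.strip():
            continue
        key, offset, length = line.split("\t")[:3]
        entries[int(key)] = (int(offset), int(length))
    return entries


def key_from_header(line):
    """Return the database key from a PSSM header line, or None if it is not a header."""
    match = HEADER_PATTERN.match(line)
    return int(match.group(1)) if match else None


# Made-up sample data standing in for a real .index file.
sample_index = "0\t0\t120\n1\t120\t98\n"
index_entries = parse_index(sample_index)
header_key = key_from_header("Query profile of sequence 1")
print(header_key, index_entries.get(header_key))
```

Because the key is an internal database identifier, it will not in general match any name or position in the input alignments.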

berkelem commented 5 years ago

Thanks for your reply. I was using the PSSM format to better understand the content of the profile file and how it was related to the _seq and _consensus files. I couldn't see any similarity between the first entry in the PSSM file and either of the first entries in the _seq or _consensus files, hence my confusion. Perhaps an option to have some reference to the original sequence alignment would be useful for clarity.

You mentioned that everything is now stored inside the binary profile. I am working with a particularly large profile (9.1 GB). Do you have any recommendations for optimizing the mmseqs search function with this profile as a target?

milot-mirdita commented 5 years ago

You mean a whole profile database, right? How many entries are contained in that database?

You should be able to follow the same advice as for the Pfam database: https://github.com/soedinglab/MMseqs2/wiki#how-to-create-a-target-profile-database-from-pfam

-k 5 will trade off a bit of sensitivity for much smaller memory requirements. You can index the profile database for faster repeated searches.
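
The two steps above (index once, then search repeatedly) could be scripted roughly as follows. This is only a sketch: it builds the argument lists for `mmseqs createindex` and `mmseqs search` without running them, the database and directory names are placeholders, and passing the same `-k` (and optional `-s`) to both commands is an assumption based on the linked Pfam recipe. Check `mmseqs createindex -h` and `mmseqs search -h` for the options your version supports.

```python
import shlex


def build_profile_search_commands(query_db, target_profile_db, result_db,
                                  tmp_dir, kmer_size=5, sensitivity=None):
    """Build (index_cmd, search_cmd) argument lists; nothing is executed here."""
    index_cmd = ["mmseqs", "createindex", target_profile_db, tmp_dir,
                 "-k", str(kmer_size)]
    search_cmd = ["mmseqs", "search", query_db, target_profile_db, result_db,
                  tmp_dir, "-k", str(kmer_size)]
    if sensitivity is not None:
        # Assumption: index and search should be given the same sensitivity.
        index_cmd += ["-s", str(sensitivity)]
        search_cmd += ["-s", str(sensitivity)]
    return index_cmd, search_cmd


# Placeholder names; substitute your own databases and temporary directory.
index_cmd, search_cmd = build_profile_search_commands(
    "queryDB", "profileDB", "resultDB", "tmp", kmer_size=5
)
print(shlex.join(index_cmd))
print(shlex.join(search_cmd))
```

The lists can be passed to `subprocess.run(cmd, check=True)` once the paths point at real databases. The index only needs to be built once per target database and parameter set.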

This should scale well for a couple hundred thousand profiles. For searches against millions of profiles we will hopefully have a different solution soon. There is still some benchmarking to do before it's ready for prime time.

berkelem commented 5 years ago

Yes, I mean a profile database. There are about 1.2 million entries in the database. Thanks for the suggestions. I have got it working with -k 5 -s 1 after pre-computing the index, and it only takes a few minutes. I'll play around with the sensitivity to see how far I can push it in a reasonable time.

milot-mirdita commented 5 years ago

-s 1 is usually not that useful. We just published a benchmark of the profile search with the MMseqs2 Webserver: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty1057/5280135

Around -s 6 would be the preferred sensitivity level, though that might have too steep memory requirements for 1.2M profiles. We will have a different kind of profile search soon with different trade-offs for larger databases.