Open rezahay opened 2 years ago
The normal path from an MMseqs2 search result to a profile would be through the result2profile module:
mmseqs search qdb tdb res tmp
mmseqs result2profile qdb tdb res prof
The prof database will contain the internal MMseqs2 binary profile format, which is currently 25 bytes per MSA column.
You can use the profile2pssm module to convert this into a human-readable output file:
mmseqs search qdb tdb res tmp
mmseqs result2profile qdb tdb res prof
mmseqs profile2pssm prof pssm [--db-output 0/1 if you want to continue with a plain text file or an mmseqs database]
You can also go from a bunch of MSAs to PSSMs (this is very roundabout, not recommended):
mmseqs search qdb tdb res tmp
mmseqs result2msa qdb tdb res msa
mmseqs msa2profile msa prof
mmseqs profile2pssm prof pssm
This makes more sense if you already have a bunch of external MSAs (e.g. in a tar archive):
mmseqs tar2db archive_with_msa.tar.gz msa --output-dbtype 11
mmseqs msa2profile msa prof
mmseqs profile2pssm prof pssm
Good morning Milot. Thanks a lot for your response.
I got PSSMs by running the following commands:
ls -l
rw-r--r-- 1 1693345 Jul 28 09:24 out.mm_msa
rw-r--r-- 1 4 Jul 28 09:24 out.mm_msa.dbtype
rw-r--r-- 1 29 Jul 28 09:24 out.mm_msa.index
mmseqs msa2profile out.mm_msa prof
mmseqs profile2pssm prof pssm
It's extremely fast. The PSSM format is as follows:
Pos Cns A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
0   S   5  -1 -2 -3 0  -1 -2 -2 -2 -3 -1 -1 -1 -3 -1 6  0  0  -2 -1
1   L   0  -1 -2 -3 0  -2 -1 -2 -2 4  8  -1 -1 -3 -2 -2 -1 0  -2 -1
2   E   1  -1 0  4  0  1  -2 -2 2  -3 -1 2  -1 -2 -2 0  -4 -1 -2 -1
Are the numbers log values? I need them normalized between 0 and 1. Any hint is welcome.
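One possible way to get values between 0 and 1 is sketched below in Python. It is not something the thread confirms: it assumes the scores are PSI-BLAST-style log-odds integers, that the plain-text file holds a single profile in the layout shown above, and that either a logistic (sigmoid) squash or a min-max rescale is acceptable for the downstream model. The function names `parse_pssm`, `sigmoid_normalize`, and `minmax_normalize` are made up for this illustration and are not part of MMseqs2.

```python
import math

# The three example rows quoted in this post, in the layout shown above.
EXAMPLE = """\
Pos Cns A C D E F G H I K L M N P Q R S T V W Y
0 S 5 -1 -2 -3 0 -1 -2 -2 -2 -3 -1 -1 -1 -3 -1 6 0 0 -2 -1
1 L 0 -1 -2 -3 0 -2 -1 -2 -2 4 8 -1 -1 -3 -2 -2 -1 0 -2 -1
2 E 1 -1 0 4 0 1 -2 -2 2 -3 -1 2 -1 -2 -2 0 -4 -1 -2 -1
"""


def parse_pssm(text):
    """Return (consensus, scores) from PSSM text with 20 score columns.

    Assumption: every data line is "position consensus s1 ... s20".
    Any other line (header, blank, comments) is skipped.
    """
    consensus, scores = [], []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 22 or not fields[0].isdigit():
            continue
        consensus.append(fields[1])
        scores.append([float(v) for v in fields[2:]])
    return consensus, scores


def sigmoid_normalize(scores):
    """Map each score x to 1 / (1 + exp(-x)), which lies strictly in (0, 1)."""
    return [[1.0 / (1.0 + math.exp(-x)) for x in row] for row in scores]


def minmax_normalize(scores):
    """Rescale all scores linearly so the matrix minimum is 0 and maximum is 1."""
    flat = [x for row in scores for x in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # constant matrix: avoid division by zero
        return [[0.0 for _ in row] for row in scores]
    return [[(x - lo) / (hi - lo) for x in row] for row in scores]


consensus, scores = parse_pssm(EXAMPLE)
sig = sigmoid_normalize(scores)
mm = minmax_normalize(scores)
print(consensus)
print([round(v, 3) for v in sig[0][:5]])
print([round(v, 3) for v in mm[0][:5]])
```

The sigmoid variant treats every position independently and does not depend on the rest of the matrix, so the same score always maps to the same feature value. The min-max variant depends on the extremes of each individual profile, so values are not directly comparable across proteins.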
Kind regards,
Dear all,
Does MMseqs2 provide a way (a command) to generate PSSM profiles from the MMseqs2 multiple sequence alignment (MSA) output file out.mm_msa?
I think providing such functionality would help a lot. That way, we would no longer need to call psiblast (which degrades the runtime performance of machine-learning tools) to generate PSSM features for the protein residues.
Thanks in advance