Getting PSSMs from the msa output of mmseqs

rezahay commented 2 years ago

Dear all,

Does MMseqs2 provide a command to generate PSSM profiles from the MMseqs2 multiple sequence alignment (MSA) output file out.mm_msa?

I think such functionality would help a lot. We would no longer need to call psiblast (which slows down the machine-learning tools) to generate PSSM features for the protein residues. Is that right?

Thanks in advance

milot-mirdita commented 2 years ago

The normal path from an MMseqs2 search result to a profile would be through the result2profile module:

mmseqs search qdb tdb res tmp
mmseqs result2profile qdb tdb res prof

The prof database will contain the internal MMseqs2 binary profile format, which is currently 25 bytes per MSA column. You can use the profile2pssm module to convert this into a human-readable output file:

mmseqs search qdb tdb res tmp
mmseqs result2profile qdb tdb res prof
mmseqs profile2pssm prof pssm [--db-output 0/1 if you want to continue with a plain text file or an mmseqs database]

You can also go from a bunch of MSAs to PSSMs (this is very roundabout, not recommended):

mmseqs search qdb tdb res tmp
mmseqs result2msa qdb tdb res msa
mmseqs msa2profile msa prof
mmseqs profile2pssm prof pssm

This makes more sense if you already have a bunch of external MSAs (e.g. in a tar archive):

mmseqs tar2db archive_with_msa.tar.gz msa --output-dbtype 11
mmseqs msa2profile msa prof
mmseqs profile2pssm prof pssm

rezahay commented 2 years ago

Good morning Milot. Thanks a lot for your response.

I got PSSMs by running the following commands:

ls -l
rw-r--r-- 1 1693345 Jul 28 09:24 out.mm_msa
rw-r--r-- 1 4 Jul 28 09:24 out.mm_msa.dbtype
rw-r--r-- 1 29 Jul 28 09:24 out.mm_msa.index
mmseqs msa2profile out.mm_msa prof
mmseqs profile2pssm prof pssm

It's extremely fast. The pssm format is as follows:

Pos Cns   A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y
0   S     5  -1  -2  -3   0  -1  -2  -2  -2  -3  -1  -1  -1  -3  -1   6   0   0  -2  -1
1   L     0  -1  -2  -3   0  -2  -1  -2  -2   4   8  -1  -1  -3  -2  -2  -1   0  -2  -1
2   E     1  -1   0   4   0   1  -2  -2   2  -3  -1   2  -1  -2  -2   0  -4  -1  -2  -1

Are the numbers log values? I need them normalized to the range 0 to 1. Any hint is welcome.

Kind regards,
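
For the normalization question, one option often used with PSSM-style integer scores is to pass each score through the logistic function 1 / (1 + e^(-x)), which maps any real value into the open interval (0, 1). The sketch below is only an illustration under assumptions: the thread does not confirm the scale or units of the MMseqs2 scores, the function name normalize_pssm is hypothetical, and the input is assumed to have the whitespace-separated layout shown above (a Pos/Cns header followed by one row per position). Lines that do not start with an integer position are passed through unchanged.

```shell
#!/bin/sh
# Hypothetical helper (not part of MMseqs2): squash PSSM scores into (0, 1)
# with the logistic function. Assumes the layout shown in the thread:
#   Pos Cns A C D ... Y     header line
#   0   S   5 -1 -2 ...     one row per position
normalize_pssm() {
    awk '
        # Data rows start with an integer position; pass anything else through.
        $1 !~ /^[0-9]+$/ { print; next }
        {
            # Keep the position and consensus residue as they are.
            printf "%s %s", $1, $2
            # Replace each score x with 1 / (1 + exp(-x)), three decimals.
            for (i = 3; i <= NF; i++)
                printf " %.3f", 1 / (1 + exp(-$i))
            printf "\n"
        }
    ' "$@"
}
```

It could be called as `normalize_pssm pssm > pssm.scaled`, or fed from a pipe. Min-max scaling per row or per matrix would be another way to reach the same range; which one suits a given model is a judgment call.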

soedinglab / MMseqs2

Getting PSSMs from the msa output of mmseqs #580