Create profiles from multiline fasta MSA

genomewalker commented 7 years ago

Expected Behavior

When using a MSA DB from a fasta MSA with multi-lines msa2profile should create a profile.

Current Behavior

Now it fails with the error Member sequence 0 in entry 0 too long!

Steps to Reproduce (for bugs)

Fasta MSA with multi-line: test_msa_ml.fa.gz Fasta MSA with single line: test_msa_sl.fa.gz

When creating the profile with the multi line MSA:

ffindex_build -s test_msa_ml_db test_msa_ml_db.index test_msa_ml.fa
mmseqs msa2profile test_msa_ml_db test_profile_ml_db --msa-type 2 --threads 1

It results in:

Program call:
test_msa_ml_db test_profile_ml_db --msa-type 2 --threads 1

MMseqs Version:                         f1307fed7fda354ba3a1d126960de17e95ef70e9
MSA type                                2
Sub Matrix                              blosum62.out
Pseudo count a                          1
Pseudo count b                          1.5
Compositional bias                      1
Use global sequence weighting           false
Filter MSA                              1
Minimum coverage                        0
Minimum seq. id.                        0
Minimum score per column                -20
Maximum sequence identity threshold     0.9
Select n most diverse seqs              1000
Threads                                 1
Verbosity                               3

Finding maximum sequence length and set size.
Compute profiles from MSAs.
Member sequence 0 in entry 0 too long!
Invalid msa 0! Skipping entry.
Time for merging files: 0 h 0 m 0 s
Time for merging files: 0 h 0 m 0 s
Time for merging files: 0 h 0 m 0 s
Time for merging files: 0 h 0 m 0 s
Time for processing: 0 h 0 m 0s

But when using the fasta MSA with single lines:

ffindex_build -s test_msa_sl_db test_msa_sl_db.index test_msa_sl.fa
mmseqs msa2profile test_msa_sl_db test_profile_sl_db --msa-type 2 --threads 1

It seems to work perfectly:

Program call:
test_msa_sl_db test_profile_sl_db --msa-type 2 --threads 1

MMseqs Version:                         f1307fed7fda354ba3a1d126960de17e95ef70e9
MSA type                                2
Sub Matrix                              blosum62.out
Pseudo count a                          1
Pseudo count b                          1.5
Compositional bias                      1
Use global sequence weighting           false
Filter MSA                              1
Minimum coverage                        0
Minimum seq. id.                        0
Minimum score per column                -20
Maximum sequence identity threshold     0.9
Select n most diverse seqs              1000
Threads                                 1
Verbosity                               3

Finding maximum sequence length and set size.
Compute profiles from MSAs.
Time for merging files: 0 h 0 m 0 s
Time for merging files: 0 h 0 m 0 s
Time for merging files: 0 h 0 m 0 s
Time for merging files: 0 h 0 m 0 s
Time for processing: 0 h 0 m 0s

Your Environment

Include as many relevant details about the environment you experienced the bug in.

Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters): f1307fed7fda354ba3a1d126960de17e95ef70e9
Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): self-compiled
For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation: gcc version 5.4.0 and cmake version 3.9.4
Server specifications (especially CPU support for AVX2/SSE and amount of system memory): CPU: Intel(R) Xeon(R) CPU E7-4820 v4 @ 2.00GHz Memory: 2TB
Operating system and version: Debian 3.16.43-2+deb8u5 (2017-09-19) x86_64 GNU/Linux

milot-mirdita commented 7 years ago

Thanks for the bug report.

                case '\n':
                    if (inHeader) {
                        inHeader = false;
                    } else {
                        if (seqLength > maxSeqLength) {
                            maxSeqLength = seqLength;
                        }
                        seqLength = 0;
                    }
                    break;
                default:
                    if (!inHeader) {
                        seqLength++;
                    }
                    break;

These lines in util/msa2profile.cpp are wrong, i'll see if I can fix the issue tomorrow.

(Side remark, databases from ffindex_build calls are somewhat dangerous for mmseqs modules that do random accesses (msa2profile only does linear access). Please sort the index file numerically first (LC_ALL=C sort -n DB.index > DB.index_sorted && mv -f DB.index_sorted DB.index)).

genomewalker commented 7 years ago

Thanks Milot!

milot-mirdita commented 7 years ago

Thanks for the bug report again. The issue should now be resolved.

Another caveat about msa2profile. It requires a sensible query sequence, since all gap columns with respect to the the query are discarded. You might first want to execute hhconsensus from our hh-suite software on each MSA. With the -M parameter you can choose which columns are match columns are included for consensus sequence computation. This consensus sequence is then prepended to the MSA and msa2profile will not discard potentially useful columns in the MSA.

genomewalker commented 7 years ago

Thanks! Just tested it and seems to work fine now. Regarding the consensus building, we already have HMM (HMMER3) profiles for those MSAs. Reading MMseqs2 help there is convertprofiledb that seems to be able to convert the HMM from HMMER3 to the MMeqs2 format. Do you recommend it? Or better I go through hhconsensus and use msa2profile?

milot-mirdita commented 7 years ago

Using the HMMER3 hmms is probably a bad idea, they already include pseudocounts, which will negatively affect the sensitivity of MMseqs2.

I am currently evaluating all those tools again, but I don't have a clear recommendation yet.

If your first sequence in the alignment is not a real query, then your two options are to use hhconsensus + msa2profile or hhconsensus + hhmake -nocontxt -diff 1000 (way slower) + convertprofiledb. msa2profile is the newer tool as not as well tested as convertprofiledb. The testing we already did indicates that it is working well though.

soedinglab / MMseqs2