soedinglab / hh-suite

Remote protein homology detection suite.
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3019-7
GNU General Public License v3.0
527 stars 132 forks source link

Segmentation fault when searching against custom database #151

Open rmostowy opened 5 years ago

rmostowy commented 5 years ago

Expected Behavior

I am trying to build a custom database from my own fasta file by (1) clustering all protein sequences using mmseqs and then (2) following the steps of [https://github.com/soedinglab/hh-suite/wiki#building-customized-databases], and then search this database with a query sequence.

Current Behavior

While clustering and database creation finish without problems, I run into problems when performing multiple iterations of hhblits search against the newly build database.

Steps to Reproduce (for bugs)

All files mentioned here are compressed in the attached files.zip.

I begin with a fasta file input.fa, and then execute the following series of commands:

# 1. cluster
mmseqs createdb input.fa DB
mmseqs cluster DB DB_clu tmp

# 2. create msa
mmseqs result2msa DB DB DB_clu msa --compress

# 3. make hmm profiles
mpirun -np 4 ffindex_apply msa_ca3m.ff{data,index} -i msa_hhm.ffindex -d msa_hhm.ffdata -- hhmake -i stdin -o stdout -v 03

# 4. create the cs219 formats
ln -s DB_h msa_header.ffdata
ln -s DB_h.index msa_header.ffindex
ln -s DB msa_sequence.ffdata
ln -s DB.index msa_sequence.ffindex
mpirun -np 4 cstranslate -f -i msa -o msa_cs219 -A /usr/local/hhsuite/data/cs219.lib -D /usr/local/hhsuite/data/context_data.lib -x 0.3 -c 4 -I ca3m -b

# 5. reorder the entries in the hmm and a3m databases
sort -k3 -n msa_cs219.ffindex | cut -f1 > sorting.dat

ffindex_order sorting.dat msa_hhm.ff{data,index} msa_hhm_ordered.ff{data,index}
mv msa_hhm_ordered.ffindex msa_hhm.ffindex
mv msa_hhm_ordered.ffdata msa_hhm.ffdata

ffindex_order sorting.dat msa_ca3m.ff{data,index} msa_ca3m_ordered.ff{data,index}
mv msa_ca3m_ordered.ffindex msa_ca3m.ffindex
mv msa_ca3m_ordered.ffdata msa_ca3m.ffdata

# 6. clean
rm -rf DB* msa_header.ff{data,index} msa_sequence.ff{data,index} sorting.dat tmp msa msa.index

After this, I perform a search with a query sequence using

hhblits -i query.fa -d msa -o out.hhr

HH-suite Output (for bugs)

The full output of the query search is:

- 14:34:18.359 INFO: Searching 87 column state sequences.

- 14:34:18.716 INFO: query.fa is in A2M, A3M or FASTA format

- 14:34:18.717 INFO: Iteration 1

- 14:34:18.997 INFO: Prefiltering database

- 14:34:19.268 INFO: HMMs passed 1st prefilter (gapless profile-profile alignment)  : 87

- 14:34:19.271 INFO: HMMs passed 2nd prefilter (gapped profile-profile alignment)   : 87

- 14:34:19.271 INFO: HMMs passed 2nd prefilter and not found in previous iterations : 87

- 14:34:19.271 INFO: Scoring 87 HMMs using HMM-HMM Viterbi alignment

- 14:34:19.399 INFO: Alternative alignment: 0

- 14:34:19.422 ERROR: sequence length (467) does not fit to read MSA (match states: 468) of file 74!

- 14:34:19.422 ERROR:   Your cs219 states might not fit your multiple sequence alignments.

- 14:34:19.448 ERROR: sequence length (411) does not fit to read MSA (match states: 412) of file 85!

- 14:34:19.448 ERROR:   Your cs219 states might not fit your multiple sequence alignments.

- 14:34:19.459 ERROR: sequence length (402) does not fit to read MSA (match states: 403) of file 84!

- 14:34:19.459 ERROR:   Your cs219 states might not fit your multiple sequence alignments.

- 14:34:19.462 ERROR: sequence length (396) does not fit to read MSA (match states: 397) of file 73!

- 14:34:19.462 ERROR:   Your cs219 states might not fit your multiple sequence alignments.

- 14:34:19.468 ERROR: sequence length (387) does not fit to read MSA (match states: 388) of file 45!

- 14:34:19.468 ERROR:   Your cs219 states might not fit your multiple sequence alignments.

- 14:34:19.476 ERROR: sequence length (377) does not fit to read MSA (match states: 378) of file 90!

- 14:34:19.476 ERROR:   Your cs219 states might not fit your multiple sequence alignments.

- 14:34:19.493 ERROR: sequence length (354) does not fit to read MSA (match states: 355) of file 89!

- 14:34:19.493 ERROR:   Your cs219 states might not fit your multiple sequence alignments.

- 14:34:19.505 ERROR: sequence length (324) does not fit to read MSA (match states: 325) of file 79!

- 14:34:19.506 ERROR:   Your cs219 states might not fit your multiple sequence alignments.

- 14:34:19.518 ERROR: sequence length (327) does not fit to read MSA (match states: 328) of file 68!

- 14:34:19.518 ERROR:   Your cs219 states might not fit your multiple sequence alignments.

- 14:34:19.524 ERROR: sequence length (308) does not fit to read MSA (match states: 309) of file 66!

- 14:34:19.524 ERROR:   Your cs219 states might not fit your multiple sequence alignments.

- 14:34:19.554 INFO: 87 alignments done

- 14:34:19.554 INFO: Alternative alignment: 1

- 14:34:19.572 INFO: 7 alignments done

- 14:34:19.572 INFO: Alternative alignment: 2

- 14:34:19.572 INFO: Alternative alignment: 3

- 14:34:19.699 INFO: Realigning 10 HMM-HMM alignments using Maximum Accuracy algorithm

Segmentation fault: 11

Context

The context is that I would like to create a matrix of HMM-HMM similarities for a numer of different protein sub-families, each of which has an individual HMM profile. Hence, I want to combine all HMMs into a single database to search against.

Your Environment

Include as many relevant details about the environment you experienced the issue in.

milot-mirdita commented 5 years ago

Does it also crash if you run the same search with HHsearch instead?

rmostowy commented 5 years ago

No it doesn't, nor does it crash with the command above and -n 1 option.

milot-mirdita commented 5 years ago

Will your database remain small (<100k profiles)? We generally recommend using HHsearch for small databases due its slightly higher sensitivity. Also iterations don't make sense to search against a small database. Generally the workflow is to build a highly diverse query profile by searching against a large profile database (Uniclust) and then doing a HHsearch profile-profile search for maximum sensitivity.

I'll try to take a look at the issue anyway in the next few days. Thanks for the bug report!

rmostowy commented 5 years ago

Thank you Milot, that is quite helpful as I've only just started using the HHSuite. I'll reconsider my approach. Indeed, the example I used is based on a very small input, however eventually I would like to extend it to a large protein database.

One more question. Can one generate the 'msa' file, which is the output of the mmseqs result2msa command, with custom MSA fasta files? Basically, I cannot find anywhere steps to go from a number of custom MSAs:

data/msa0001.aln
data/msa0002.aln
.
.
.

etc., to a custom HHSuite database with the hhm, cs219 and c23m files. I searched for this extensively but didn't find any examples.

eli1199 commented 5 years ago

Hey @rmostowy , just a quick question about #3 in your steps, as I am hung on that step. I made my MSA using muscle and so all I have is one file as output (fa.out file). ffindex_apply in step #3 looks like it requires two files msa_ca3m.ff{data,index}, where did you get the index file from? Does mmseqs give you this? thanks for your help.