soedinglab / hh-suite

Remote protein homology detection suite.
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3019-7
GNU General Public License v3.0
529 stars 133 forks source link

build custom database #273

Closed hartrials closed 3 years ago

hartrials commented 3 years ago

Steps to Reproduce (for bugs)

I plan to use Hmmer software to generate multiple sequence alignment, and then use the results of Hmmer alignment to build a custom database.

  1. I run HMMER software to generate MSAs.
  2. I convert the HMMER MSAs Stockholm format to a3m format by hhsuite/scripts/reformat.pl.
  3. I use the pipeline to build custom database.
  4. I use the same query FASTA as the input of hhblits to compare the custom database which I built.

    My question

    Is the way I build a custom database reasonable?

    Command

    I get the HMMER MSAs and named it test.sto. And the next commands are as follows:

  5. reformat.pl sto a3m test.sto test.a3m
  6. ffindex_from_fasta -s new/first_a3m.ffdata new/first_a3m.ffindex test.a3m
  7. ffindex_apply -d new/new_a3m.ffdata -i new/new_a3m.ffindex new/first_a3m.ffdata new/first_a3m.ffindex -- hhconsensus -M 50 -i stdin -oa3m stdout -v 1
  8. ffindex_apply -d new/new_hhm.ffdata -i new/new_hhm.ffindex new/new_a3m.ffdata new/new_a3m.ffindex -- hhmake -i stdin -o stdout -v 1
  9. cstranslate -A cs219_lib -D context_data_lib -x 0.3 -c 4 -f -i new/new_a3m -o new/new_cs219 -I a3m
  10. Now I have my own database and I run hhblits command as follows: hhblits -i query.fasta -d new/new -o result.hhr -oa3m result.a3m -cpu 3 -qsc -30 -id 99 -cov 50 -diff inf

    My Environment

    Include as many relevant details about the environment you experienced the issue in.

    • Version/Git commit used: hhsuite 3.3.0
    • Server specifications (especially CPU support for AVX2/SSE and amount of system memory): CPU support for AVX2 and system memory 256G
    • Operating system and version: Ubuntu 16.04
milot-mirdita commented 3 years ago

This is valid but probably doesn't actually do what you expect it to do.

You probably want:

cd directory_containing_many_sto_files
ffindex_build -s ../stodb.ffdata ../stodb.ffindex .
cd ..
ffindex_apply -d first_a3m.ffdata -i first_a3m.ffindex stodb.ffdata stodb.ffindex -- reformat.pl sto a3m - - -v 0

I think you also want -v 0 for hhconsensus and hhmake

hartrials commented 3 years ago

I just want to build custom database to hhblits. Because HMMER can search fasta database not hhsuite database, I use more fasta database to generate more MSAs. But I know little about HMMER parameters. So I want to use the MSAs generated from HMMER to build custom hhsuite database. By the way, test.sto file is generated by jackhmmer. I use a single sequence query fasta to search fasta database. And I only generate one sto file. I edit my commands as follows:

Command

I get the HMMER MSAs and named it test.sto. And the next commands are as follows:

  1. reformat.pl sto a3m test.sto test.a3m
  2. ffindex_build -s new/first_a3m.ffdata new/first_a3m.ffindex test.a3m
  3. ffindex_apply -d new/new_a3m.ffdata -i new/new_a3m.ffindex new/first_a3m.ffdata new/first_a3m.ffindex -- hhconsensus -M 50 -i stdin -oa3m stdout -v 1
  4. ffindex_apply -d new/new_hhm.ffdata -i new/new_hhm.ffindex new/new_a3m.ffdata new/new_a3m.ffindex -- hhmake -i stdin -o stdout -v 1
  5. cstranslate -A cs219_lib -D context_data_lib -x 0.3 -c 4 -f -i new/new_a3m -o new/new_cs219 -I a3m
  6. Now I have my own database and I run hhblits command as follows: hhblits -i query.fasta -d new/new -o result.hhr -oa3m result.a3m -cpu 3 -qsc -30 -id 99 -cov 50 -diff inf

I'm sorry I have two more questions.

  1. What is the meaning of the parameter "-M 50" in step 3?
  2. Can I use "-M first" or "-M a2m" or "-M a3m" parameters?

Thank you very much for your patience in answering my question!

milot-mirdita commented 3 years ago

If you have just a single query sequence/MSA and a single target MSA, I would recommend to just use hhalign. Building a whole database for a single MSA doesn't really make a lot of sense.

The steps in your second post do work though to make a full hhblits database from a single MSA, it's just a bit odd.

The -M parameter decides which MSA columns are chosen as match states, generally -M 50 works well. It means every column that has less than 50% gaps is a match column. See towards the end of this section: https://github.com/soedinglab/hh-suite/wiki#searching-databases-of-hmms-using-hhsearch-and-hhblits

hartrials commented 3 years ago

Thank you very much!