build custom database - Githubissues

hartrials commented 3 years ago

Steps to Reproduce (for bugs)

I plan to use Hmmer software to generate multiple sequence alignment, and then use the results of Hmmer alignment to build a custom database.

I run HMMER software to generate MSAs.
I convert the HMMER MSAs Stockholm format to a3m format by hhsuite/scripts/reformat.pl.
I use the pipeline to build custom database.
I use the same query FASTA as the input of hhblits to compare the custom database which I built.
My question

Is the way I build a custom database reasonable?

Command

I get the HMMER MSAs and named it test.sto. And the next commands are as follows:
reformat.pl sto a3m test.sto test.a3m
ffindex_from_fasta -s new/first_a3m.ffdata new/first_a3m.ffindex test.a3m
ffindex_apply -d new/new_a3m.ffdata -i new/new_a3m.ffindex new/first_a3m.ffdata new/first_a3m.ffindex -- hhconsensus -M 50 -i stdin -oa3m stdout -v 1
ffindex_apply -d new/new_hhm.ffdata -i new/new_hhm.ffindex new/new_a3m.ffdata new/new_a3m.ffindex -- hhmake -i stdin -o stdout -v 1
cstranslate -A cs219_lib -D context_data_lib -x 0.3 -c 4 -f -i new/new_a3m -o new/new_cs219 -I a3m
Now I have my own database and I run hhblits command as follows: hhblits -i query.fasta -d new/new -o result.hhr -oa3m result.a3m -cpu 3 -qsc -30 -id 99 -cov 50 -diff inf
My Environment

Include as many relevant details about the environment you experienced the issue in.
- Version/Git commit used: hhsuite 3.3.0
- Server specifications (especially CPU support for AVX2/SSE and amount of system memory): CPU support for AVX2 and system memory 256G
- Operating system and version: Ubuntu 16.04

milot-mirdita commented 3 years ago

This is valid but probably doesn't actually do what you expect it to do.

If test.sto contains multiple MSAs it will discard all but the first.
ffindex_from_fasta will create a separate database entry (search target) for each sequence in the first MSA, the input for the next tools will thus be sequences and not MSAs anymore.

You probably want:

cd directory_containing_many_sto_files
ffindex_build -s ../stodb.ffdata ../stodb.ffindex .
cd ..
ffindex_apply -d first_a3m.ffdata -i first_a3m.ffindex stodb.ffdata stodb.ffindex -- reformat.pl sto a3m - - -v 0

I think you also want -v 0 for hhconsensus and hhmake

hartrials commented 3 years ago

I just want to build custom database to hhblits. Because HMMER can search fasta database not hhsuite database, I use more fasta database to generate more MSAs. But I know little about HMMER parameters. So I want to use the MSAs generated from HMMER to build custom hhsuite database. By the way, test.sto file is generated by jackhmmer. I use a single sequence query fasta to search fasta database. And I only generate one sto file. I edit my commands as follows:

Command

I get the HMMER MSAs and named it test.sto. And the next commands are as follows:

reformat.pl sto a3m test.sto test.a3m
ffindex_build -s new/first_a3m.ffdata new/first_a3m.ffindex test.a3m
ffindex_apply -d new/new_a3m.ffdata -i new/new_a3m.ffindex new/first_a3m.ffdata new/first_a3m.ffindex -- hhconsensus -M 50 -i stdin -oa3m stdout -v 1
ffindex_apply -d new/new_hhm.ffdata -i new/new_hhm.ffindex new/new_a3m.ffdata new/new_a3m.ffindex -- hhmake -i stdin -o stdout -v 1
cstranslate -A cs219_lib -D context_data_lib -x 0.3 -c 4 -f -i new/new_a3m -o new/new_cs219 -I a3m
Now I have my own database and I run hhblits command as follows: hhblits -i query.fasta -d new/new -o result.hhr -oa3m result.a3m -cpu 3 -qsc -30 -id 99 -cov 50 -diff inf

I'm sorry I have two more questions.

What is the meaning of the parameter "-M 50" in step 3?
Can I use "-M first" or "-M a2m" or "-M a3m" parameters?

Thank you very much for your patience in answering my question!

milot-mirdita commented 3 years ago

If you have just a single query sequence/MSA and a single target MSA, I would recommend to just use hhalign. Building a whole database for a single MSA doesn't really make a lot of sense.

The steps in your second post do work though to make a full hhblits database from a single MSA, it's just a bit odd.

The -M parameter decides which MSA columns are chosen as match states, generally -M 50 works well. It means every column that has less than 50% gaps is a match column. See towards the end of this section: https://github.com/soedinglab/hh-suite/wiki#searching-databases-of-hmms-using-hhsearch-and-hhblits

hartrials commented 3 years ago

Thank you very much!

soedinglab / hh-suite

build custom database #273

Steps to Reproduce (for bugs)

My question

Command

My Environment

Command