soedinglab / hh-suite

Remote protein homology detection suite.
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3019-7
GNU General Public License v3.0

build a custom db for HHblits #286

Closed by Ahmedroumia 2 years ago

Ahmedroumia commented 2 years ago

Dear all,

I have a database of 54 thousand protein sequences and want to search every sequence against all the database sequences using hhblits on my Linux machine. I have executed the following:

1- Run hhblits on each of your sequences to generate a3m and hhm files.
2- Assemble all the data (assuming all a3m files are in directory a3m/, all hhms in directory hhm/):
a- ffindex_build db_a3m.ffdata db_a3m.ffindex a3m/
b- ffindex_build db_hhm.ffdata db_hhm.ffindex hhm/
c- LC_ALL=C sort db_hhm.ffindex > db_hhm.ffindex.simpleSort
d- LC_ALL=C sort db_a3m.ffindex > db_a3m.ffindex.simpleSort
e- mv db_a3m.ffindex db_full_a3m.ffindex.orig
f- mv db_hhm.ffindex db_hhm.ffindex.orig
g- ln -s db_a3m.ffindex.simpleSort db_a3m.ffindex
h- ln -s db_hhm.ffindex.simpleSort db_hhm.ffindex
i- export OMP_NUM_THREADS=$(nproc)
j- cstranslate -A /usr/share/hhsuite/data/cs219.lib -D /usr/share/hhsuite/data/context_data.lib -x 0.3 -c 4 -f -i db_a3m -o db_cs219 -I a3m -b

The problem is in step j, when running the cstranslate command. It says: Could not read entry: d1rgxcg.77.1.1.fasta, Message: Header of sequence 1 starts with:

Here is an example of my FASTA sequence file: [image: seqs]

Here is the error: [image: error]
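The error text above is cut off and the thread never identifies its cause. If it refers to the first character of an entry's header, which is only an assumption, a quick check is to list the files in the input directory whose first line does not begin with '>'. The function name list_bad_headers is hypothetical and not part of HH-suite; this is a diagnostic sketch, not a confirmed fix.

```shell
#!/bin/sh
# Hypothetical helper (not an HH-suite tool): print every regular file in
# a directory whose first line does not start with '>'.
# Assumption: the truncated cstranslate message is about the first
# character of the first header line of an entry.
list_bad_headers() {
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue          # skip anything that is not a regular file
        first=$(head -n 1 "$f")          # first line only
        case $first in
            '>'*) ;;                     # looks like a FASTA/A3M header
            *) printf '%s\n' "$f" ;;     # report everything else
        esac
    done
}

# Example call, using the a3m/ directory from the steps above:
# list_bad_headers a3m/
```

An empty result only means that every file starts with '>'. It does not rule out other formatting problems inside the entries.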

Ahmedroumia commented 2 years ago

The problem has been solved! Here are the right commands, as mentioned in the user guide too:

1- ffindex_from_fasta -s scop2_06_fas.ff{data,index} scop2_06
2- hhblits_omp -i scop2_06 -d ../scop70_1_75 -oa3m scop2_06_a3m_wo_ss -n 2 -cpu 4 -v 0
3- ffindex_apply scop2_06_a3m_wo_ss.ff{data,index} -i scop2_06_a3m.ffindex -d scop2_06_a3m.ffdata -- addss.pl -v 0 stdin stdout (optional step)
4- cstranslate -f -x 0.3 -c 4 -I a3m -i scop2_06_a3m -o scop2_06_cs219
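The four commands in this comment can be collected into one script, roughly as sketched below. The arguments are copied from the comment as written. The names build_db, run and DRY_RUN are hypothetical, and the script defaults to printing the commands rather than running them, so it can be inspected without HH-suite installed. It has not been validated against a real database.

```shell
#!/bin/sh
# Sketch only: wraps the four commands from the comment above.
# build_db, run and DRY_RUN are illustrative names, not part of HH-suite.
DRY_RUN=${DRY_RUN:-1}   # 1 (default): print commands; 0: execute them

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

build_db() {
    db=$1          # database prefix, e.g. scop2_06
    search_db=$2   # database searched by hhblits_omp, e.g. ../scop70_1_75

    # 1. FASTA to ffindex (.ff{data,index} written as two explicit names)
    run ffindex_from_fasta -s "${db}_fas.ffdata" "${db}_fas.ffindex" "$db" &&
    # 2. one alignment per sequence, without secondary structure
    run hhblits_omp -i "$db" -d "$search_db" -oa3m "${db}_a3m_wo_ss" \
        -n 2 -cpu 4 -v 0 &&
    # 3. add secondary structure (marked as optional in the comment)
    run ffindex_apply "${db}_a3m_wo_ss.ffdata" "${db}_a3m_wo_ss.ffindex" \
        -i "${db}_a3m.ffindex" -d "${db}_a3m.ffdata" \
        -- addss.pl -v 0 stdin stdout &&
    # 4. column-state sequences for prefiltering
    run cstranslate -f -x 0.3 -c 4 -I a3m -i "${db}_a3m" -o "${db}_cs219"
}

build_db scop2_06 ../scop70_1_75
```

Setting DRY_RUN=0 in the environment makes the same function execute the commands. Chaining with && stops the pipeline at the first command that fails.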

Citugulia40 commented 8 months ago

Hi,

Can you please let me know the exact steps for running HHblits? I have two FASTA files, one with 250 sequences and one with 2 million sequences, and I want to search the 2 million sequences against the 250 sequences to find homologs.

Please help

Thanks