soedinglab / hh-suite

Remote protein homology detection suite.
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3019-7
GNU General Public License v3.0

building custom databases #165

Open DaMaoShan opened 5 years ago

DaMaoShan commented 5 years ago

Has anyone tried building their own databases for hhblits? I mean going from a protein FASTA file to an hhblits database.

I am using the pipeline listed in the hh-suite wiki: https://github.com/soedinglab/uniclust-pipeline

But it seems to have a lot of bugs.

Thanks in advance!

aschafu commented 5 years ago

Hi, yes, I build my own version of the pdb database (pdb_full) in the context of PSSH2 (database of sequence to structure alignments). I use the AWS cloud to run the steps (see https://github.com/aschafu/PSSH2/tree/master/src/cloud), but I guess you can extract the important bits and rework this for your problem:

  1. run hhblits on each of your sequences to generate a3m and hhm files
  2. assemble all the data (assuming all a3m files are in directory a3m/, all hhms in directory hhm/):
    /usr/share/hhsuite/bin/ffindex_build pdb_full_a3m.ffdata pdb_full_a3m.ffindex a3m/
    /usr/share/hhsuite/bin/ffindex_build pdb_full_hhm.ffdata pdb_full_hhm.ffindex hhm/
    LC_ALL=C sort pdb_full_hhm.ffindex > pdb_full_hhm.ffindex.simpleSort
    LC_ALL=C sort pdb_full_a3m.ffindex > pdb_full_a3m.ffindex.simpleSort
    mv pdb_full_a3m.ffindex pdb_full_a3m.ffindex.orig
    mv pdb_full_hhm.ffindex pdb_full_hhm.ffindex.orig
    ln -s pdb_full_a3m.ffindex.simpleSort pdb_full_a3m.ffindex
    ln -s pdb_full_hhm.ffindex.simpleSort pdb_full_hhm.ffindex
    export OMP_NUM_THREADS=$(nproc)
    /usr/share/hhsuite/bin/cstranslate  -A /usr/share/hhsuite/data/cs219.lib -D /usr/share/hhsuite/data/context_data.lib -x 0.3 -c 4 -f -i pdb_full_a3m -o pdb_full_cs219 -I a3m -b
    # note: $dbDate is not defined in this snippet; set it beforehand (e.g. to a date stamp)
    tar -h --transform "s,^,pdb_full_$dbDate/," --show-transformed-names -cvzf pdb_full_$dbDate.tgz pdb_full_a3m.ffdata pdb_full_a3m.ffindex pdb_full_hhm.ffdata pdb_full_hhm.ffindex pdb_full_cs219.ffdata pdb_full_cs219.ffindex
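The sort / mv / ln lines in step 2 can be tried out without hh-suite installed. The following is only a minimal sketch on a mock index: `sort_ffindex` is a hypothetical helper name, the entry names are made up, and it assumes the `.ffindex` file is plain tab-separated text (name, offset, length). As far as I understand, the byte-order sort matters because entries are looked up by name in the index.

```shell
# Sketch: byte-order sort of an ffindex file, keeping the original as a backup.
# Assumption: the index is tab-separated text with columns name, offset, length.
sort_ffindex() {
    idx=$1
    LC_ALL=C sort "$idx" > "$idx.simpleSort"
    mv "$idx" "$idx.orig"
    # relative link target, so the database directory can be moved as a whole
    ln -s "$(basename "$idx").simpleSort" "$idx"
}

# Demo on a mock index (hypothetical entry names, not real PDB data)
demo_dir=$(mktemp -d)
printf '1xyz_A\t200\t100\n1abc_B\t0\t120\n1ABC_A\t120\t80\n' > "$demo_dir/demo_a3m.ffindex"
sort_ffindex "$demo_dir/demo_a3m.ffindex"

# Print the entry names in their new order
cut -f1 "$demo_dir/demo_a3m.ffindex"
```

With `LC_ALL=C`, uppercase letters sort before lowercase ones, so `1ABC_A` ends up ahead of `1abc_B`. Because the final `.ffindex` is a symlink, the `tar` call in the recipe uses `-h` to archive the file it points to rather than the link itself.
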
DaMaoShan commented 5 years ago

Thank you very much. I will try it as soon as possible.

DaMaoShan commented 5 years ago

Sorry to bother you again: could you tell me how many sequences you have in your protein FASTA file? Thank you in advance!

aschafu commented 5 years ago

Sorry, I had overlooked the emails.

> I am very sorry that could you tell me how many sequences do you have in your protein fasta file?

I am not sure which FASTA file you mean. But I guess you want to know how many sequences my database contains (in my setup, the number of files in the a3m and hhm directories)? That is on the order of 100k.

DaMaoShan commented 4 years ago

Sorry! Recently I had been focused on another project. Today I started trying again to build my own hh-suite database. My FASTA file is env_nr.gz from ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/. It contains on the order of 10^6 sequences.
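For a multi-sequence FASTA file like this one, step 1 of the recipe above (run hhblits on each sequence) first needs one query file per sequence. A rough sketch of that preparation, under assumptions: `split_fasta` and `print_hhblits_cmds` are hypothetical helper names, the `seq_000001.fasta` naming and the `a3m/` and `hhm/` output directories are illustrative, and the hhblits commands are only printed, not run. The flags used (`-i`, `-d`, `-oa3m`, `-ohhm`) are the standard hhblits input, database and output options; the search database name is a placeholder.

```shell
# Split a multi-FASTA file into one numbered file per sequence.
# $1 = input FASTA (already decompressed), $2 = output directory
split_fasta() {
    mkdir -p "$2"
    awk -v out="$2" '
        /^>/ {
            if (f) close(f)          # avoid running out of open file handles
            n++
            f = sprintf("%s/seq_%06d.fasta", out, n)
        }
        f { print > f }              # lines before the first header are skipped
    ' "$1"
}

# Print (not run) one hhblits command per query file.
# $1 = directory with query FASTA files, $2 = basename of the search database
print_hhblits_cmds() {
    for fa in "$1"/*.fasta; do
        id=$(basename "$fa" .fasta)
        printf '%s\n' "hhblits -i $fa -d $2 -oa3m a3m/$id.a3m -ohhm hhm/$id.hhm"
    done
}

# Demo on a tiny made-up input
demo=$(mktemp -d)
printf '>seqA some description\nMKVLAA\nGGT\n>seqB\nMSTNP\n' > "$demo/in.fasta"
split_fasta "$demo/in.fasta" "$demo/queries"
print_hhblits_cmds "$demo/queries" my_search_db
```

For the real file, the input would be something like the output of `gzip -dc env_nr.gz`, and the printed commands could be handed to whatever job scheduler is available. Numbered file names are used here so that unusual characters in FASTA headers never end up in file names.
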

GodPCWANG commented 1 year ago

Hello, did you solve this issue?