soedinglab / hh-suite

Remote protein homology detection suite.
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3019-7
GNU General Public License v3.0

custom database creation error on cstranslate step #204

Open nick-youngblut opened 4 years ago

nick-youngblut commented 4 years ago

Expected Behavior

Custom database created for dbCAN v8.

Current Behavior

Error during the cstranslate step.

Steps to Reproduce (for bugs)

# creating custom dbCAN hhsuite database
## download MSA from http://bcb.unl.edu/dbCAN2/download/ (and uncompress)
wget http://bcb.unl.edu/dbCAN2/download/dbCAN-fam-aln-V8.tar.gz
tar -pzxvf dbCAN-fam-aln-V8.tar.gz
## build from MSAs
cd dbCAN-fam-aln-V8
ffindex_build -s ../dbCAN-fam-aln-V8.ff{data,index} .
cd ../
## consensus
ffindex_apply dbCAN-fam-aln-V8.ffdata dbCAN-fam-aln-V8.ffindex -i dbCAN-fam-aln-V8_a3m.ffindex -d dbCAN-fam-aln-V8_a3m.ffdata -- hhconsensus -M 50 -maxres 65535 -i stdin -oa3m stdout -v 0
## hmm 
ffindex_apply dbCAN-fam-aln-V8_a3m.ff{data,index} -i dbCAN-fam-aln-V8_hhm.ffindex -d dbCAN-fam-aln-V8_hhm.ffdata -- hhmake -i stdin -o stdout -v 0
## context states
cstranslate -x 0.3 -c 4 -I a3m -i dbCAN-fam-aln-V8_a3m -o dbCAN-fam-aln-V8_cs219 

HH-suite Output (for bugs)

If using cstranslate -x 0.3 -c 4 -I a3m -i dbCAN-fam-aln-V8_a3m -o dbCAN-fam-aln-V8_cs219:

Reading context library for pseudocounts from internal ...
Reading abstract state alphabet from internal ...

ERROR: Unable to read input file 'dbCAN-fam-aln-V8_a3m'!

If using cstranslate -x 0.3 -c 4 -I a3m -i dbCAN-fam-aln-V8_a3m.ffdata -o dbCAN-fam-aln-V8_cs219:

Reading context library for pseudocounts from internal ...
Reading abstract state alphabet from internal ...

ERROR: Sequence 468 has 181 match columns but should have 613!

Your Environment

Ubuntu 18.04.4

# conda env
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       0_gnu    conda-forge
bzip2                     1.0.8                h516909a_2    conda-forge
ca-certificates           2020.6.20            hecda079_0    conda-forge
certifi                   2020.6.20        py37hc8dfbb8_0    conda-forge
curl                      7.69.1               h33f0ec9_0    conda-forge
fqtools                   2.0                  hc0aa232_5    bioconda
hhsuite                   3.2.0           py37pl526h3340039_1    bioconda
htslib                    1.9                  h4da6232_3    bioconda
krb5                      1.17.1               h2fd8d38_0    conda-forge
ld_impl_linux-64          2.34                 h53a641e_5    conda-forge
libcurl                   7.69.1               hf7181ac_0    conda-forge
libdeflate                1.6                  h516909a_0    conda-forge
libedit                   3.1.20191231         h46ee950_0    conda-forge
libffi                    3.2.1             he1b5a44_1007    conda-forge
libgcc-ng                 9.2.0                h24d8f2e_2    conda-forge
libgomp                   9.2.0                h24d8f2e_2    conda-forge
libssh2                   1.9.0                hab1572f_2    conda-forge
libstdcxx-ng              9.2.0                hdf63c60_2    conda-forge
llvm-openmp               8.0.1                hc9558a2_0    conda-forge
ncurses                   6.1               hf484d3e_1002    conda-forge
openmp                    8.0.1                         0    conda-forge
openssl                   1.1.1g               h516909a_0    conda-forge
perl                      5.26.2            h516909a_1006    conda-forge
pip                       20.1.1                     py_1    conda-forge
python                    3.7.6           cpython_h8356626_6    conda-forge
python_abi                3.7                     1_cp37m    conda-forge
readline                  8.0                  hf8c457e_0    conda-forge
seqkit                    0.12.1                        0    bioconda
setuptools                47.3.1           py37hc8dfbb8_0    conda-forge
sqlite                    3.30.1               hcee41ef_0    conda-forge
taxonkit                  0.5.0                         0    bioconda
tk                        8.6.10               hed695b0_0    conda-forge
wheel                     0.34.2                     py_1    conda-forge
xz                        5.2.5                h516909a_0    conda-forge
zlib                      1.2.11            h516909a_1006    conda-forge

milot-mirdita commented 4 years ago

Ah, I've been meaning to build a database from dbCAN for a while, thanks for the reminder.

I tried to reproduce building the database and it works correctly with the *_mpi binaries.

Something like this works for me:

DB=dbCAN-fam-V8
wget http://bcb.unl.edu/dbCAN2/download/dbCAN-fam-aln-V8.tar.gz
tar xzvf dbCAN-fam-aln-V8.tar.gz
cd dbCAN-fam-aln;
ffindex_build -s ../${DB}_msa.ff{data,index} .
cd ..
sed 's|\.aln||g' ${DB}_msa.ffindex > ${DB}_msa_renamed.ffindex
mv ${DB}_msa_renamed.ffindex ${DB}_msa.ffindex
mpirun -np 16 ffindex_apply_mpi ${DB}_msa.ffdata ${DB}_msa.ffindex -i ${DB}_a3m.ffindex -d ${DB}_a3m.ffdata -- hhconsensus -M 50 -maxres 65535 -i stdin -oa3m stdout -v 0
mpirun -np 16 ffindex_apply_mpi ${DB}_a3m.ff{data,index} -i ${DB}_hhm.ffindex -d ${DB}_hhm.ffdata -- hhmake -i stdin -o stdout -v 0
mpirun -np 16 cstranslate_mpi -x 0.3 -c 4 -I a3m -i ${DB}_a3m -o ${DB}_cs219
# reorder according to cs219 for better access patterns
sort -k 3 -n ${DB}_cs219.ffindex | cut -f1 > ${DB}.list
for type in a3m hhm; do
    ffindex_order ${DB}.list ${DB}_${type}.ffdata ${DB}_${type}.ffindex ${DB}_${type}_opt.ffdata ${DB}_${type}_opt.ffindex
    mv -f ${DB}_${type}_opt.ffdata ${DB}_${type}.ffdata
    mv -f ${DB}_${type}_opt.ffindex ${DB}_${type}.ffindex
done
md5deep ${DB}_{a3m,hhm,cs219}.ff{data,index} > ${DB}.md5sum
tar czvf ${DB}.tar.gz ${DB}_{a3m,hhm,cs219}.ff{data,index} ${DB}.md5sum
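
After a build like the one above, a quick sanity check is to confirm that the a3m, hhm and cs219 indices list the same number of entries. The following is a minimal sketch, not part of HH-suite: `check_ffindex_counts` is a hypothetical helper name, and it assumes only that each ffindex line is one tab-separated entry (name, offset, length), as the `sort -k 3` / `cut -f1` step above implies.

```shell
# Hypothetical helper (not shipped with HH-suite): compare entry counts across
# the three ffindex files that share a database prefix.
# Assumes one entry per line: name<TAB>offset<TAB>length.
check_ffindex_counts() {
    db="$1"
    ref=""
    for type in a3m hhm cs219; do
        idx="${db}_${type}.ffindex"
        if [ ! -f "$idx" ]; then
            echo "missing: $idx" >&2
            return 1
        fi
        # Number of entries = number of lines in the index
        n=$(wc -l < "$idx" | tr -d ' ')
        echo "${type}: ${n} entries"
        if [ -z "$ref" ]; then
            ref="$n"
        elif [ "$n" != "$ref" ]; then
            echo "entry count mismatch in $idx" >&2
            return 1
        fi
    done
}
```

Calling `check_ffindex_counts "${DB}"` before the `tar` step would return non-zero if one of the three indices is missing or has a different number of entries than the others.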

I took the liberty to build this database and put it on our file server: http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/dbCAN-fam-V8.tar.gz

I would recommend searching through it with HHsearch instead of HHblits, though. Due to its small size, HHsearch can still easily handle it and will be more sensitive.
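
For illustration, a search against the extracted database might look like the sketch below. The query and output file names are placeholders, not files from this thread, and the database prefix assumes the tarball was unpacked in the current directory. The snippet only runs `hhsearch` if it is on the PATH and the query exists; otherwise it prints the command it would have run.

```shell
DB=dbCAN-fam-V8        # database prefix, i.e. ${DB}_a3m.ffdata etc. after extraction
QUERY=query.a3m        # hypothetical query alignment or sequence file
OUT=query_dbCAN.hhr    # hypothetical name for the result file

if command -v hhsearch >/dev/null 2>&1 && [ -f "$QUERY" ]; then
    # -i: query, -d: database prefix, -o: result file
    hhsearch -i "$QUERY" -d "$DB" -o "$OUT"
else
    echo "would run: hhsearch -i $QUERY -d $DB -o $OUT"
fi
```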

gancao commented 4 years ago

Hello, how did you get the *_mpi binaries? The documentation doesn't describe how to install hh-suite with MPI support. Could you please tell me how to do it? Thanks! I also ran into this problem:

Reading context library for pseudocounts from context_data.lib ...
Reading abstract state alphabet from cs219.lib ...

ERROR: Sequence 1 has 764 match columns but should have 2021!

milot-mirdita commented 4 years ago

I added a section to the wiki: https://github.com/soedinglab/hh-suite/wiki#mpi-support

I think you were missing the -f (or --ffindex) flag of cstranslate, which switches it from single-file mode to reading an ffindex database. That might be what caused the error message.
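
Applied to the non-MPI command from the original report, that suggestion would look roughly like the sketch below. The only change is the added `-f`; whether it resolves the error in every setup is not confirmed in this thread. The snippet runs `cstranslate` only if it is installed and the a3m database is present; otherwise it prints the command.

```shell
DB=dbCAN-fam-aln-V8    # database prefix used in the original report

# -f / --ffindex: read ${DB}_a3m.ffdata and ${DB}_a3m.ffindex as a database
# instead of treating the -i argument as a single alignment file.
CMD="cstranslate -f -x 0.3 -c 4 -I a3m -i ${DB}_a3m -o ${DB}_cs219"

if command -v cstranslate >/dev/null 2>&1 && [ -f "${DB}_a3m.ffdata" ]; then
    $CMD
else
    echo "would run: $CMD"
fi
```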

milot-mirdita commented 3 years ago

I made a new DB for V9: http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/dbCAN-fam-V9.tar.gz

The dbCAN team thankfully provided the raw alignments for the new release.