Closed rpaul5 closed 2 years ago
Hi there,
Thank you for using our tool. I just downloaded this db and built it with diamond, and it works without any error. Which diamond version are you using? Mine is currently 2.0.13. Please activate your rundbcan conda environment and run this command:
diamond --version
diamond version 2.0.13
It's not to do with DIAMOND; you need to delete all content in the db directory (presuming you performed a git clone of the repository) because there is already a CAZyDB.09242021.fa there and it doesn't contain FASTA data. Instead, CAZyDB.09242021.fa contains:
version https://git-lfs.github.com/spec/v1
oid sha256:73542830702ff7d2aed13e61160d519dd4a912ea2ee3ff2c50efea3d92c5077a
size 1064380410
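A quick way to confirm this before running DIAMOND is to inspect the first line of the file. The following is a minimal sketch in POSIX shell; the function names is_lfs_pointer and looks_like_fasta are hypothetical helpers, not part of dbCAN or DIAMOND.

```shell
#!/bin/sh
# Hypothetical helpers (not part of dbCAN): inspect only the first line of a file.

# Succeeds if the file starts with the Git LFS pointer header.
is_lfs_pointer() {
    case "$(head -n 1 "$1")" in
        "version https://git-lfs.github.com/spec/v1"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeeds if the first line starts with '>', as a FASTA record must.
looks_like_fasta() {
    case "$(head -n 1 "$1")" in
        ">"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example check, run only if the file is present in the current directory.
if [ -f CAZyDB.09242021.fa ]; then
    if is_lfs_pointer CAZyDB.09242021.fa; then
        echo "CAZyDB.09242021.fa is a Git LFS pointer, not FASTA data"
    elif looks_like_fasta CAZyDB.09242021.fa; then
        echo "CAZyDB.09242021.fa looks like FASTA"
    fi
fi
```

If the first branch fires, the "Missing '>' at record start" error from diamond makedb is expected, since the pointer file begins with "version" rather than ">".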
Because wget does not (by default) overwrite existing files, the command:
wget http://bcb.unl.edu/dbCAN2/download/CAZyDB.09242021.fa && diamond makedb --in CAZyDB.09242021.fa -d CAZy
creates a file called CAZyDB.09242021.fa.1, not CAZyDB.09242021.fa, because the file CAZyDB.09242021.fa already exists in the db directory. But the second stage of the command:
diamond makedb --in CAZyDB.09242021.fa -d CAZy
instructs DIAMOND to use the existing CAZyDB.09242021.fa (which doesn't contain FASTA data) instead of the freshly downloaded CAZyDB.09242021.fa.1 file (which does contain FASTA data).
Therefore, the easiest fix is to delete all content in the db directory; otherwise an error will be raised every time DIAMOND and HMMER compile a database.
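If the download has already run and left a .1 file behind, an alternative to emptying the directory is to move the fresh download over the pointer stub. This is only a sketch: repair_download is a hypothetical helper, and it assumes the stub is a Git LFS pointer and that wget saved the real data as "<name>.1".

```shell
#!/bin/sh
# Hypothetical helper: replace a Git LFS pointer stub with the real file
# that wget saved alongside it as "<name>.1".
repair_download() {
    target=$1
    fresh="$target.1"
    if [ ! -f "$fresh" ]; then
        echo "no $fresh to recover" >&2
        return 1
    fi
    case "$(head -n 1 "$target" 2>/dev/null)" in
        "version https://git-lfs.github.com/spec/v1"*)
            # The existing file is only a pointer, so it is safe to replace.
            mv -f "$fresh" "$target"
            ;;
        *)
            echo "$target is not an LFS pointer; leaving it alone" >&2
            return 1
            ;;
    esac
}

# Intended use inside the db directory (commented out so nothing runs on load):
#   repair_download CAZyDB.09242021.fa && diamond makedb --in CAZyDB.09242021.fa -d CAZy
```

For a fresh download, passing -O CAZyDB.09242021.fa to wget should also avoid the .1 suffix, since -O writes to the named file regardless of whether it already exists.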
A suggestion for the devs: the ease of installing dbCAN could potentially be improved by including a single script (e.g. a bash script) or entry point that is packaged with dbCAN and handles installing all the dependencies. As dbCAN is updated, this script is updated too; it could be designed to handle updating the dependencies (including version conflicts), which would help mitigate clashes like this in the future and make installing dbCAN cleaner. It's obviously not necessary, but it helps polish off a tool, and fewer installation steps can sometimes encourage more users to run the tool locally (reducing demand on your servers).
gooey or wooey could help with packaging the tool to make distribution easier for end users.
Thanks a lot Emma. I will delete the db from the run_dbcan repo. I intended to let users download the DB via git-lfs so that they wouldn't need any extra effort, but it exceeds the upper limit. I will consider integrating the bash code into the dbcan CLI so that users don't need to run that bash code themselves.
I will also remind users not to git clone my repo.
You're welcome, least I can do seeing as dbCAN is so useful 😃
Are you compressing the files before using git-lfs? Compressing all the files (including the raw FASTA files and the database files) reduces the total size to 1.3GB. Supplying only the raw data files or only the db files would reduce the size further. Both cases should bring you well below the git-lfs storage limit. A bash script could retrieve the files and decompress them (with gunzip db/* or tar, for example), then run the commands for compiling the DIAMOND and HMMER dbs.
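The decompression step of such a script might look like the sketch below. It assumes the files are gzip-compressed individually and sit in a single directory; decompress_db is a hypothetical name, and the DIAMOND command in the trailing comment is the one from the install instructions quoted earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper: decompress every .gz file in the given directory in place.
decompress_db() {
    db_dir=$1
    for archive in "$db_dir"/*.gz; do
        # If the glob matched nothing, the literal pattern is returned; skip it.
        [ -e "$archive" ] || continue
        gunzip -f "$archive"
    done
}

# Intended use after retrieving the compressed files (commented out here):
#   decompress_db db
#   (cd db && diamond makedb --in CAZyDB.09242021.fa -d CAZy)
```

The HMMER databases would need their own compile step after decompression; that command is left out here because it depends on which HMM files are shipped.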
Using git-lfs could make it possible to download and install dbCAN with a single command, but you would run the risk of the datafiles used by the web-server becoming out of sync with those used by the standalone tool (I am presuming the web-server currently retrieves its datafiles from the same links provided to users for downloading the data files). Having separate copies of the datafiles for the web-server and the standalone version would certainly increase the maintenance workload. Best practice dictates that the web-server and standalone versions should access the same datafiles from the same location, but best practice isn't always practical or achievable when there's a deadline.
If you need or want to store the data files in one place (i.e. a single dataset that both the web version and the standalone version access), you can set up setup.py to run additional scripts that retrieve the datafiles from an external server and parse them as required. This can be done by adding installation options. Then you need only call setup.py followed by the name of the command you assigned to install the additional dependencies. So you could have something like
git clone repo
python3 setup.py compile_dbs
where compile_dbs is the name of the command that instructs setup.py to run your bash script(s) that download and/or compile the databases.
I don't know if there is a way to tell pip/PyPI to run the additional installation option when installing the tool; if there is, that could create a single installation command for dbCAN. If there isn't, it still reduces the number of commands to install dbCAN to 4:
These are merely suggestions but hope they help spark some ideas at least 😄
I want users to install from PyPI, since PyPI provides a more stable release and the repo version is not stable. I actually wanted to package our code for Bioconda, but that was unsuccessful.
We already put it into Bioconda. Many thanks!
Hello,
I have recently attempted to install. Whenever I reach step 4 I receive the error: Error reading input stream at line 1: FASTA format error: Missing '>' at record start. This error comes from the first wget command: wget http://bcb.unl.edu/dbCAN2/download/CAZyDB.09242021.fa && diamond makedb --in CAZyDB.09242021.fa -d CAZy. To work around this I downloaded a previous version, CAZyDB.07312020.fa (https://bcb.unl.edu/dbCAN2/download/CAZyDB.07312020.fa), and will be updating the database. This may help others who are attempting to install.