FASTA format error: Missing '>' at record start

rpaul5 commented 2 years ago

Hello,

I have recently attempted the installation. When I reach step 4, I receive the error: Error reading input stream at line 1: FASTA format error: Missing '>' at record start. The error comes from the first command, wget http://bcb.unl.edu/dbCAN2/download/CAZyDB.09242021.fa && diamond makedb --in CAZyDB.09242021.fa -d CAZy. To work around it, I downloaded a previous version, CAZyDB.07312020.fa (https://bcb.unl.edu/dbCAN2/download/CAZyDB.07312020.fa), and will be updating the database. This may help others who are attempting to install.

linnabrown commented 2 years ago

Hi there, thank you for using our tool. I just downloaded this db and built it with diamond, and it works without any error. I am curious which diamond version you are using; mine is currently 2.0.13. Please activate your rundbcan conda environment and run this command:

diamond --version
diamond version 2.0.13

HobnobMancer commented 2 years ago

It's not to do with DIAMOND. You need to delete all content in the db directory (presuming you performed the git clone of the repository), because it already contains a file named CAZyDB.09242021.fa that doesn't hold FASTA data. Instead, CAZyDB.09242021.fa contains:

version https://git-lfs.github.com/spec/v1
oid sha256:73542830702ff7d2aed13e61160d519dd4a912ea2ee3ff2c50efea3d92c5077a
size 1064380410
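
The three lines above are a Git LFS pointer rather than sequence data, which is why DIAMOND complains about a missing '>' on line 1. A minimal sketch of a pre-flight check follows; the function name classify_db_file and its return labels are made up for illustration and are not part of dbCAN or DIAMOND.

```python
# Signature line that opens every Git LFS pointer file (as shown in this thread).
LFS_SIGNATURE = "version https://git-lfs.github.com/spec/v1"


def classify_db_file(path):
    """Return 'lfs-pointer', 'fasta', or 'unknown' based on the first line only."""
    with open(path, "r", errors="replace") as handle:
        first_line = handle.readline().strip()
    if first_line.startswith(LFS_SIGNATURE):
        return "lfs-pointer"
    if first_line.startswith(">"):
        # FASTA records start with a '>' header line.
        return "fasta"
    return "unknown"
```

Calling classify_db_file("CAZyDB.09242021.fa") inside the db directory before running diamond makedb would flag the stale pointer file. It only inspects the first line, so it is a quick sanity check and not a FASTA validator.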

Because wget does not (by default) overwrite existing files, the command:

wget http://bcb.unl.edu/dbCAN2/download/CAZyDB.09242021.fa && diamond makedb --in CAZyDB.09242021.fa -d CAZy

creates a file called CAZyDB.09242021.fa.1, not CAZyDB.09242021.fa, because the file CAZyDB.09242021.fa already exists in the db directory. But the second stage of the command:

diamond makedb --in CAZyDB.09242021.fa -d CAZy

instructs DIAMOND to use the CAZyDB.09242021.fa (which doesn't contain FASTA data) instead of the freshly downloaded CAZyDB.09242021.fa.1 file (which does contain FASTA data).

Therefore, the easiest fix is to delete all content in the db directory; otherwise an error will be raised every time DIAMOND and HMMER compile a database.
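
As an alternative to clearing the directory by hand, the download step could write to an explicit target path so that a stale file is replaced instead of sidestepped with a .1 suffix. The sketch below uses only the Python standard library; download_overwriting is a hypothetical helper, not something dbCAN ships.

```python
import shutil
import urllib.request
from pathlib import Path


def download_overwriting(url, dest):
    """Download url to dest, replacing any file that already exists at dest."""
    dest = Path(dest)
    # Write to a temporary sibling first so a failed download never
    # leaves a half-written file under the final name.
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(dest)  # overwrites dest if it is already there
    return dest
```

For example, download_overwriting("http://bcb.unl.edu/dbCAN2/download/CAZyDB.09242021.fa", "CAZyDB.09242021.fa") would leave the real FASTA data under the name that the diamond makedb command expects. On the command line, wget's -O option similarly writes to a named output file.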

A suggestion for the devs: installing dbCAN could potentially be made easier by including a single script (e.g. a bash script) or entry point that is packaged with dbCAN and handles installing all the dependencies. As dbCAN is updated, this script would be updated too, and it could be designed to handle updating the dependencies (including version conflicts), help mitigate clashes like this in the future, and make installing dbCAN cleaner. It's obviously not a necessary thing, but it helps polish a tool, and fewer installation steps can sometimes encourage more users to run the tool locally (reducing demand on your servers). gooey or wooey could help with packaging the tool to make distribution easier for end users.

linnabrown commented 2 years ago

Thanks a lot, Emma. I will delete the db from the run_dbcan repo. I intended to distribute the DB through git-lfs so that users wouldn't need to make any extra effort, but it exceeds the upper limit. I will consider integrating the bash code into the dbcan CLI so that users don't need to run that bash code themselves.

I will also remind users not to git clone my repo.

HobnobMancer commented 2 years ago

You're welcome, it's the least I can do seeing as dbCAN is so useful 😃

Are you compressing the files before using git-lfs? Compressing all the files (including the raw FASTA files and the database files) reduces the total size to 1.3GB. Supplying only the raw data files or only the db files would reduce the size further. Both cases should bring you well below the git-lfs storage limit. A bash script could retrieve the files and decompress them (with gunzip db/* or tar, for example), then run the commands for compiling the DIAMOND and HMMER dbs.
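
The decompression step could also be done from Python if the installer ends up as a Python entry point. This is a rough sketch assuming the files are individually gzip-compressed with a .gz suffix; gunzip_dir is a hypothetical name. Unlike the gunzip command, it leaves the .gz originals in place.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_dir(db_dir):
    """Decompress every *.gz file in db_dir; return the paths that were written."""
    written = []
    for archive in sorted(Path(db_dir).glob("*.gz")):
        target = archive.with_suffix("")  # drops only the trailing ".gz"
        with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(target)
    return written
```

After gunzip_dir("db") the uncompressed FASTA and HMM files sit next to their archives, ready for the DIAMOND and HMMER compile commands.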

Using git-lfs could make it possible to download and install dbCAN with a single command, but you would run the risk of the data files used by the web server becoming out of sync with those used by the standalone tool (I am presuming the web server currently retrieves its data files from the same links provided to users for downloading them). Having separate copies of the data files for the web server and the standalone version would certainly increase the maintenance workload. Best practice dictates that the web server and standalone versions should access the same data files from the same location, but best practice isn't always practical or achievable when there's a deadline.

If you need or want to store the data files in one place (i.e. a single dataset that both the web version and the standalone version access), you can set up setup.py to run additional scripts that retrieve the data files from an external server and parse them as required. This can be done by adding installation options. Then you need only call setup.py followed by the name of the command you assigned to installing the additional dependencies. So you could have something like:

git clone repo
python3 setup.py compile_dbs

where compile_dbs is the name of the command that instructs setup.py to run your bash script(s) that download and/or compile the databases.

I don't know if there is a way to tell pip/PyPI to run the additional installation option when installing the tool; if there is, that could create a single installation command for dbCAN. If there isn't, it still reduces the number of commands to install dbCAN to 4:

create the venv + install diamond, hmmer etc
activate the venv
clone the repo
install dependencies using setup.py

These are merely suggestions, but I hope they help spark some ideas at least 😄

linnabrown commented 2 years ago

I want users to install from PyPI, since PyPI provides a more stable release and the repo version is not stable. I actually wanted to package our code for Bioconda, but that was unsuccessful.

linnabrown commented 2 years ago

We already put it into Bioconda. Many thanks!

linnabrown / run_dbcan

FASTA format error: Missing '>' at record start #90