jolespin / veba

A modular end-to-end suite for in silico recovery, clustering, and analysis of prokaryotic, microeukaryotic, and viral genomes from metagenomes
GNU Affero General Public License v3.0

[Question] CHECKVDB environment variable NOT set #28

Closed javiercnav closed 11 months ago

javiercnav commented 1 year ago

Please confirm that you've checked the FAQ section: https://github.com/jolespin/veba/blob/main/FAQ.md
Checked
If you still have a question, feel free to ask here.

Hello there, I have installed VEBA and run the check_installation.sh script. The output shows that almost every check passes; the only failure is the CHECKVDB environment variable. See this:

./check_installation.sh
[Pass] VEBA-annotate_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-annotate_env
[Pass] VEBA-assembly_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-assembly_env
[Pass] VEBA-binning-eukaryotic_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-binning-eukaryotic_env
[Pass] VEBA-binning-prokaryotic_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-binning-prokaryotic_env
[Pass] VEBA-binning-viral_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-binning-viral_env
[Pass] VEBA-biosynthetic_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-biosynthetic_env
[Pass] VEBA-classify_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-classify_env
[Pass] VEBA-cluster_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-cluster_env
[Pass] VEBA-database_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-database_env
[Pass] VEBA-mapping_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-mapping_env
[Pass] VEBA-phylogeny_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-phylogeny_env
[Pass] VEBA-preprocess_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-preprocess_env
[Pass] CHECKM2DB environment SUCCESSFULLY set in VEBA-binning-prokaryotic_env
[Pass] GTDBTK_DATA_PATH environment SUCCESSFULLY set in VEBA-classify_env
[Fail] CHECKVDB environment variable NOT set in VEBA-binning-viral_env

Is there anything I can do to make sure the CHECKVDB environment variable is set correctly?

jolespin commented 12 months ago

This is surprising but thanks for bringing it to my attention.

Can you confirm the CheckV database downloaded here?

du -sh ${VEBA_DATABASE}/Classify/CheckV/*
4.3G    /expanse/projects/jcl110/db/veba/VDB_v5.1/Classify/CheckV/genome_db
2.2G    /expanse/projects/jcl110/db/veba/VDB_v5.1/Classify/CheckV/hmm_db
12K     /expanse/projects/jcl110/db/veba/VDB_v5.1/Classify/CheckV/README.txt

javiercnav commented 11 months ago

Hello there,

Here is my output:

du -sh ${VEBA_DATABASE}/Classify/CheckV/*
3.4G    /projects/navarro_lab/databases/veba_db/Classify/CheckV/genome_db
688M    /projects/navarro_lab/databases/veba_db/Classify/CheckV/hmm_db
4.0K    /projects/navarro_lab/databases/veba_db/Classify/CheckV/README.txt
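Besides comparing sizes, a quick structural check can confirm that the directory contains the entries that appear in both `du` listings. The sketch below assumes that `genome_db` and `hmm_db` are the subdirectories that matter, which is inferred only from the listings in this thread; `check_checkv_db` is a hypothetical helper and not part of VEBA or CheckV.

```shell
#!/usr/bin/env bash
# Sketch: confirm that a CheckV database directory contains the
# subdirectories seen in the `du` listings (genome_db and hmm_db).
# check_checkv_db is a hypothetical helper, not part of VEBA or CheckV.
check_checkv_db() {
    local db_dir="$1"
    local status=0
    local sub
    for sub in genome_db hmm_db; do
        if [ -d "${db_dir}/${sub}" ]; then
            echo "[Pass] ${sub} found in ${db_dir}"
        else
            echo "[Fail] ${sub} missing from ${db_dir}"
            status=1
        fi
    done
    return "${status}"
}

# Example call (path layout taken from the thread):
# check_checkv_db "${VEBA_DATABASE}/Classify/CheckV"
```

A non-zero return status means at least one expected subdirectory is absent, so the helper can also be used in a conditional.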

jolespin commented 11 months ago

Interesting, the file sizes are different, but that could be due to some backup snapshots in my directory.

Can you run a test with CheckV using the database you downloaded?

If it works, then you should be able to create these files:

[path/to/conda]/envs/VEBA-binning-viral_env/etc/conda/activate.d/veba.sh

Contents would be this:

export VEBA_DATABASE=/projects/navarro_lab/databases/veba_db/
export CHECKVDB=/projects/navarro_lab/databases/veba_db/Classify/CheckV

and then this one:

[path/to/conda]/envs/VEBA-binning-viral_env/etc/conda/deactivate.d/veba.sh

Contents would be this:

unset VEBA_DATABASE
unset CHECKVDB
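Conda sources every script in an environment's `etc/conda/activate.d/` when the environment is activated, and every script in `etc/conda/deactivate.d/` when it is deactivated, which is why these two files are enough to set and clear the variables. A minimal sketch of creating both files in one step follows; `write_veba_hooks` is a hypothetical helper (not part of VEBA), and the environment prefix and database path are whatever applies on your system.

```shell
#!/usr/bin/env bash
# Sketch: write the activate/deactivate hooks described above.
# write_veba_hooks is a hypothetical helper, not part of VEBA.
write_veba_hooks() {
    local env_prefix="$1"      # e.g. [path/to/conda]/envs/VEBA-binning-viral_env
    local veba_db="${2%/}"     # database root; a trailing slash is dropped
    local activate_dir="${env_prefix}/etc/conda/activate.d"
    local deactivate_dir="${env_prefix}/etc/conda/deactivate.d"

    mkdir -p "${activate_dir}" "${deactivate_dir}"

    # Sourced by conda when the environment is activated.
    printf 'export VEBA_DATABASE=%s\nexport CHECKVDB=%s/Classify/CheckV\n' \
        "${veba_db}" "${veba_db}" > "${activate_dir}/veba.sh"

    # Sourced by conda when the environment is deactivated.
    printf 'unset VEBA_DATABASE\nunset CHECKVDB\n' > "${deactivate_dir}/veba.sh"
}

# Example call (paths are placeholders):
# write_veba_hooks "[path/to/conda]/envs/VEBA-binning-viral_env" "/projects/navarro_lab/databases/veba_db/"
```

The variables take effect the next time the environment is activated; an already active shell needs to deactivate and reactivate.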

Here's what my files look like:

(base) [jespinoz@login02 activate.d]$ cat /expanse/projects/jcl110/anaconda3/envs/VEBA-binning-viral_env/etc/conda/activate.d/veba.sh
export VEBA_DATABASE=/expanse/projects/jcl110/db/veba/VDB_v5.1
export CHECKVDB=/expanse/projects/jcl110/db/veba/VDB_v5.1/Classify/CheckV

(base) [jespinoz@login02 activate.d]$ cat /expanse/projects/jcl110/anaconda3/envs/VEBA-binning-viral_env/etc/conda/deactivate.d/veba.sh
unset VEBA_DATABASE
unset CHECKVDB

Luckily VEBA-binning-viral_env is one of the easier environments to configure so if you have to set it up or download those dbs again, it should be real quick!

Let me know if you get it to work or need any more help getting it set up!

javiercnav commented 11 months ago

Hello, I ran the checkv test with the database I downloaded and it works:

checkv end_to_end final-viral-combined.fasta checkv -d /projects/navarro_lab/databases/veba_db/Classify/CheckV/

CheckV v1.0.1: contamination
[1/8] Reading database info...
[2/8] Reading genome info...
[3/8] Calling genes with Prodigal...
[4/8] Reading gene info...
[5/8] Running hmmsearch...
[6/8] Annotating genes...
[7/8] Identifying host regions...
[8/8] Writing results...
Run time: 632.45 seconds
Peak mem: 0.28 GB

CheckV v1.0.1: completeness
[1/8] Skipping gene calling...
[2/8] Initializing queries and database...
[3/8] Running DIAMOND blastp search...
[4/8] Computing AAI...
[5/8] Running AAI based completeness estimation...
[6/8] Running HMM based completeness estimation...
[7/8] Determining genome copy number...
[8/8] Writing results...
Run time: 88.22 seconds
Peak mem: 1.85 GB

CheckV v1.0.1: complete_genomes
[1/7] Reading input sequences...
[2/7] Finding complete proviruses...
[3/7] Finding direct/inverted terminal repeats...
[4/7] Filtering terminal repeats...
[5/7] Checking genome for completeness...
[6/7] Checking genome for large duplications...
[7/7] Writing results...
Run time: 0.13 seconds
Peak mem: 1.85 GB

CheckV v1.0.1: quality_summary
[1/6] Reading input sequences...
[2/6] Reading results from contamination module...
[3/6] Reading results from completeness module...
[4/6] Reading results from complete genomes module...
[5/6] Classifying contigs into quality tiers...
[6/6] Writing results...
Run time: 0.03 seconds
Peak mem: 1.85 GB

After that, I created the veba.sh files as suggested. Here are their contents:

cat /projects/navarro_lab/envs/VEBA-binning-viral_env/etc/conda/activate.d/veba.sh
export VEBA_DATABASE=/projects/navarro_lab/databases/veba_db/
export CHECKVDB=/projects/navarro_lab/databases/veba_db/Classify/CheckV

cat /projects/navarro_lab/envs/VEBA-binning-viral_env/etc/conda/deactivate.d/veba.sh
unset VEBA_DATABASE
unset CHECKVDB
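Before re-running the full installation check, a hook file can be tested in isolation by sourcing it in a subshell and asking whether the variable ends up exported. This is only a sketch of that idea, not how check_installation.sh works internally (its implementation is not shown in this thread); `hook_sets_var` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: report whether sourcing a hook file exports a given variable.
# hook_sets_var is a hypothetical helper, not part of VEBA.
hook_sets_var() {
    local hook_file="$1"
    local var_name="$2"
    [ -f "${hook_file}" ] || return 1
    # The subshell keeps the calling shell's environment untouched.
    (
        unset "${var_name}"
        . "${hook_file}" || exit 1
        # printenv succeeds only if the variable is exported.
        printenv "${var_name}" >/dev/null
    )
}

# Example call (path taken from the thread):
# hook_sets_var /projects/navarro_lab/envs/VEBA-binning-viral_env/etc/conda/activate.d/veba.sh CHECKVDB \
#     && echo "[Pass] CHECKVDB set" || echo "[Fail] CHECKVDB not set"
```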

Once that was done, I re-ran ./check_installation.sh, and it seems that the VEBA_DATABASE environment variable is now reported as not set in several other environments:

(base) [jc4675@wind .../install ]$ ./check_installation.sh
[Pass] VEBA-annotate_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-annotate_env
[Pass] VEBA-assembly_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-assembly_env
[Pass] VEBA-binning-eukaryotic_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-binning-eukaryotic_env
[Pass] VEBA-binning-prokaryotic_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-binning-prokaryotic_env
[Pass] VEBA-binning-viral_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-binning-viral_env
[Pass] VEBA-biosynthetic_env SUCCESSFULLY created.
[Fail] VEBA_DATABASE environment variable NOT set in VEBA-biosynthetic_env
[Pass] VEBA-classify_env SUCCESSFULLY created.
[Fail] VEBA_DATABASE environment variable NOT set in VEBA-classify_env
[Pass] VEBA-cluster_env SUCCESSFULLY created.
[Fail] VEBA_DATABASE environment variable NOT set in VEBA-cluster_env
[Pass] VEBA-database_env SUCCESSFULLY created.
[Fail] VEBA_DATABASE environment variable NOT set in VEBA-database_env
[Pass] VEBA-mapping_env SUCCESSFULLY created.
[Fail] VEBA_DATABASE environment variable NOT set in VEBA-mapping_env
[Pass] VEBA-phylogeny_env SUCCESSFULLY created.
[Fail] VEBA_DATABASE environment variable NOT set in VEBA-phylogeny_env
[Pass] VEBA-preprocess_env SUCCESSFULLY created.
[Fail] VEBA_DATABASE environment variable NOT set in VEBA-preprocess_env
[Pass] CHECKM2DB environment SUCCESSFULLY set in VEBA-binning-prokaryotic_env
[Pass] GTDBTK_DATA_PATH environment SUCCESSFULLY set in VEBA-classify_env
[Pass] CHECKVDB environment SUCCESSFULLY set in VEBA-binning-viral_env

jolespin commented 11 months ago

Let's see what this looks like:

cat /projects/navarro_lab/envs/VEBA-preprocess_env/etc/conda/activate.d/veba.sh
cat /projects/navarro_lab/envs/VEBA-preprocess_env/etc/conda/deactivate.d/veba.sh

This should do the trick:

bash veba/update_environment_variables.sh /projects/navarro_lab/databases/veba_db/

Here, veba/ is the repository you downloaded.
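To inspect every environment at once rather than one file at a time, a short loop can list the VEBA environments that have no activation hook. This is a sketch that assumes all environments live in a single directory and are named `VEBA-*`, as in the paths above; `list_missing_hooks` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: print the name of each VEBA-* environment that lacks
# etc/conda/activate.d/veba.sh.
# list_missing_hooks is a hypothetical helper, not part of VEBA.
list_missing_hooks() {
    local envs_dir="$1"
    local env_path
    for env_path in "${envs_dir}"/VEBA-*; do
        # If nothing matches, the pattern stays literal; skip it.
        [ -d "${env_path}" ] || continue
        if [ ! -f "${env_path}/etc/conda/activate.d/veba.sh" ]; then
            basename "${env_path}"
        fi
    done
}

# Example call (path taken from the thread):
# list_missing_hooks /projects/navarro_lab/envs
```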

javiercnav commented 11 months ago

It seems that I need to manually create the veba.sh files for each of the environments whose variables are not set. See the outputs:

$ cat /projects/navarro_lab/envs/VEBA-preprocess_env/etc/conda/activate.d/veba.sh
cat: /projects/navarro_lab/envs/VEBA-preprocess_env/etc/conda/activate.d/veba.sh: No such file or directory

$ cat /projects/navarro_lab/envs/VEBA-preprocess_env/etc/conda/deactivate.d/veba.sh
cat: /projects/navarro_lab/envs/VEBA-preprocess_env/etc/conda/deactivate.d/veba.sh: No such file or directory

Still, I ran:

bash update_environment_variables.sh /projects/navarro_lab/databases/veba_db/

. .. ... ..... ........ .............
i * Adding the following environment variable to VEBA environments: export VEBA_DATABASE=/projects/navarro_lab/databases/veba_db
/packages/mambaforge/23.3.1/envs/VEBA-*
mkdir: cannot create directory ‘/packages/mambaforge/23.3.1/envs/VEBA-*’: Permission denied
mkdir: cannot create directory ‘/packages/mambaforge/23.3.1/envs/VEBA-*’: Permission denied
update_environment_variables.sh: line 18: /packages/mambaforge/23.3.1/envs/VEBA-*/etc/conda/activate.d/veba.sh: No such file or directory
update_environment_variables.sh: line 19: /packages/mambaforge/23.3.1/envs/VEBA-*/etc/conda/deactivate.d/veba.sh: No such file or directory
. .. ... ..... ........ .............
xiii * Adding the following environment variable to VEBA environments: export CHECKM2DB=/projects/navarro_lab/databases/veba_db/Classify/CheckM2/uniref100.KO.1.dmnd
update_environment_variables.sh: line 28: /packages/mambaforge/23.3.1/envs/VEBA-binning-prokaryotic_env/etc/conda/activate.d/veba.sh: No such file or directory
update_environment_variables.sh: line 29: /packages/mambaforge/23.3.1/envs/VEBA-binning-prokaryotic_env/etc/conda/deactivate.d/veba.sh: No such file or directory
. .. ... ..... ........ .............
xiv * Adding the following environment variable to VEBA environments: export GTDBTK_DATA_PATH=/projects/navarro_lab/databases/veba_db/Classify/GTDB/
update_environment_variables.sh: line 38: /packages/mambaforge/23.3.1/envs/VEBA-classify_env/etc/conda/activate.d/veba.sh: No such file or directory
update_environment_variables.sh: line 39: /packages/mambaforge/23.3.1/envs/VEBA-classify_env/etc/conda/deactivate.d/veba.sh: No such file or directory
. .. ... ..... ........ .............
xv * Adding the following environment variable to VEBA environments: export CHECKVDB=/projects/navarro_lab/databases/veba_db/Classify/CheckV/
update_environment_variables.sh: line 47: /packages/mambaforge/23.3.1/envs/VEBA-binning-viral_env/etc/conda/activate.d/veba.sh: No such file or directory
update_environment_variables.sh: line 48: /packages/mambaforge/23.3.1/envs/VEBA-binning-viral_env/etc/conda/deactivate.d/veba.sh: No such file or directory
 _    _ _______ ______  _______
  \  /  |______ |_____] |_____|
   \/   |______ |_____] |     |
.........................................
  Environment Variable Update Complete
.........................................
The VEBA database environment variable is set in your VEBA conda environments:
    VEBA_DATABASE=/projects/navarro_lab/databases/veba_db
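The literal `VEBA-*` in the messages above is what a shell prints when a glob pattern matches nothing: by default the pattern is passed through unexpanded. Here that would mean no VEBA environments exist under `/packages/mambaforge/23.3.1/envs/`. The sketch below reproduces the behaviour; it assumes the script loops over a pattern of the form `${CONDA_BASE}/envs/VEBA-*`, which is inferred from the printed path and not confirmed against the script itself. `expand_env_glob` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: show what a loop over "<base>/envs/VEBA-*" iterates on.
# With the shell's default settings (no nullglob), an unmatched pattern
# is left as a literal string.
# expand_env_glob is a hypothetical helper, not part of VEBA.
expand_env_glob() {
    local base="$1"
    local path
    for path in "${base}"/envs/VEBA-*; do
        echo "${path}"
    done
}

# Example call (base path taken from the log above):
# expand_env_glob /packages/mambaforge/23.3.1
```

When the base directory does contain matching environments, the same loop prints one real path per environment instead of the pattern.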

Thanks a lot for your help!

jolespin commented 11 months ago

Just to be clear, you made those veba.sh files in the activate.d/deactivate.d directories, then ran update_environment_variables.sh, and everything worked as expected?

javiercnav commented 11 months ago

OK. I figured it out!

Let me first say the following: I am using VEBA on a university high-performance computing cluster. There, my home directory has limited storage (10 GB), which is not enough to create the VEBA environments, let alone to download the associated databases. To install VEBA, I had to redirect the installation of my conda environments to /projects/navarro_lab/envs and the database download to /projects/navarro_lab/databases/veba_db. Because of that, I had to modify your scripts so that they look for my conda environments in their actual location.

For example, the script update_environment_variables.sh was modified from:

#!/bin/bash
# __version__ = "2023.6.14"

# Create database
DATABASE_DIRECTORY=${1:-"."}
REALPATH_DATABASE_DIRECTORY=$(realpath $DATABASE_DIRECTORY)
# CONDA_BASE=$(which conda | python -c "import sys; print('/'.join(sys.stdin.read().split('/')[:-2]))")
CONDA_BASE=$(conda run -n base bash -c "echo \${CONDA_PREFIX}")

to:

#!/bin/bash
# __version__ = "2023.6.14"

module load mambaforge

# Create database
DATABASE_DIRECTORY=${1:-"."}
REALPATH_DATABASE_DIRECTORY=$(realpath $DATABASE_DIRECTORY)
# CONDA_BASE=$(which conda | python -c "import sys; print('/'.join(sys.stdin.read().split('/')[:-2]))")
CONDA_BASE="$(conda info | grep -oP '(?<=envs directories : ).*$')"
CONDA_BASE="${CONDA_BASE%/*}"
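The two modified lines first extract the path that `conda info` reports after `envs directories :`, then strip the last path component with `${CONDA_BASE%/*}`, so that a later `${CONDA_BASE}/envs/...` points back at the custom environments directory. The sketch below walks through that on sample text. The sample is illustrative only (the second directory is made up), and `sed` is used as a portable stand-in for the `grep -oP` lookbehind; `envs_parent_from_info` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: derive the parent of the first "envs directories" entry
# from `conda info`-style text read on stdin.
# envs_parent_from_info is a hypothetical helper, not part of VEBA.
envs_parent_from_info() {
    local envs_dir
    # Keep only the path that follows "envs directories : " (first match).
    envs_dir="$(sed -n 's/^ *envs directories : //p' | head -n 1)"
    # Drop the final path component, e.g. ".../envs" -> "...".
    echo "${envs_dir%/*}"
}

# Illustrative sample; real `conda info` output has more fields.
sample_info='       active environment : base
         envs directories : /projects/navarro_lab/envs
                            /home/user/.conda/envs'

printf '%s\n' "${sample_info}" | envs_parent_from_info
# prints /projects/navarro_lab
```

This only works when the environments directory is itself named `envs`; with a differently named directory, stripping the last component and re-appending `/envs` would point somewhere else.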

I re-ran update_environment_variables_mod.sh and got:

. .. ... ..... ........ .............
i * Adding the following environment variable to VEBA environments: export VEBA_DATABASE=/scratch/jc4675/veba/install
/projects/navarro_lab/envs/VEBA-annotate_env
/projects/navarro_lab/envs/VEBA-assembly_env
/projects/navarro_lab/envs/VEBA-binning-eukaryotic_env
/projects/navarro_lab/envs/VEBA-binning-prokaryotic_env
/projects/navarro_lab/envs/VEBA-binning-viral_env
/projects/navarro_lab/envs/VEBA-biosynthetic_env
/projects/navarro_lab/envs/VEBA-classify_env
mkdir: created directory '/projects/navarro_lab/envs/VEBA-classify_env/etc/conda'
mkdir: created directory '/projects/navarro_lab/envs/VEBA-classify_env/etc/conda/activate.d/'
mkdir: created directory '/projects/navarro_lab/envs/VEBA-classify_env/etc/conda/deactivate.d/'
/projects/navarro_lab/envs/VEBA-cluster_env
mkdir: created directory '/projects/navarro_lab/envs/VEBA-cluster_env/etc/conda'
mkdir: created directory '/projects/navarro_lab/envs/VEBA-cluster_env/etc/conda/activate.d/'
mkdir: created directory '/projects/navarro_lab/envs/VEBA-cluster_env/etc/conda/deactivate.d/'
/projects/navarro_lab/envs/VEBA-database_env
/projects/navarro_lab/envs/VEBA-mapping_env
/projects/navarro_lab/envs/VEBA-phylogeny_env
mkdir: created directory '/projects/navarro_lab/envs/VEBA-phylogeny_env/etc'
mkdir: created directory '/projects/navarro_lab/envs/VEBA-phylogeny_env/etc/conda'
mkdir: created directory '/projects/navarro_lab/envs/VEBA-phylogeny_env/etc/conda/activate.d/'
mkdir: created directory '/projects/navarro_lab/envs/VEBA-phylogeny_env/etc/conda/deactivate.d/'
/projects/navarro_lab/envs/VEBA-preprocess_env
. .. ... ..... ........ .............
xiii * Adding the following environment variable to VEBA environments: export CHECKM2DB=/scratch/jc4675/veba/install/Classify/CheckM2/uniref100.KO.1.dmnd
. .. ... ..... ........ .............
xiv * Adding the following environment variable to VEBA environments: export GTDBTK_DATA_PATH=/scratch/jc4675/veba/install/Classify/GTDB/
. .. ... ..... ........ .............
xv * Adding the following environment variable to VEBA environments: export CHECKVDB=/scratch/jc4675/veba/install/Classify/CheckV/
 _    _ _______ ______  _______
  \  /  |______ |_____] |_____|
   \/   |______ |_____] |     |
.........................................
  Environment Variable Update Complete
.........................................
The VEBA database environment variable is set in your VEBA conda environments:
    VEBA_DATABASE=/scratch/jc4675/veba/install
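Note that the path reported in this run (/scratch/jc4675/veba/install) differs from the database location used earlier in the thread (/projects/navarro_lab/databases/veba_db). One possible explanation, which the thread does not confirm, is that the script was run without an argument: the line `DATABASE_DIRECTORY=${1:-"."}` falls back to the current directory in that case. The sketch below isolates that default; `resolve_database_directory` is a hypothetical helper that mirrors the two lines from the script.

```shell
#!/usr/bin/env bash
# Sketch: how the script's first two assignments resolve the database path.
# resolve_database_directory is a hypothetical helper mirroring
# DATABASE_DIRECTORY=${1:-"."} followed by realpath.
resolve_database_directory() {
    # Use the first argument, or "." when no argument is given.
    local database_directory="${1:-.}"
    realpath "${database_directory}"
}

# Example calls:
# resolve_database_directory                                           # current working directory
# resolve_database_directory /projects/navarro_lab/databases/veba_db/  # the given path
```

If that is what happened, re-running the script with the database path as its argument and then checking `echo ${VEBA_DATABASE}` inside an activated environment would confirm the value.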

Now check_installation.sh shows that each element of veba has passed the checkup.

[Pass] VEBA-annotate_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-annotate_env
[Pass] VEBA-assembly_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-assembly_env
[Pass] VEBA-binning-eukaryotic_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-binning-eukaryotic_env
[Pass] VEBA-binning-prokaryotic_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-binning-prokaryotic_env
[Pass] VEBA-binning-viral_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-binning-viral_env
[Pass] VEBA-biosynthetic_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-biosynthetic_env
[Pass] VEBA-classify_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-classify_env
[Pass] VEBA-cluster_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-cluster_env
[Pass] VEBA-database_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-database_env
[Pass] VEBA-mapping_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-mapping_env
[Pass] VEBA-phylogeny_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-phylogeny_env
[Pass] VEBA-preprocess_env SUCCESSFULLY created.
[Pass] VEBA_DATABASE environment SUCCESSFULLY set in VEBA-preprocess_env
[Pass] CHECKM2DB environment SUCCESSFULLY set in VEBA-binning-prokaryotic_env
[Pass] GTDBTK_DATA_PATH environment SUCCESSFULLY set in VEBA-classify_env
[Pass] CHECKVDB environment SUCCESSFULLY set in VEBA-binning-viral_env

jolespin commented 11 months ago

This is great documentation! Thank you. I'll probably add some insight from this in the install guide. Closing the issue now but please feel free to reopen.