ndreey / CONURA_WGS

Metagenomic analysis on whole genome sequencing data from Tephritis conura (IN PROGRESS)

Installing Anvio-8 on Rackham #23

Open ndreey opened 7 months ago

ndreey commented 7 months ago

Anvio-8 (CHECK LAST COMMENT FOR BEST INSTALL)

Anvio is not available as a module on Rackham and has to be installed manually in one's own conda environment. The steps to install Anvio were gathered from:

Conda

Because solving the package dependencies can be resource-intensive, I started an interactive session:

interactive -A naiss2023-22-412 -p core -n 2 -t 01:30:00

The first step is to set $CONDA_ENVS_PATH in your .bashrc file.

# Added /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/.conda/envs
vim ~/.bashrc

# Reload .bashrc
source ~/.bashrc
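
The edit to .bashrc can also be scripted. A minimal sketch, assuming the line to add is an export of CONDA_ENVS_PATH; the helper name add_conda_envs_path is hypothetical and not part of conda:

```shell
# Hypothetical helper (not part of conda): append the export line to an rc
# file only if that exact line is not already present.
add_conda_envs_path() {
    rc_file="$1"
    envs_dir="$2"
    line="export CONDA_ENVS_PATH=\"${envs_dir}\""
    if ! grep -qxF -- "$line" "$rc_file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$rc_file"
    fi
}

# Example call (commented out so that nothing is changed by accident):
# add_conda_envs_path ~/.bashrc /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/.conda/envs
```

Running the helper twice leaves a single export line in the file, so it is safe to repeat.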

With that set, we can load conda.

module load bioinfo-tools
module load conda/latest

I created a .yaml file with the required packages.

anvio-8.yaml

name: anvio-8
channels:
  - bioconda
  - conda-forge
dependencies:
  - python=3.10
  - sqlite
  - prodigal
  - idba
  - mcl
  - muscle=3.8.1551
  - famsa
  - hmmer
  - diamond
  - blast
  - megahit
  - spades
  - bowtie2
  - bwa
  - graphviz
  - "samtools>=1.9"
  - trimal
  - iqtree
  - trnascan-se
  - fasttree
  - vmatch
  - r-base
  - r-tidyverse
  - r-optparse
  - r-stringi
  - r-magrittr
  - bioconductor-qvalue
  - meme
  - ghostscript
  - fastani

Now we can set up the environment. As I am not 100% sure how well mamba works on Rackham, I decided to use conda even though it is slower.

This command tells conda where to install the environment and which packages to include:

conda env create --prefix $CONDA_ENVS_PATH/ --file doc/anvio-8.yaml

After a while it finished, although because I used --prefix the environment did not get a name.

conda env list
# conda environments:
#
                         /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/.conda/envs
base                  *  /sw/apps/conda/latest/rackham_stage

Now to install anvio.

>curl -L https://github.com/merenlab/anvio/releases/download/v8/anvio-8.tar.gz \
        --output anvio-8.tar.gz

>pip install anvio-8.tar.gz

And... SUCCESS !

>anvi-self-test --suite mini --no-interactive -T 2

Misc data reported for layers ................: default, t_domain, t_family, t_class, t_genus, t_phylum, t_order, t_species

Misc data reported for items .................: None

HTML Output ..................................: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/anvi_test/SAMPLES-MERGED-SUMMARY/index.html

* The self-test is done, and all the files anvi'o generated are stored in
  anvi_test/

However...

As mentioned before, an environment created with --prefix does not get a name, so the prompt shows the full path when the environment is activated:

(/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/.conda/envs) andbou@r174: Andre:

Furthermore, I should have specified --prefix $CONDA_ENVS_PATH/anvio, as anvio is now installed directly in .conda/envs and not in its own directory. Either way, the prompt could be made shorter by setting $CONDA_ENVS_PATH to a directory higher up in the project, or by installing conda or mamba according to: https://hackmd.io/@pmitev/conda_on_Rackham.
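
A sketch of what the corrected call might look like, assuming the environment should live in its own anvio subdirectory. The fallback to ~/.conda/envs and the env_prompt setting are assumptions and were not part of the original run; env_prompt is a conda configuration option that controls what is shown in the prompt.

```shell
# Assumed fallback so that the sketch also runs when CONDA_ENVS_PATH is unset.
envs_root="${CONDA_ENVS_PATH:-$HOME/.conda/envs}"

# Strip a trailing slash, then give the environment its own subdirectory.
env_prefix="${envs_root%/}/anvio"
echo "Environment prefix: ${env_prefix}"

# Only call conda when it is available and the YAML file is where the
# original command expected it.
if command -v conda >/dev/null 2>&1 && [ -f doc/anvio-8.yaml ]; then
    conda env create --prefix "${env_prefix}" --file doc/anvio-8.yaml
    # Show only the last path component in the prompt instead of the full path.
    conda config --set env_prompt '({name}) '
fi
```

Since anvio-8.yaml already sets name: anvio-8, leaving out --prefix altogether should also work: conda then creates a named environment under the first directory listed in $CONDA_ENVS_PATH.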

Also, when checking which binners are available, anvio can't find any, even though CONCOCT, for example, was specified in the environment.

>anvi-cluster-contigs -h

usage: anvi-cluster-contigs [-h] -p PROFILE_DB -c CONTIGS_DB -C COLLECTION_NAME --driver DRIVER [-T NUM_THREADS] [--log-file FILE_PATH] [--just-do-it]

A program to cluster items in a merged anvi'o profile using automatic binning algorithms

options:
  -h, --help            show this help message and exit
  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database
  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-contigs-database'
  -C COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Collection name.
  --driver DRIVER       Automatic binning drivers. Available options 'concoct, metabat2, maxbin2, dastool, binsanity'.
  -T NUM_THREADS, --num-threads NUM_THREADS
                        Maximum number of threads to use for multithreading whenever possible. Very conservatively, the default is 1. It is a good idea to not exceed the number of CPUs / cores on your system. Plus, please be careful with this option if you are
                        running your commands on a SGE --if you are clusterizing your runs, and asking for multiple threads to use, you may deplete your resources very fast.
  --log-file FILE_PATH  File path to store debug/output messages.
  --just-do-it          Don't bother me with questions or warnings, just do it.

CONCOCT [NOT FOUND]

METABAT2 [NOT FOUND]

MAXBIN2 [NOT FOUND]

DASTOOL [NOT FOUND]

BINSANITY [NOT FOUND]

If I run module load bioinfo-tools and module load CONCOCT/1.1.0, we get this error instead.

>anvi-cluster-contigs -h

Traceback (most recent call last):
  File "/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/.conda/envs/bin/anvi-cluster-contigs", line 12, in <module>
    import anvio.dbops as dbops
  File "/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/.conda/envs/lib/python3.10/site-packages/anvio/dbops.py", line 38, in <module>
    import anvio.contigops as contigops
  File "/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/.conda/envs/lib/python3.10/site-packages/anvio/contigops.py", line 15, in <module>
    from Bio import SeqIO
  File "/sw/bioinfo/CONCOCT/1.1.0/rackham/lib/python3.7/site-packages/biopython-1.77-py3.7-linux-x86_64.egg/Bio/SeqIO/__init__.py", line 382, in <module>
    from Bio.Align import MultipleSeqAlignment
  File "/sw/bioinfo/CONCOCT/1.1.0/rackham/lib/python3.7/site-packages/biopython-1.77-py3.7-linux-x86_64.egg/Bio/Align/__init__.py", line 21, in <module>
    from Bio.Align import _aligners

Unloading CONCOCT resolves this error.
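
The traceback shows anvio, running on Python 3.10, importing Biopython from the Python 3.7 site-packages of the CONCOCT module. That suggests the module puts its own directory on PYTHONPATH (an assumption, the module file was not inspected here). A small sketch to spot such entries; the function name foreign_pythonpath_entries is hypothetical:

```shell
# Hypothetical helper: print every colon-separated path entry that contains a
# "pythonX.Y" directory for a version other than the wanted one.
foreign_pythonpath_entries() {
    wanted="$1"   # Python version of the active environment, e.g. 3.10
    paths="$2"    # colon-separated list, e.g. "$PYTHONPATH"
    old_ifs="$IFS"
    IFS=':'
    for entry in $paths; do
        case "$entry" in
            */python"$wanted"/*|*/python"$wanted") ;;   # matches the wanted version
            */python[0-9]*) echo "$entry" ;;            # some other Python version
        esac
    done
    IFS="$old_ifs"
}

# Check the current shell; prints nothing when PYTHONPATH is empty or clean.
foreign_pythonpath_entries 3.10 "${PYTHONPATH:-}"
```

If this prints a path while a module is loaded, that module is a likely cause of the import error.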

Databases

There was no issue downloading the databases and setting them up. I will have to generate DB directories for each of them.

ndreey commented 7 months ago

Following Pavlin's approach

Let's begin by removing the .conda/envs directory, the variable, and the environment.

conda env remove --prefix /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/.conda/envs

I then create a bin directory in my working directory in the project.

I follow the steps at: https://hackmd.io/@pmitev/conda_on_Rackham

I choose to install miniforge3 here: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/miniforge3

After logging out, logging back in, and setting conda to initialize = false, I get these results!

andbou@rackham2: Andre: mamba activate

(base) andbou@rackham2: Andre: mamba doctor
Currently, only install, create, list, search, run, info, clean, remove, update, repoquery, activate and deactivate are supported through mamba.

(base) andbou@rackham2: Andre: mamba -V
mamba 1.5.7
conda 24.1.2

(base) andbou@rackham2: Andre: which mamba
/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/miniforge3/bin/mamba

(base) andbou@rackham2: Andre: which conda
/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/miniforge3/bin/conda

EUREKA, we have mamba installed!

Setting up Anvio-8 with mamba

I once again start an interactive session, activate mamba, and then run:

mamba env create --file CONURA_WGS/doc/anvio-8.yaml

Note: I am in my working directory and not the analysis directory CONURA_WGS. The command took about 8-10 minutes.

(base) andbou@r412: Andre: mamba env list
# conda environments:
#
base                  *  /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/miniforge3
anvio-8                  /home/andbou/.conda/envs/anvio-8

Then we download anvio-8.tar.gz and activate our new environment.

cd bin/

curl -L https://github.com/merenlab/anvio/releases/download/v8/anvio-8.tar.gz \
        --output anvio-8.tar.gz

mamba activate anvio-8

cd ..

# The tarball was downloaded into bin/, so install it from there
pip install bin/anvio-8.tar.gz
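
Before running pip install it can be worth confirming that pip belongs to the activated environment, since otherwise anvio ends up in whichever Python is first on PATH. A minimal sketch, assuming the environment was activated with conda or mamba so that CONDA_PREFIX is set; the function name pip_in_active_env is hypothetical:

```shell
# Hypothetical helper: succeed only if the pip on PATH lives inside the
# currently active conda environment.
pip_in_active_env() {
    [ -n "${CONDA_PREFIX:-}" ] || return 1
    pip_path="$(command -v pip 2>/dev/null)" || return 1
    case "$pip_path" in
        "$CONDA_PREFIX"/*) return 0 ;;
        *) return 1 ;;
    esac
}

if pip_in_active_env; then
    echo "pip comes from the active environment: ${CONDA_PREFIX}"
else
    echo "pip is not from an active conda environment; activate anvio-8 first"
fi
```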

Now let's set up the databases! I started an interactive session with 4 cores for this.

# Generates the specific db folders
mkdir -p databases/{scg,ncbi-cogs}

# Setting up NCBI COG
anvi-setup-ncbi-cogs --cog-data-dir databases/ncbi-cogs/ -T 4 --cog-version COG20 --reset

# Setting up SCG taxonomy database (removing old files)
anvi-setup-scg-taxonomy -T 4 --scgs-taxonomy-data-dir databases/scg/ --reset
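
A quick way to confirm that each setup command actually wrote something into its directory. This is only a sketch: the helper name dir_is_populated is hypothetical, and the check says nothing about whether the database contents are complete.

```shell
# Hypothetical helper: succeed if the directory exists and is not empty.
dir_is_populated() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Directory names follow the mkdir command above.
for db in scg ncbi-cogs; do
    if dir_is_populated "databases/${db}"; then
        echo "databases/${db}: populated"
    else
        echo "databases/${db}: empty or missing"
    fi
done
```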

As KEGG is a much larger database, I set it up with a batch script.

get-kegg.sh

#!/bin/bash

#SBATCH --job-name anvio-contigdb
#SBATCH -A naiss2024-22-580
#SBATCH -p core -n 6
#SBATCH -t 06:30:00
#SBATCH --output=slurm-logs/anvio/SLURM-%j-setup-kegg.out
#SBATCH --error=slurm-logs/anvio/SLURM-%j-setup-kegg.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL

# Start time and date
echo "$(date)     [Start]"

# Activate the environment
mamba activate anvio-8

anvi-setup-kegg-data \
    --mode all \
    --kegg-data-dir ../databases/kegg \
    -T 6 \
    --reset

# End time and date
echo "$(date)     [End]"
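
SLURM does not create missing directories for --output and --error, so the log directory has to exist before the job is submitted. A sketch of the submission, assuming get-kegg.sh sits in the directory the job is submitted from (the script's location is not stated above):

```shell
# Create the log directory referenced by the #SBATCH --output/--error lines.
mkdir -p slurm-logs/anvio

# Submit only when sbatch is available and the script is present.
if command -v sbatch >/dev/null 2>&1 && [ -f get-kegg.sh ]; then
    sbatch get-kegg.sh
else
    echo "sbatch or get-kegg.sh not found; nothing submitted"
fi
```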