ndreey / CONURA_WGS

Metagenomic analysis on whole genome sequencing data from Tephritis conura (IN PROGRESS)
0 stars 0 forks source link

Anvio: the multi-omics tool #42

Open ndreey opened 1 month ago

ndreey commented 1 month ago

Following the installation of Anvio (#23) we can now run it by activating the mamba environment anvio-8.

Lets create a new directory for the anvio run that we will use as the working dir.

mkdir 07-ANVIO

Re-formatting the assembly fasta and mapping

To avoid complicated fasta headers we will use anvi-script-reformat-fasta to change the headers to a more standardized format that anvio can handle. This must be done before mapping the reads to the assembly. The script will then map and sort the .bam files for the new reformatted assemblies.

#!/bin/bash

#SBATCH --job-name anvio-alignment
#SBATCH --array=1-2
#SBATCH -A naiss2024-22-580
#SBATCH -p core -n 8
#SBATCH -t 00:30:00
#SBATCH --output=slurm-logs/anvio/reformat-alignment/SLURM-%j-ref-align-%a.out
#SBATCH --error=slurm-logs/anvio/reformat-alignment/SLURM-%j-ref-align-%a.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL

# Start time and date
echo "$(date)     [Start]"

# Activate the environment
mamba activate anvio-8

# SLURM array jobid
JOBID=${SLURM_ARRAY_TASK_ID}

# Assign POP based on JOBID
if [ "$JOBID" -eq 1 ]; then
    POP="CHST"
elif [ "$JOBID" -eq 2 ]; then
    POP="COGE"
else
    echo "Invalid JOBID: $JOBID"
    exit 1
fi

# Move to the anvio working directory
cd 07-ANVIO

# Generate folder for reformated fasta
FIX_DIR=00-FIXED-ASSEMBLY
if [ ! -d "${FIX_DIR}" ]; then
    mkdir -p ${FIX_DIR}
fi

# Reformat the contigs for CHST and COGE populations, keeping only 
# contigs equal or above 1kbp.
anvi-script-reformat-fasta \
    --output-file ${FIX_DIR}/${POP}-contigs.fa \
    --prefix ${POP}_ \
    --min-len 1000 \
    --simplify-names \
    ../06-ASSEMBLY/${POP}/contigs.fasta

# Create directory for bam files
ALIGN_DIR=01-ALIGNMENT
if [ ! -d "${ALIGN_DIR}" ]; then
    mkdir -p ${ALIGN_DIR}
fi

# Build bowtie2 index with POP as prefix
bowtie2-build --threads 8 ${FIX_DIR}/${POP}-contigs.fa ${FIX_DIR}/${POP} 

# R1 and R2 reads
READ_DIR="../05-CLEAN-MERGED"
R1=${READ_DIR}/${POP}_R1-clean.fq.gz
R2=${READ_DIR}/${POP}_R2-clean.fq.gz

# Map the clean reads to the reformatted contigs file.
# --no-unal suppresses SAM records for unaligned reads
bowtie2 \
    --threads 8 \
    -x ${FIX_DIR}/${POP} \
    -1 $R1 \
    -2 $R2 \
    --no-unal \
    -S ${ALIGN_DIR}/${POP}.sam

# Convert to bam and exclude all unmapped reads
samtools view \
    -@ 8 \
    -F 4 \
    -b \
    ${ALIGN_DIR}/${POP}.sam > ${ALIGN_DIR}/${POP}-RAW.bam

# Sort and creates index of BAM files
anvi-init-bam \
    -o ${ALIGN_DIR}/${POP}.bam \
    --num-threads 8 \
    ${ALIGN_DIR}/${POP}-RAW.bam

# Remove the .sam and raw bam file
rm ${ALIGN_DIR}/${POP}.sam ${ALIGN_DIR}/${POP}-RAW.bam

# End time and date
echo "$(date)       [End]"

CHST results

WHAT HAPPENED
===============================================
Contigs removed ..............................: 29,258 (94.36% of all)
Nucleotides removed ..........................: 11,360,541 (61.81% of all)
Nucleotides modified .........................: 0 (0.00000% of all)
Deflines simplified ..........................: True
Building a SMALL index
Renaming 00-FIXED-ASSEMBLY/CHST.3.bt2.tmp to 00-FIXED-ASSEMBLY/CHST.3.bt2
Renaming 00-FIXED-ASSEMBLY/CHST.4.bt2.tmp to 00-FIXED-ASSEMBLY/CHST.4.bt2
Renaming 00-FIXED-ASSEMBLY/CHST.1.bt2.tmp to 00-FIXED-ASSEMBLY/CHST.1.bt2
Renaming 00-FIXED-ASSEMBLY/CHST.2.bt2.tmp to 00-FIXED-ASSEMBLY/CHST.2.bt2
Renaming 00-FIXED-ASSEMBLY/CHST.rev.1.bt2.tmp to 00-FIXED-ASSEMBLY/CHST.rev.1.bt2
Renaming 00-FIXED-ASSEMBLY/CHST.rev.2.bt2.tmp to 00-FIXED-ASSEMBLY/CHST.rev.2.bt2
960604 reads; of these:
  960604 (100.00%) were paired; of these:
    414661 (43.17%) aligned concordantly 0 times
    492148 (51.23%) aligned concordantly exactly 1 time
    53795 (5.60%) aligned concordantly >1 times
    ----
    414661 pairs aligned concordantly 0 times; of these:
      47285 (11.40%) aligned discordantly 1 time
    ----
    367376 pairs aligned 0 times concordantly or discordantly; of these:
      734752 mates make up the pairs; of these:
        694024 (94.46%) aligned 0 times
        24109 (3.28%) aligned exactly 1 time
        16619 (2.26%) aligned >1 times
63.88% overall alignment rate

COGE results

WHAT HAPPENED
===============================================
Contigs removed ..............................: 147,623 (97.67% of all)
Nucleotides removed ..........................: 48,096,629 (87.01% of all)
Nucleotides modified .........................: 0 (0.00000% of all)
Deflines simplified ..........................: True
Building a SMALL index
Renaming 00-FIXED-ASSEMBLY/COGE.3.bt2.tmp to 00-FIXED-ASSEMBLY/COGE.3.bt2
Renaming 00-FIXED-ASSEMBLY/COGE.4.bt2.tmp to 00-FIXED-ASSEMBLY/COGE.4.bt2
Renaming 00-FIXED-ASSEMBLY/COGE.1.bt2.tmp to 00-FIXED-ASSEMBLY/COGE.1.bt2
Renaming 00-FIXED-ASSEMBLY/COGE.2.bt2.tmp to 00-FIXED-ASSEMBLY/COGE.2.bt2
Renaming 00-FIXED-ASSEMBLY/COGE.rev.1.bt2.tmp to 00-FIXED-ASSEMBLY/COGE.rev.1.bt2
Renaming 00-FIXED-ASSEMBLY/COGE.rev.2.bt2.tmp to 00-FIXED-ASSEMBLY/COGE.rev.2.bt2
1951469 reads; of these:
  1951469 (100.00%) were paired; of these:
    1203231 (61.66%) aligned concordantly 0 times
    703852 (36.07%) aligned concordantly exactly 1 time
    44386 (2.27%) aligned concordantly >1 times
    ----
    1203231 pairs aligned concordantly 0 times; of these:
      84367 (7.01%) aligned discordantly 1 time
    ----
    1118864 pairs aligned 0 times concordantly or discordantly; of these:
      2237728 mates make up the pairs; of these:
        2172026 (97.06%) aligned 0 times
        55280 (2.47%) aligned exactly 1 time
        10422 (0.47%) aligned >1 times
44.35% overall alignment rate
ndreey commented 1 month ago

Creating an anvio contigs database

An anvio contigs database will keep all the information related to your contigs: positions of open reading frames, k-mer frequencies for each contigs, where splits start and end, functional and taxonomic annotation of genes, etc. The contigs database is an essential component of everything related to anvi’o metagenomic workflow. The anvi-gen-contigs-database command will:

Furthermore we will look for SCG using hidden makarov models, get functional annotation using NCBI-COGS as well as getting taxonomy for the contigs. I used this script: contig-db.sh

#!/bin/bash

#SBATCH --job-name anvio-contigdb
#SBATCH --array=1-2
#SBATCH -A naiss2024-22-580
#SBATCH -p core -n 12
#SBATCH -t 01:30:00
#SBATCH --output=slurm-logs/anvio/contig-db/SLURM-%j-contigdb-%a.out
#SBATCH --error=slurm-logs/anvio/contig-db/SLURM-%j-contigdb-%a.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL

# Start time and date
echo "$(date)     [Start]"

# Activate the environment
mamba activate anvio-8

# SLURM array jobid
JOBID=${SLURM_ARRAY_TASK_ID}

# Assign POP based on JOBID
if [ "$JOBID" -eq 1 ]; then
    POP="CHST"
elif [ "$JOBID" -eq 2 ]; then
    POP="COGE"
else
    echo "Invalid JOBID: $JOBID"
    exit 1
fi

# Paths and variables
FIX_DIR=00-FIXED-ASSEMBLY
DB_DIR=02-CONTIG-DB
CDB=${DB_DIR}/${POP}.db
NUM_THREADS=12
COG_DIR=/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/databases/ncbi-cogs
SCG_DIR=/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/databases/scg/

# Move to the anvio working directory
cd 07-ANVIO

# Generate folder for contig database
if [ ! -d "${DB_DIR}" ]; then
    mkdir -p ${DB_DIR}
fi

# Generate the anvio contig database
anvi-gen-contigs-database \
    --contigs-fasta ${FIX_DIR}/${POP}-contigs.fa \
    --project-name Tconura_${POP} \
    --num-threads ${NUM_THREADS} \
    --output-db-path ${DB_DIR}/${POP}.db \
    --description ../doc/short-project-desc.txt

# Run hidden makarov models.
anvi-run-hmms \
    --contigs-db ${CDB} \
    --also-scan-trnas \
    --num-threads ${NUM_THREADS}

# Annotate the genes with functions from COG database
anvi-run-ncbi-cogs \
    --contigs-db ${CDB} \
    --cog-version COG20 \
    --cog-data-dir ${COG_DIR} \
    --num-threads ${NUM_THREADS}

# Classify the taxa for the database
anvi-run-scg-taxonomy \
    --contigs-db ${CDB} \
    --scgs-taxonomy-data-dir ${SCG_DIR} \
    --num-threads ${NUM_THREADS}

# Estimate the taxonomy
anvi-estimate-scg-taxonomy \
    --contigs-db ${CDB} \
    --metagenome-mode \
    --num-threads ${NUM_THREADS}

# End time and date
echo "$(date)  ${POP}    [End]"
ndreey commented 1 month ago

Generating an anvio profile

In contrast to the contigs-db, an anvi’o single-profile-db stores sample-specific information about contigs. Profiling a BAM file with anvi’o using anvi-profile creates a single profile that reports properties for each contig in a single sample based on mapping results.

The profile will include:

Where we can then add the binning results.

profile-db.sh

#!/bin/bash

#SBATCH --job-name anvio-profile
#SBATCH --array=1-2
#SBATCH -A naiss2024-22-580
#SBATCH -p core -n 12
#SBATCH -t 01:30:00
#SBATCH --output=slurm-logs/anvio/profile-db/SLURM-%j-profile-%a.out
#SBATCH --error=slurm-logs/anvio/profile-db/SLURM-%j-profile-%a.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL

# Activate the environment
mamba activate anvio-8

# SLURM array jobid
JOBID=${SLURM_ARRAY_TASK_ID}

# Assign POP based on JOBID
if [ "$JOBID" -eq 1 ]; then
    POP="CHST"
elif [ "$JOBID" -eq 2 ]; then
    POP="COGE"
else
    echo "Invalid JOBID: $JOBID"
    exit 1
fi

# Start time and date
echo "$(date)       ${POP}     [Start]"

# Move to the anvio working directory
cd 07-ANVIO

# Paths and variables
PROF_DB=03-PROFILES/${POP}
BAM_IN=01-ALIGNMENT/${POP}.bam
CDB=02-CONTIG-DB/${POP}.db
DESC=../doc/short-project-desc.txt
MIN_LEN=1500
NUM_THREADS=12

anvi-profile \
    --input-file ${BAM_IN} \
    --contigs-db ${CDB} \
    --sample-name ${POP} \
    --description ${DESC} \
    --min-contig-length ${MIN_LEN} \
    --output-dir ${PROF_DB} \
    --num-threads ${NUM_THREADS}

# End time and date
echo "$(date)       ${POP}      [End]"

Binning

The common approach is to use the binners: CONCOCT, METABAT2, and MAXBIN2 and then refine them with DASTOOL. Thus, the trick here is to bin these outside of anvio and then import the refined bins to anvio. Furthermore, we can check the quality of these bins using CheckM2 and BUSCO.

UPPMAX has these modules:

CONCOCT has a detailed and well documented manual where MaxBin2 and MetaBat2 has lackluster documentation. Hence, i will use nf-core/mag and metaWRAP to bin and then manually refine the bins with DASTOOL and CheckM2, as well to compare with metaWRAP's refinment module.

NOTEWORTHY: The workflow will discard all contigs below 1500bp (MaxBin2, default). Furthermore, metaWRAP and nf-core/mag hand unexpected errors: -nf-core/mag: pigz: skipping: CHST-contigs.fa is a symbolic link.

nf-core/mag

Lets create output directories and a working directory. The steps below is focused for CHST, to do COGE just replace everything that is CHST.

# Create binning directories for output.
mkdir -p 07-ANVIO/nf-core-mag-work/{CHST,COGE}
mkdir -p 07-ANVIO/04-BINNING/{CHST,COGE}

# Go to the nf-core work directory for the population you will run
cd 07-ANVIO/nf-core-mag-work/CHST

# Activate the nextflow nf-core environment
mamba activate nf-core

# Start the workflow
nextflow run nf-core/mag -r 3.0.0 -bg -profile uppmax -params-fil
e ../../doc/anvio-nf-core-binning/chst-bin-params.yml --project naiss2024-22-580 > chst-bg.log

The following files exists for both CHST and COGE. Showing only CHST.

chst-bin-params.yml

input: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/doc/anvio-nf-core-binning/chst-samplesheet-reads.csv
assembly_input: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/doc/anvio-nf-core-binning/chst-samplesheet-assembly.csv
outdir: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/07-ANVIO/04-BINNING/CHST
email: andbou95@gmail.com
multiqc_title: CHST-BINNING
skip_binqc: true
skip_prodigal: true
skip_prokka: true
skip_metaeuk: true

chst-samplesheet-reads.csv

sample,group,short_reads_1,short_reads_2,long_reads
CHST,0,/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/05-CLEAN-MERGED/CHST_R1-clean.fq.gz,/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/05-CLEAN-MERGED/CHST_R2-clean.fq.gz

chst-samplesheet-assembly.csv

id,group,assembler,fasta
CHST,0,SPAdesHybrid,/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/07-ANVIO/00-FIXED-ASSEMBLY/CHST-contigs.fa

These files can be found in CONURA_WGS/doc/anvio-nf-core-binning/.

metaWRAP

Run the script metaWRAP-bin.sh. What is special is that it generates temporary versions of the .fastq files so that it can handle them as input.

#!/bin/bash

#SBATCH --job-name metaWRAP
#SBATCH --array=1-2
#SBATCH -A naiss2024-22-580
#SBATCH -p node -n 1
#SBATCH -C mem256GB 
#SBATCH -t 8:00:00
#SBATCH --output=slurm-logs/binning/SLURM-%j-metaWRAP-binning-%a.out
#SBATCH --error=slurm-logs/binning/SLURM-%j-metaWRAP-binning-%a.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL

# Load in modules
module load bioinfo-tools
module load metaWRAP/1.3.2

# SLURM array jobid
JOBID=${SLURM_ARRAY_TASK_ID}

# Assign POP based on JOBID
if [ "$JOBID" -eq 1 ]; then
    POP="CHST"
elif [ "$JOBID" -eq 2 ]; then
    POP="COGE"
else
    echo "Invalid JOBID: $JOBID"
    exit 1
fi

# Start time and date
echo "$(date)       ${POP}     [Start]"

# Move to Anvio work
cd 07-ANVIO

# Paths and variables
NUM_THREADS=16
OUT_DIR=05-metaWRAP/${POP}
ASSEMBLY=00-FIXED-ASSEMBLY/${POP}-contigs.fa
R1=/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/05-CLEAN-MERGED/${POP}_R1-clean.fq.gz
R2=/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/05-CLEAN-MERGED/${POP}_R2-clean.fq.gz
TMP_R1=${OUT_DIR}/tmp-reads/${POP}_1.fastq
TMP_R2=${OUT_DIR}/tmp-reads/${POP}_2.fastq

# Generate folder for metaWRAP + tmp-reads
if [ ! -d "${OUT_DIR}" ]; then
    mkdir -p ${OUT_DIR}
fi
mkdir -p ${OUT_DIR}/tmp-reads/

# metawrap requires *_1.fastq as format. Hence, tmp versions of files
zcat ${R1} > ${TMP_R1}
zcat ${R2} > ${TMP_R2}

metawrap binning \
    -a ${ASSEMBLY} \
    -o ${OUT_DIR} \
    -t ${NUM_THREADS} \
    -m 256 \
    --metabat2 --maxbin2 --concoct \
    ${TMP_R1} ${TMP_R2}

# Remove the temporary files
rm -r ${OUT_DIR}/tmp-reads/

# End time and date
echo "$(date)  ${POP}    [End]"

Results

WE get the same results for MaxBin2 and MetaBAT2 for the different populations. The only difference is found with CONCOCT as the nf-core/mag workflow generated 100+ more bins compared to metaWRAP. program CONCOCT MaxBin2 MetaBAT2 POP
nf-core/mag 192 3 3 CHST
nf-core/mag 195 2 3 COGE
metaWRAP 40 3 3 CHST
metaWRAP 35 2 3 COGE

Bin refinement

Quality can be assesed with CheckM2 and BUSCO. The reason for not using CheckM is that it relies heavily on marker gene sets from well-studied organisms. This can be erroneous for our case as we are interested in the facultative and obligate symbionts which have reduced genomes where these universal marker genes might be lost. CheckM2 utilizes machine learning to get a wider analysis to assert the quality of the MAG. Thus, we will use BUSCO as a compliment to strenghten the QC for known symbionts and CheckM2 for Stammerula.

Lets install DASTOOL and CheckM2 so it is out of the way. BUSCO already exists as a module.

mamba create -n dastool das_tool
mamba create -n checkm2 checkm2

Sending scripts to SLURM has been troublesome as sometimes mamba activate <env> works and sometimes it does not. Hence, an interactive session was started and this script was run bash checkm2-check.sh <population>.

#!/bin/bash

# Takes the population input
POP=$1

# Start time and date
echo "$(date)       ${POP}     [Start]"

# Move to anvio working dir
cd /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/07-ANVIO

# Paths and variables
NF_DIR=04-BINNING
WRAP_DIR=05-metaWRAP
NUM_THREADS=12
CHECKM2_DB=/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/databases/checkm2-db/CheckM2_database/uniref100.KO.1.dmnd

for BINNER in CONCOCT MaxBin2 MetaBAT2; do
    # Paths
    BIN_DIR=${NF_DIR}/${POP}/GenomeBinning/${BINNER}/bins
    OUT_DIR=07-CHECKM2/${POP}/nf-core/${BINNER}
    # Create the checkm2 directory
    if [ ! -d "${OUT_DIR}" ]; then
        mkdir -p ${OUT_DIR}
    fi

    checkm2 predict \
        --input ${BIN_DIR} \
        --output-directory ${OUT_DIR} \
        --threads ${NUM_THREADS} \
        --allmodels \
        --extension gz \
        --database_path ${CHECKM2_DB}

done

for BINNER in concoct_bins maxbin2_bins metabat2_bins; do
    BIN_DIR=${WRAP_DIR}/${POP}/${BINNER}
    OUT_DIR=07-CHECKM2/${POP}/metaWRAP/${BINNER}
    # Create the checkm2 directory
    if [ ! -d "${OUT_DIR}" ]; then
        mkdir -p ${OUT_DIR}
    fi    

    checkm2 predict \
        --input ${BIN_DIR} \
        --output-directory ${OUT_DIR} \
        --threads ${NUM_THREADS} \
        --allmodels \
        --extension fa \
        --database_path ${CHECKM2_DB}
done
# End time and date
echo "$(date)       ${POP}      [End]"

The results are shown below:

TOOL BINNER BIN COMP. CONT. CODING_DENSITY N50 Avg.Gene.Len Genome_Size GC CDS
metaWRAP MaxBin2 bin.0 99% 0.99 0.756 145436 250 2239073 0.44 2264
metaWRAP MaxBin2 bin.1 98% 2.75 0.517 1996 180 2676325 0.37 2567
metaWRAP MaxBin2 bin.2 40% 3.19 0.803 15272 206 2091336 0.54 2733

to be continued...