Open ndreey opened 1 month ago
An anvio contigs database will keep all the information related to your contigs: positions of open reading frames, k-mer frequencies for each contigs, where splits start and end, functional and taxonomic annotation of genes, etc. The contigs database is an essential component of everything related to anvi’o metagenomic workflow.
The anvi-gen-contigs-database
command will:
--kmer-size
.Prodigal
. Furthermore we will look for SCG using hidden makarov models, get functional annotation using NCBI-COGS as well as getting taxonomy for the contigs. I used this script: contig-db.sh
#!/bin/bash
#SBATCH --job-name anvio-contigdb
#SBATCH --array=1-2
#SBATCH -A naiss2024-22-580
#SBATCH -p core -n 12
#SBATCH -t 01:30:00
#SBATCH --output=slurm-logs/anvio/contig-db/SLURM-%j-contigdb-%a.out
#SBATCH --error=slurm-logs/anvio/contig-db/SLURM-%j-contigdb-%a.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL
# Start time and date
echo "$(date) [Start]"
# Activate the environment
mamba activate anvio-8
# SLURM array jobid
JOBID=${SLURM_ARRAY_TASK_ID}
# Assign POP based on JOBID
if [ "$JOBID" -eq 1 ]; then
POP="CHST"
elif [ "$JOBID" -eq 2 ]; then
POP="COGE"
else
echo "Invalid JOBID: $JOBID"
exit 1
fi
# Paths and variables
FIX_DIR=00-FIXED-ASSEMBLY
DB_DIR=02-CONTIG-DB
CDB=${DB_DIR}/${POP}.db
NUM_THREADS=12
COG_DIR=/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/databases/ncbi-cogs
SCG_DIR=/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/databases/scg/
# Move to the anvio working directory
cd 07-ANVIO
# Generate folder for contig database
if [ ! -d "${DB_DIR}" ]; then
mkdir -p ${DB_DIR}
fi
# Generate the anvio contig database
anvi-gen-contigs-database \
--contigs-fasta ${FIX_DIR}/${POP}-contigs.fa \
--project-name Tconura_${POP} \
--num-threads ${NUM_THREADS} \
--output-db-path ${DB_DIR}/${POP}.db \
--description ../doc/short-project-desc.txt
# Run hidden makarov models.
anvi-run-hmms \
--contigs-db ${CDB} \
--also-scan-trnas \
--num-threads ${NUM_THREADS}
# Annotate the genes with functions from COG database
anvi-run-ncbi-cogs \
--contigs-db ${CDB} \
--cog-version COG20 \
--cog-data-dir ${COG_DIR} \
--num-threads ${NUM_THREADS}
# Classify the taxa for the database
anvi-run-scg-taxonomy \
--contigs-db ${CDB} \
--scgs-taxonomy-data-dir ${SCG_DIR} \
--num-threads ${NUM_THREADS}
# Estimate the taxonomy
anvi-estimate-scg-taxonomy \
--contigs-db ${CDB} \
--metagenome-mode \
--num-threads ${NUM_THREADS}
# End time and date
echo "$(date) ${POP} [End]"
In contrast to the contigs-db, an anvi’o single-profile-db stores sample-specific information about contigs. Profiling a BAM file with anvi’o using anvi-profile creates a single profile that reports properties for each contig in a single sample based on mapping results.
The profile will include:
Where we can then add the binning results.
profile-db.sh
#!/bin/bash
#SBATCH --job-name anvio-profile
#SBATCH --array=1-2
#SBATCH -A naiss2024-22-580
#SBATCH -p core -n 12
#SBATCH -t 01:30:00
#SBATCH --output=slurm-logs/anvio/profile-db/SLURM-%j-profile-%a.out
#SBATCH --error=slurm-logs/anvio/profile-db/SLURM-%j-profile-%a.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL
# Activate the environment
mamba activate anvio-8
# SLURM array jobid
JOBID=${SLURM_ARRAY_TASK_ID}
# Assign POP based on JOBID
if [ "$JOBID" -eq 1 ]; then
POP="CHST"
elif [ "$JOBID" -eq 2 ]; then
POP="COGE"
else
echo "Invalid JOBID: $JOBID"
exit 1
fi
# Start time and date
echo "$(date) ${POP} [Start]"
# Move to the anvio working directory
cd 07-ANVIO
# Paths and variables
PROF_DB=03-PROFILES/${POP}
BAM_IN=01-ALIGNMENT/${POP}.bam
CDB=02-CONTIG-DB/${POP}.db
DESC=../doc/short-project-desc.txt
MIN_LEN=1500
NUM_THREADS=12
anvi-profile \
--input-file ${BAM_IN} \
--contigs-db ${CDB} \
--sample-name ${POP} \
--description ${DESC} \
--min-contig-length ${MIN_LEN} \
--output-dir ${PROF_DB} \
--num-threads ${NUM_THREADS}
# End time and date
echo "$(date) ${POP} [End]"
The common approach is to use the binners: CONCOCT
, METABAT2
, and MAXBIN2
and then refine them with DASTOOL
. Thus, the trick here is to bin these outside of anvio and then import the refined bins to anvio. Furthermore, we can check the quality of these bins using CheckM2
and BUSCO
.
UPPMAX has these modules:
CONCOCT/1.1.0
which is the latest version.MetaBat/2.12.1
which is not the latest version but has the METABAT2
executable. The latest version is 2.15.2.MaxBin/2.2.7
which is MaxBin2 with the latest version.DASTOOL
and CheckM2
not available (will have to manually install).metaWRAP/1.3.2
which is a wrapper that includes the three binners however does not utilize CheckM2
for the quality control nor DASTOOL
for refinment.CONCOCT
has a detailed and well documented manual where MaxBin2
and MetaBat2
has lackluster documentation. Hence, i will use nf-core/mag
and metaWRAP
to bin and then manually refine the bins with DASTOOL
and CheckM2
, as well to compare with metaWRAP
's refinment module.
NOTEWORTHY: The workflow will discard all contigs below 1500bp (MaxBin2, default). Furthermore, metaWRAP
and nf-core/mag
hand unexpected errors:
-nf-core/mag
: pigz: skipping: CHST-contigs.fa is a symbolic link
.
metaWRAP
: Can only handle *_1.fastq
format, hence i needed to rename the files.Lets create output directories and a working directory. The steps below is focused for CHST, to do COGE just replace everything that is CHST.
# Create binning directories for output.
mkdir -p 07-ANVIO/nf-core-mag-work/{CHST,COGE}
mkdir -p 07-ANVIO/04-BINNING/{CHST,COGE}
# Go to the nf-core work directory for the population you will run
cd 07-ANVIO/nf-core-mag-work/CHST
# Activate the nextflow nf-core environment
mamba activate nf-core
# Start the workflow
nextflow run nf-core/mag -r 3.0.0 -bg -profile uppmax -params-fil
e ../../doc/anvio-nf-core-binning/chst-bin-params.yml --project naiss2024-22-580 > chst-bg.log
The following files exists for both CHST and COGE. Showing only CHST.
chst-bin-params.yml
input: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/doc/anvio-nf-core-binning/chst-samplesheet-reads.csv
assembly_input: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/doc/anvio-nf-core-binning/chst-samplesheet-assembly.csv
outdir: /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/07-ANVIO/04-BINNING/CHST
email: andbou95@gmail.com
multiqc_title: CHST-BINNING
skip_binqc: true
skip_prodigal: true
skip_prokka: true
skip_metaeuk: true
chst-samplesheet-reads.csv
sample,group,short_reads_1,short_reads_2,long_reads
CHST,0,/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/05-CLEAN-MERGED/CHST_R1-clean.fq.gz,/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/05-CLEAN-MERGED/CHST_R2-clean.fq.gz
chst-samplesheet-assembly.csv
id,group,assembler,fasta
CHST,0,SPAdesHybrid,/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/07-ANVIO/00-FIXED-ASSEMBLY/CHST-contigs.fa
These files can be found in CONURA_WGS/doc/anvio-nf-core-binning/
.
Run the script metaWRAP-bin.sh
. What is special is that it generates temporary versions of the .fastq
files so that it can handle them as input.
#!/bin/bash
#SBATCH --job-name metaWRAP
#SBATCH --array=1-2
#SBATCH -A naiss2024-22-580
#SBATCH -p node -n 1
#SBATCH -C mem256GB
#SBATCH -t 8:00:00
#SBATCH --output=slurm-logs/binning/SLURM-%j-metaWRAP-binning-%a.out
#SBATCH --error=slurm-logs/binning/SLURM-%j-metaWRAP-binning-%a.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL
# Load in modules
module load bioinfo-tools
module load metaWRAP/1.3.2
# SLURM array jobid
JOBID=${SLURM_ARRAY_TASK_ID}
# Assign POP based on JOBID
if [ "$JOBID" -eq 1 ]; then
POP="CHST"
elif [ "$JOBID" -eq 2 ]; then
POP="COGE"
else
echo "Invalid JOBID: $JOBID"
exit 1
fi
# Start time and date
echo "$(date) ${POP} [Start]"
# Move to Anvio work
cd 07-ANVIO
# Paths and variables
NUM_THREADS=16
OUT_DIR=05-metaWRAP/${POP}
ASSEMBLY=00-FIXED-ASSEMBLY/${POP}-contigs.fa
R1=/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/05-CLEAN-MERGED/${POP}_R1-clean.fq.gz
R2=/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/05-CLEAN-MERGED/${POP}_R2-clean.fq.gz
TMP_R1=${OUT_DIR}/tmp-reads/${POP}_1.fastq
TMP_R2=${OUT_DIR}/tmp-reads/${POP}_2.fastq
# Generate folder for metaWRAP + tmp-reads
if [ ! -d "${OUT_DIR}" ]; then
mkdir -p ${OUT_DIR}
fi
mkdir -p ${OUT_DIR}/tmp-reads/
# metawrap requires *_1.fastq as format. Hence, tmp versions of files
zcat ${R1} > ${TMP_R1}
zcat ${R2} > ${TMP_R2}
metawrap binning \
-a ${ASSEMBLY} \
-o ${OUT_DIR} \
-t ${NUM_THREADS} \
-m 256 \
--metabat2 --maxbin2 --concoct \
${TMP_R1} ${TMP_R2}
# Remove the temporary files
rm -r ${OUT_DIR}/tmp-reads/
# End time and date
echo "$(date) ${POP} [End]"
WE get the same results for MaxBin2 and MetaBAT2 for the different populations. The only difference is found with CONCOCT as the nf-core/mag workflow generated 100+ more bins compared to metaWRAP . |
program | CONCOCT | MaxBin2 | MetaBAT2 | POP |
---|---|---|---|---|---|
nf-core/mag | 192 | 3 | 3 | CHST | |
nf-core/mag | 195 | 2 | 3 | COGE | |
metaWRAP | 40 | 3 | 3 | CHST | |
metaWRAP | 35 | 2 | 3 | COGE |
Quality can be assesed with CheckM2
and BUSCO
. The reason for not using CheckM
is that it relies heavily on marker gene sets from well-studied organisms. This can be erroneous for our case as we are interested in the facultative and obligate symbionts which have reduced genomes where these universal marker genes might be lost. CheckM2
utilizes machine learning to get a wider analysis to assert the quality of the MAG. Thus, we will use BUSCO
as a compliment to strenghten the QC for known symbionts and CheckM2
for Stammerula.
Lets install DASTOOL
and CheckM2
so it is out of the way. BUSCO
already exists as a module.
mamba create -n dastool das_tool
mamba create -n checkm2 checkm2
Sending scripts to SLURM has been troublesome as sometimes mamba activate <env>
works and sometimes it does not. Hence, an interactive session was started and this script was run bash checkm2-check.sh <population>
.
#!/bin/bash
# Takes the population input
POP=$1
# Start time and date
echo "$(date) ${POP} [Start]"
# Move to anvio working dir
cd /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/07-ANVIO
# Paths and variables
NF_DIR=04-BINNING
WRAP_DIR=05-metaWRAP
NUM_THREADS=12
CHECKM2_DB=/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/databases/checkm2-db/CheckM2_database/uniref100.KO.1.dmnd
for BINNER in CONCOCT MaxBin2 MetaBAT2; do
# Paths
BIN_DIR=${NF_DIR}/${POP}/GenomeBinning/${BINNER}/bins
OUT_DIR=07-CHECKM2/${POP}/nf-core/${BINNER}
# Create the checkm2 directory
if [ ! -d "${OUT_DIR}" ]; then
mkdir -p ${OUT_DIR}
fi
checkm2 predict \
--input ${BIN_DIR} \
--output-directory ${OUT_DIR} \
--threads ${NUM_THREADS} \
--allmodels \
--extension gz \
--database_path ${CHECKM2_DB}
done
for BINNER in concoct_bins maxbin2_bins metabat2_bins; do
BIN_DIR=${WRAP_DIR}/${POP}/${BINNER}
OUT_DIR=07-CHECKM2/${POP}/metaWRAP/${BINNER}
# Create the checkm2 directory
if [ ! -d "${OUT_DIR}" ]; then
mkdir -p ${OUT_DIR}
fi
checkm2 predict \
--input ${BIN_DIR} \
--output-directory ${OUT_DIR} \
--threads ${NUM_THREADS} \
--allmodels \
--extension fa \
--database_path ${CHECKM2_DB}
done
# End time and date
echo "$(date) ${POP} [End]"
The results are shown below:
TOOL | BINNER | BIN | COMP. | CONT. | CODING_DENSITY | N50 | Avg.Gene.Len | Genome_Size | GC | CDS |
---|---|---|---|---|---|---|---|---|---|---|
metaWRAP | MaxBin2 | bin.0 | 99% | 0.99 | 0.756 | 145436 | 250 | 2239073 | 0.44 | 2264 |
metaWRAP | MaxBin2 | bin.1 | 98% | 2.75 | 0.517 | 1996 | 180 | 2676325 | 0.37 | 2567 |
metaWRAP | MaxBin2 | bin.2 | 40% | 3.19 | 0.803 | 15272 | 206 | 2091336 | 0.54 | 2733 |
to be continued...
Following the installation of Anvio (#23) we can now run it by activating the mamba environment
anvio-8
.Lets create a new directory for the anvio run that we will use as the working dir.
Re-formatting the assembly fasta and mapping
To avoid complicated fasta headers we will use
anvi-script-reformat-fasta
to change the headers to a more standardized format that anvio can handle. This must be done before mapping the reads to the assembly. The script will then map and sort the.bam
files for the new reformatted assemblies.CHST results
COGE results