Open ndreey opened 5 days ago
Create a new directory (07-MAG) and move into it.
Create a log directory to store .log files.
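The two setup steps above amount to the following. This is a minimal sketch that assumes you start from the project root, i.e. the directory that contains scripts/, so that ../scripts/ resolves from inside 07-MAG.

```shell
# Create the working directory together with its log directory, then enter it.
# Assumption: run from the project root (the directory containing scripts/).
mkdir -p 07-MAG/logs
cd 07-MAG
```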
Activate the `anvio-8` mamba environment and run the script below. Adjust the parameters to your session and use case.
bash ../scripts/mag-curation/reformat-fasta.sh \
-p all \
-t 12 \
-m all-hybrid &> logs/reformat-assembly-all.log
Usage: ../scripts/mag-curation/reformat-fasta.sh [-p pop] [-m metagenome] [-t num_threads]
-p pop Population name (required)
-m metagenome Metagenome assembly name (required)
-t num_threads Number of threads (required)
-l min_length Minimum contig length (optional, default: 2500)
This step took 13 minutes for all; CHST and COGE took about 2 min each.
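The -l option drops contigs below a minimum length. The internals of reformat-fasta.sh are not shown here, so the following only illustrates that kind of filter in plain shell and awk. The function name filter_fasta_by_length is hypothetical and not part of the repository.

```shell
# Hypothetical helper (not from the repository): keep FASTA records whose
# sequence is at least min_len bases long. Reads stdin, writes stdout.
filter_fasta_by_length() {
    local min_len="$1"
    awk -v min="${min_len}" '
        # Print the buffered record if it is long enough.
        function emit() {
            if (header != "" && length(seq) >= min) print header "\n" seq
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # a new record starts
        { seq = seq $0 }                               # join wrapped sequence lines
        END { emit() }
    '
}

# Toy input: c1 is 8 bp (kept with min_len=5), c2 is 2 bp (dropped).
printf '>c1\nACGT\nACGT\n>c2\nAC\n' | filter_fasta_by_length 5
```

With the default of 2500, the same call would be `filter_fasta_by_length 2500 < contigs.fa`.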
This will create the contig database that anvio requires and populate it with gene features.
As both all and CHST had a higher frequency of Ribosomal_S3_C hits, we specify that anvio should use this SCG for taxonomy estimation.
bash ../scripts/mag-curation/make-contig-db.sh \
-t 12 \
-p COGE \
-m 00-FIXED-ASSEMBLY/COGE/COGE-contigs.fa \
-s Ribosomal_S3_C &> logs/make-contig-db-COGE.log
Creates an anvio contig database
Usage: ../scripts/mag-curation/make-contig-db.sh [-p pop] [-m metagenome] [-t num_threads]
-p pop Population name (required)
-m metagenome Metagenome assembly name (required)
-t num_threads Number of threads (required)
-s scg_name SCG name to use for taxonomy (optional)
This step took 13 minutes for all; CHST and COGE took about 5 min each.
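make-contig-db.sh itself is not reproduced here. As a rough sketch of what such a wrapper might chain together, assuming it calls the standard anvi'o programs, the function below prints the commands instead of running them. The function name and the dry-run approach are illustrative assumptions; the database path follows the 02-CONTIG-DB/<pop>/<pop>.db layout used elsewhere in these notes.

```shell
# Hypothetical dry-run sketch: print the anvi'o commands a contig-db wrapper
# might run. Nothing is executed; drop the leading `echo` to run for real.
print_contig_db_steps() {
    local pop="$1" fasta="$2" threads="$3" scg="${4:-}"
    local db="02-CONTIG-DB/${pop}/${pop}.db"

    echo anvi-gen-contigs-database -f "${fasta}" -o "${db}" -n "${pop}" -T "${threads}"
    echo anvi-run-hmms -c "${db}" -T "${threads}"
    echo anvi-run-scg-taxonomy -c "${db}" -T "${threads}"
    if [ -n "${scg}" ]; then
        # Restrict taxonomy estimation to one SCG only when a name was given.
        echo anvi-estimate-scg-taxonomy -c "${db}" --metagenome-mode \
            --scg-name-for-metagenome-mode "${scg}"
    fi
}

print_contig_db_steps COGE 00-FIXED-ASSEMBLY/COGE/COGE-contigs.fa 12 Ribosomal_S3_C
```

Printing the commands first makes it easy to check paths and the chosen SCG before spending compute time.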
In contrast to the contigs-db, an anvi’o single-profile-db stores sample-specific information about contigs. Profiling a BAM file with anvi’o using anvi-profile creates a single profile that reports properties for each contig in a single sample based on mapping results.
The profile will include:
NOTE: At the moment, if all is not set, each profile will be clustered. If further analysis needs to inspect, for example, all CH vs. all CO populations, the code must be updated. At the moment the script can handle either all or a single population.
bash ../scripts/mag-curation/make-profile-db.sh \
-p COGE \
-t 12 \
-c 02-CONTIG-DB/COGE/COGE.db \
&> logs/make-profile-db-COGE.log
Creates an anvio profile database
Usage: ../scripts/mag-curation/make-profile-db.sh [-p pop] [-t num_threads] [-c contig_db] [-l min_len]
-p pop Population name (required)
-t num_threads Number of threads (required)
-c contig_db Contig database (required)
-l min_len Minimum length (optional, default: 2500)
This step took 6 minutes for all; CHST and COGE took about 3 min each.
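Profiling is done once per sample, so a wrapper like make-profile-db.sh presumably loops over the mapped BAM files. The sketch below prints one anvi-profile command per BAM file as a dry run. The function name, the bam_dir argument and the 03-PROFILES/<pop>/<sample> output layout are assumptions, as is reading the note above as "pass --cluster-contigs unless the population is all".

```shell
# Hypothetical dry-run sketch: one anvi-profile call per sorted, indexed BAM.
# Note: anvi'o sample names may only contain letters, digits and underscores.
print_profile_steps() {
    local pop="$1" contig_db="$2" bam_dir="$3" threads="$4" min_len="${5:-2500}"
    local bam sample cluster_flag=""
    # Single-population runs are clustered; merged "all" runs are not.
    if [ "${pop}" != "all" ]; then cluster_flag="--cluster-contigs"; fi
    for bam in "${bam_dir}"/*.bam; do
        if [ ! -e "${bam}" ]; then continue; fi    # glob matched nothing
        sample=$(basename "${bam}" .bam)
        # cluster_flag is intentionally unquoted so an empty value disappears.
        echo anvi-profile -i "${bam}" -c "${contig_db}" \
            -o "03-PROFILES/${pop}/${sample}" -S "${sample}" \
            --min-contig-length "${min_len}" -T "${threads}" ${cluster_flag}
    done
}

# Example (dry run): print_profile_steps COGE 02-CONTIG-DB/COGE/COGE.db path/to/bams 12
```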
When multiple populations are used, we need to merge the PROFILE.db files together.
Here is the command I used.
anvi-merge \
-c 02-CONTIG-DB/all/all.db \
-S all \
-o 04-MERGED-PROFILES/all/ \
03-PROFILES/all/*/PROFILE.db \
&> logs/merge-all-profiles.log
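Since anvi-merge is meant for two or more single profiles, a small pre-flight check can save a failed run. The sketch below collects the PROFILE.db files for a population and prints the merge command only when at least two are found. The function name print_merge_step is hypothetical; the paths mirror the command above.

```shell
# Hypothetical pre-flight check: print the merge command (dry run) or explain
# why merging is skipped.
print_merge_step() {
    local pop="$1" profiles_root="$2" p
    set --                        # reuse "$@" as the list of profiles found
    for p in "${profiles_root}/${pop}"/*/PROFILE.db; do
        if [ -e "${p}" ]; then set -- "$@" "${p}"; fi
    done
    if [ "$#" -lt 2 ]; then
        echo "Found $# profile(s) for ${pop}; nothing to merge." >&2
        return 1
    fi
    echo anvi-merge -c "02-CONTIG-DB/${pop}/${pop}.db" -S "${pop}" \
        -o "04-MERGED-PROFILES/${pop}/" "$@"
}

# Example (dry run): print_merge_step all 03-PROFILES
```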
The common approach is to use the binners CONCOCT, METABAT2, and MAXBIN2 and then refine their bins with DASTOOL. Thus, the trick here is to bin outside of anvio and then import the refined bins into anvio. Furthermore, we can check the quality of these bins using CheckM2 and BUSCO.
UPPMAX has these modules:
CONCOCT/1.1.0, which is the latest version.
MetaBat/2.12.1, which is not the latest version but has the METABAT2 executable. The latest version is 2.15.2.
MaxBin/2.2.7, which is the latest version of MaxBin2.
DASTOOL and CheckM2 are not available (they will have to be installed manually).
metaWRAP/1.3.2, which is a wrapper that includes the three binners but uses neither CheckM2 for quality control nor DASTOOL for refinement.
CONCOCT has a detailed and well-documented manual, whereas MaxBin2 and MetaBat2 have lackluster documentation. Hence, I will use metaWRAP to bin, then refine the bins with metaWRAP's bin refinement module and check their quality with CheckM2. DAS TOOL uses an aggregation method which increases completeness but incorporates contamination. metaWRAP uses a hybrid approach to get better results.
NOTE: This is a SLURM script, so one has to manually change the script to incorporate more resources. Otherwise, all that is needed is to add the POP argument.
EXAMPLE
sbatch scripts/mag-curation/metawrap-bin.sh CHST
The script:
#!/bin/bash
#SBATCH --job-name metaWRAP-CHST
#SBATCH -A naiss2024-22-580
#SBATCH -p node -n 1
#SBATCH -C mem256GB
#SBATCH -t 05:00:00
#SBATCH --output=slurm-logs/binning/SLURM-%j-metaWRAP-binning-CHST.out
#SBATCH --error=slurm-logs/binning/SLURM-%j-metaWRAP-binning-CHST.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL
# Load in modules
module load bioinfo-tools
module load metaWRAP/1.3.2
# Population name (first argument); set before it is used in the log line
POP=$1
# Start time and date
echo "$(date) ${POP} [Start]"
# Move to the anvio working directory
cd /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/07-MAG
# Paths and variables
NUM_THREADS=16
OUT_DIR=05-metaWRAP/${POP}
ASSEMBLY=00-FIXED-ASSEMBLY/${POP}/${POP}-contigs.fa
R1=/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/05-CLEAN-MERGED/${POP}_R1-clean.fq.gz
R2=/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/05-CLEAN-MERGED/${POP}_R2-clean.fq.gz
TMP_R1=${OUT_DIR}/${POP}-tmp-reads/${POP}_1.fastq
TMP_R2=${OUT_DIR}/${POP}-tmp-reads/${POP}_2.fastq
# Generate folder for metaWRAP
if [ ! -d "${OUT_DIR}" ]; then
mkdir -p ${OUT_DIR}
fi
# Generate a temp directory inside metawrap folder.
mkdir -p ${OUT_DIR}/${POP}-tmp-reads/
# metawrap requires *_1.fastq as format. Hence, tmp versions of files
zcat ${R1} > ${TMP_R1}
zcat ${R2} > ${TMP_R2}
metawrap binning \
-a ${ASSEMBLY} \
-o ${OUT_DIR} \
-t ${NUM_THREADS} \
-m 256 \
--metabat2 --maxbin2 --concoct \
${TMP_R1} ${TMP_R2}
# Remove the temporary files
rm -r ${OUT_DIR}/${POP}-tmp-reads/
# End time and date
echo "$(date) ${POP} [End]"
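The read-staging step of the script can be isolated into a small function, which makes it easy to try on toy data before submitting a job. This is a sketch, not the script's actual code: the function name stage_reads is hypothetical, and it uses gzip -dc, which is equivalent to zcat for .gz files.

```shell
# Hypothetical helper: decompress paired reads into the <pop>_1.fastq and
# <pop>_2.fastq names that metaWRAP expects.
stage_reads() {
    local r1="$1" r2="$2" tmp_dir="$3" pop="$4"
    mkdir -p "${tmp_dir}" || return 1
    gzip -dc "${r1}" > "${tmp_dir}/${pop}_1.fastq" || return 1
    gzip -dc "${r2}" > "${tmp_dir}/${pop}_2.fastq" || return 1
}

# Demo with toy reads in a throwaway directory.
work=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' | gzip -c > "${work}/CHST_R1-clean.fq.gz"
printf '@r1\nTGCA\n+\nIIII\n' | gzip -c > "${work}/CHST_R2-clean.fq.gz"
stage_reads "${work}/CHST_R1-clean.fq.gz" "${work}/CHST_R2-clean.fq.gz" \
    "${work}/CHST-tmp-reads" CHST
ls "${work}/CHST-tmp-reads"
rm -r "${work}"
```

Returning non-zero on a failed decompression lets the calling script stop before metaWRAP is started on incomplete reads.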
Structured standard
After trying several approaches, I am now confident in one. To set a standard, I will redo the scripts so that the same analysis and directory structure are kept regardless of which sample approach is chosen (all/CH/pop). The effort will be worth it: the analysis becomes more reproducible and easier to use, also for future approaches.
This revamp only touches the steps after the metagenome assembly, meaning the anvio steps and onward.
MAG curation
Refining bins is the most complex part of MAG curation. This is why anvio is key: it makes it possible to visually inspect the data.
Furthermore, as anvio is not installed on UPPMAX, the mamba environment has to be activated. Thus, several of the scripts below must be run in an interactive session, as I could not figure out how to activate the mamba environment through sbatch.
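One commonly used way to make `conda activate` work inside a non-interactive batch job is to source conda's shell setup file first, since such shells do not read ~/.bashrc. The sketch below assumes a conda executable is on PATH in the job and that the anvio-8 environment is visible to it; whether this holds on UPPMAX has not been verified here. The function name activate_env is hypothetical.

```shell
#!/bin/bash
# The #SBATCH lines of the job script would go here.

# Hypothetical helper: activate a conda/mamba environment in a batch job.
# Assumption: `conda` is on PATH; returns non-zero otherwise.
activate_env() {
    local env_name="$1"
    if ! command -v conda >/dev/null 2>&1; then
        echo "conda not found on PATH; cannot activate ${env_name}" >&2
        return 1
    fi
    # Load conda's shell functions explicitly, then activate.
    . "$(conda info --base)/etc/profile.d/conda.sh"
    conda activate "${env_name}"
}

# Usage inside the job script (commented out in this sketch):
# activate_env anvio-8 || exit 1
# bash ../scripts/mag-curation/make-contig-db.sh -t 12 -p COGE ...
```

An alternative that avoids activation altogether is `conda run -n anvio-8 <command>`, again assuming conda is available in the job.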