ndreey commented 5 days ago

Structured standard

After trying several different approaches, I am now confident in one. To set a standard, I will redo the scripts so that the same analysis and structure are maintained regardless of which sample approach is chosen (all/CH/pop). The effort will be worth it for a more reproducible and easier-to-use analysis, and for future approaches as well.

This revamp will only touch on the steps after the metagenome assembly. Meaning, the anvio steps and forward.

MAG curation

Refining bins is the most complex part of MAG curation. That is why anvio is key: it lets us visually inspect the data.

Furthermore, as anvio is not installed on UPPMAX, the mamba environment has to be activated. Thus, several of the scripts below must be run in an interactive session, as I could not figure out how to activate the mamba environment through sbatch.
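A pattern that often works for activating a conda/mamba environment inside a batch job is to load the shell hook in the script before activating. Below is a minimal sketch, not tested on UPPMAX. It assumes `conda` is on the PATH of the batch job (mamba installs based on Miniforge usually provide it). The job file name `make-contig-db-job.sh`, the partition and the time limit are hypothetical.

```shell
# Write a hypothetical batch script that activates the anvio-8 environment itself.
cat > make-contig-db-job.sh <<'EOF'
#!/bin/bash
#SBATCH -A naiss2024-22-580
#SBATCH -p core -n 12
#SBATCH -t 01:00:00

# Assumption: conda is available in the PATH of the job.
# The hook defines the shell functions that "conda activate" relies on,
# which a non-interactive batch shell does not load by default.
eval "$(conda shell.bash hook)"
conda activate anvio-8

bash ../scripts/mag-curation/make-contig-db.sh \
    -t 12 \
    -p COGE \
    -m 00-FIXED-ASSEMBLY/COGE/COGE-contigs.fa \
    -s Ribosomal_S3_C
EOF

# Check the syntax without running anything.
bash -n make-contig-db-job.sh && echo "syntax OK"
```

If the syntax check passes, the job could be submitted from 07-MAG with `sbatch make-contig-db-job.sh`.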

ndreey commented 5 days ago

ANVIO

Step 1: Reformat metagenome assembly

Create a new directory (07-MAG) and move into it. Create a log directory to store .log files. Activate the `anvio-8` mamba environment and run the script below. Adjust the parameters for your session / use.

bash ../scripts/mag-curation/reformat-fasta.sh \
    -p all \
    -t 12 \
    -m all-hybrid &> logs/reformat-assembly-all.log

Usage: ../scripts/mag-curation/reformat-fasta.sh [-p pop] [-m metagenome] [-t num_threads]
  -p pop            Population name (required)
  -m metagenome     Metagenome assembly name (required)
  -t num_threads    Number of threads (required)
  -l min_length     Minimum contig length (optional, default: 2500)

This step took 13 minutes for all; CHST and COGE took about 2 minutes each.
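To illustrate what the `-l min_length` filter does, here is a toy sketch in plain awk. It is not the code inside reformat-fasta.sh, which presumably wraps an anvio program. The file names and the `c_000000000001` header style are assumptions made for this example.

```shell
# Toy assembly: three contigs of length 12, 4 and 10 (the last one wrapped over two lines).
cat > toy-contigs.fa <<'EOF'
>k141_1 flag=1 len=12
ACGTACGTACGT
>k141_2 flag=0 len=4
ACGT
>k141_3 flag=1 len=10
GGGGG
CCCCC
EOF

MIN_LEN=8

# Keep contigs of at least MIN_LEN bases and give them simple, numbered headers.
awk -v min_len="${MIN_LEN}" '
    # Print the contig collected so far if it is long enough.
    function emit() {
        if (seq != "" && length(seq) >= min_len) {
            n++
            printf(">c_%012d\n%s\n", n, seq)
        }
    }
    /^>/ { emit(); seq = ""; next }   # new header: flush the previous contig
    { seq = seq $0 }                  # sequence line: append to the current contig
    END { emit() }                    # flush the last contig
' toy-contigs.fa > toy-contigs-fixed.fa

cat toy-contigs-fixed.fa
```

With `MIN_LEN=8` the 4 bp contig is dropped and the two remaining contigs are renamed and written on one line each.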

Step 2: Generate contig database

This will create the contig database that anvio requires and populate it with gene features.

As both all and CHST had a higher frequency of Ribosomal_S3_C hits, we specify that anvio should use this SCG for taxonomy estimation.

bash ../scripts/mag-curation/make-contig-db.sh \
    -t 12 \
    -p COGE \
    -m 00-FIXED-ASSEMBLY/COGE/COGE-contigs.fa \
    -s Ribosomal_S3_C &> logs/make-contig-db-COGE.log

Creates an anvio contig database
Usage: ../scripts/mag-curation/make-contig-db.sh [-p pop] [-m metagenome] [-t num_threads]
  -p pop            Population name (required)
  -m metagenome     Metagenome assembly name (required)
  -t num_threads    Number of threads (required)
  -s scg_name       SCG name to use for taxonomy (optional)

This step took 13 minutes for all; CHST and COGE took about 5 minutes each.
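For orientation, the anvio calls that a contig database step like this typically consists of can be sketched as below. This is an assumption about the workflow, not a copy of make-contig-db.sh. The `run` helper and `DRY_RUN` variable are hypothetical and only print the commands, so the sketch can be read and executed without anvio installed.

```shell
# Hypothetical helper: print commands instead of running them unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "${DRY_RUN}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

POP=COGE
THREADS=12
SCG=Ribosomal_S3_C
ASSEMBLY="00-FIXED-ASSEMBLY/${POP}/${POP}-contigs.fa"
DB="02-CONTIG-DB/${POP}/${POP}.db"

# Create the contigs database (gene calling happens here).
run anvi-gen-contigs-database -f "${ASSEMBLY}" -o "${DB}" -n "${POP}" -T "${THREADS}"

# Annotate single-copy core genes with the default HMM collections.
run anvi-run-hmms -c "${DB}" -T "${THREADS}"

# Assign taxonomy to the SCGs, then estimate taxonomy from one chosen SCG.
run anvi-run-scg-taxonomy -c "${DB}" -T "${THREADS}"
run anvi-estimate-scg-taxonomy -c "${DB}" --metagenome-mode \
    --scg-name-for-metagenome-mode "${SCG}"
```

Setting `DRY_RUN=0` in an activated `anvio-8` session would execute the commands instead of printing them.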

Step 3: Generate a profile database

In contrast to the contigs-db, an anvi’o single-profile-db stores sample-specific information about contigs. Profiling a BAM file with anvi’o using anvi-profile creates a single profile that reports properties for each contig in a single sample based on mapping results.

The profile will include:

Coverage statistics
SNVs

We can then add the binning results to the profile.

NOTE: At the moment, if all is not set, each profile will be clustered. For further analysis where we want to inspect, for example, all CH vs CO, the code must be updated. At the moment the script can handle all or single populations.

bash ../scripts/mag-curation/make-profile-db.sh \
    -p COGE \
    -t 12 \
    -c 02-CONTIG-DB/COGE/COGE.db \
    &> logs/make-profile-db-COGE.log

Creates an anvio profile database
Usage: ../scripts/mag-curation/make-profile-db.sh [-p pop] [-t num_threads] [-c contig_db] [-l min_len]
  -p pop            Population name (required)
  -t num_threads    Number of threads (required)
  -c contig_db      Contig database (required)
  -l min_len        Minimum length (optional, default: 2500)

This step took 6 minutes for all; CHST and COGE took about 3 minutes each.
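A per-sample profiling loop could look roughly like the sketch below. The BAM directory `bam/COGE` is a hypothetical location, the BAM files are assumed to be sorted and indexed, and the `run` helper only prints the commands. The hyphen replacement is there because anvio is strict about the characters allowed in sample names.

```shell
# Hypothetical helper: print commands instead of running them unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "${DRY_RUN}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# profile_population POP BAM_DIR [THREADS] [MIN_LEN]
# Builds one anvi-profile call per BAM file found in BAM_DIR.
profile_population() {
    pop=$1
    bam_dir=$2
    threads=${3:-12}
    min_len=${4:-2500}
    for bam in "${bam_dir}"/*.bam; do
        # Skip the unexpanded glob when no BAM files exist.
        [ -e "${bam}" ] || continue
        # Use the file name as sample name, with hyphens replaced by underscores.
        sample=$(basename "${bam}" .bam | tr '-' '_')
        run anvi-profile \
            -i "${bam}" \
            -c "02-CONTIG-DB/${pop}/${pop}.db" \
            -o "03-PROFILES/${pop}/${sample}" \
            -S "${sample}" \
            --min-contig-length "${min_len}" \
            -T "${threads}"
    done
}

profile_population COGE bam/COGE
```

Each sample ends up in its own directory under 03-PROFILES, which matches the `03-PROFILES/all/*/PROFILE.db` layout used when merging.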

Step 4: Merging profiles

For cases where multiple populations are used, we need to merge the PROFILE.db files.

Here is the command I used.

anvi-merge \
    -c 02-CONTIG-DB/all/all.db \
    -S all \
    -o 04-MERGED-PROFILES/all/ \
    03-PROFILES/all/*/PROFILE.db \
    &> logs/merge-all-profiles.log
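Since anvi-merge expects more than one profile, a small guard can decide whether merging is needed at all. A minimal sketch, where the `merge_profiles` function and the `run` dry-run helper are hypothetical and the directory layout is the one used above:

```shell
# Hypothetical helper: print commands instead of running them unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "${DRY_RUN}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# merge_profiles POP
# Merges all single profiles of POP if there are at least two of them.
merge_profiles() {
    pop=$1
    # Collect the per-sample profiles as positional parameters of this function.
    set -- 03-PROFILES/"${pop}"/*/PROFILE.db
    if [ ! -e "$1" ]; then
        echo "No PROFILE.db found for ${pop}"
    elif [ "$#" -eq 1 ]; then
        echo "Only one profile for ${pop}, nothing to merge"
    else
        run anvi-merge \
            -c "02-CONTIG-DB/${pop}/${pop}.db" \
            -S "${pop}" \
            -o "04-MERGED-PROFILES/${pop}/" \
            "$@"
    fi
}

merge_profiles all
```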

ndreey commented 4 days ago

metaWRAP

Binning

The common approach is to use the binners CONCOCT, METABAT2, and MAXBIN2 and then refine the bins with DASTOOL. The trick here is to bin outside of anvio and then import the refined bins into anvio. Furthermore, we can check the quality of these bins using CheckM2 and BUSCO.

UPPMAX has these modules:

CONCOCT/1.1.0 which is the latest version.
MetaBat/2.12.1 which is not the latest version but has the METABAT2 executable. The latest version is 2.15.2.
MaxBin/2.2.7 which is the latest version of MaxBin2.
DASTOOL and CheckM2 are not available (will have to be installed manually).
metaWRAP/1.3.2 which is a wrapper that includes the three binners but uses neither CheckM2 for quality control nor DASTOOL for refinement.

CONCOCT has a detailed and well-documented manual, whereas MaxBin2 and MetaBat2 have lackluster documentation. Hence, I will use metaWRAP to bin, then refine the bins with metaWRAP's bin refinement module and check them with CheckM2. DAS TOOL uses an aggregation method that increases completeness but incorporates contamination. metaWRAP uses a hybrid approach to get better results.

NOTE: This is a SLURM script, so one has to manually edit the script to request more resources. Otherwise, all that is needed is to add the POP argument.

EXAMPLE

sbatch scripts/mag-curation/metawrap-bin.sh CHST

The script:

#!/bin/bash

#SBATCH --job-name metaWRAP-CHST
#SBATCH -A naiss2024-22-580
#SBATCH -p node -n 1
#SBATCH -C mem256GB 
#SBATCH -t 05:00:00
#SBATCH --output=slurm-logs/binning/SLURM-%j-metaWRAP-binning-CHST.out
#SBATCH --error=slurm-logs/binning/SLURM-%j-metaWRAP-binning-CHST.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL

# Load in modules
module load bioinfo-tools
module load metaWRAP/1.3.2

# Population name (first argument); set before it is first used below
POP=$1

# Start time and date
echo "$(date)       ${POP}     [Start]"

# Move to the anvio working directory
cd /crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/07-MAG

# Paths and variables
NUM_THREADS=16
OUT_DIR=05-metaWRAP/${POP}
ASSEMBLY=00-FIXED-ASSEMBLY/${POP}/${POP}-contigs.fa
R1=/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/05-CLEAN-MERGED/${POP}_R1-clean.fq.gz
R2=/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/05-CLEAN-MERGED/${POP}_R2-clean.fq.gz
TMP_R1=${OUT_DIR}/${POP}-tmp-reads/${POP}_1.fastq
TMP_R2=${OUT_DIR}/${POP}-tmp-reads/${POP}_2.fastq

# Generate folder for metaWRAP
if [ ! -d "${OUT_DIR}" ]; then
    mkdir -p ${OUT_DIR}
fi

# Generate a temp directory inside metawrap folder.
mkdir -p ${OUT_DIR}/${POP}-tmp-reads/

# metawrap requires *_1.fastq as format. Hence, tmp versions of files
zcat ${R1} > ${TMP_R1}
zcat ${R2} > ${TMP_R2}

metawrap binning \
    -a ${ASSEMBLY} \
    -o ${OUT_DIR} \
    -t ${NUM_THREADS} \
    -m 256 \
    --metabat2 --maxbin2 --concoct \
    ${TMP_R1} ${TMP_R2}

# Remove the temporary files
rm -r ${OUT_DIR}/${POP}-tmp-reads/

# End time and date
echo "$(date)  ${POP}    [End]"
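
Once binning has finished, the refinement and quality control described above could be sketched as follows. The completeness/contamination thresholds (50/10), the `refinement` and `checkm2` output folder names, and the `run` dry-run helper are assumptions for illustration. CheckM2 would have to be installed separately, as it is not available as a module.

```shell
# Hypothetical helper: print commands instead of running them unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "${DRY_RUN}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

POP=CHST
NUM_THREADS=16
OUT_DIR="05-metaWRAP/${POP}"
MIN_COMPLETION=50       # assumed threshold, adjust as needed
MAX_CONTAMINATION=10    # assumed threshold, adjust as needed
REFINED="${OUT_DIR}/refinement/metawrap_${MIN_COMPLETION}_${MAX_CONTAMINATION}_bins"

# Consolidate the three bin sets produced by "metawrap binning".
run metawrap bin_refinement \
    -o "${OUT_DIR}/refinement" \
    -t "${NUM_THREADS}" \
    -A "${OUT_DIR}/metabat2_bins" \
    -B "${OUT_DIR}/maxbin2_bins" \
    -C "${OUT_DIR}/concoct_bins" \
    -c "${MIN_COMPLETION}" \
    -x "${MAX_CONTAMINATION}"

# Independent quality check of the refined bins with CheckM2.
run checkm2 predict \
    --threads "${NUM_THREADS}" \
    -x fa \
    --input "${REFINED}" \
    --output-directory "${OUT_DIR}/checkm2"
```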

ndreey / CONURA_WGS

REVAMP: Structured standard for MAG curation for all approaches #48