This is the code they used in CAMI II

metaquast -r /path/to/set0-9/ref-genomes \
    -t 24 --unique-mapping --no-icarus -o /path/to/output_dir \
    -l megahit-103-df,megahit-113-df,megahit113-ml,\
    megahit-113-ms,megahit-129-df,metaSPAdes \
    /path/to/megahit103-Sample0-9-default/final.contigs.fa \
    /path/to/megahit113-Sample0-9-default/final.contigs.fa \
    /path/to/megahit113-Sample0-9-meta-large/final.contigs.fa \
    /path/to/megahit113-Sample0-9-metasensitive/final.contigs.fa \
    /path/to/megahit129-Sample0-9-default/final.contigs.fa \
    /path/to/metaSPAdes3130-Sample0-9/contigs.fasta

( edit 2023-04-14 updated version can be found below, this was indeed wrong) I am running this atm to get a feel for the program. So far, i have had no success with --unique-mapping. The program stalls out in the Running Contig analyzer... step on the 02 alignment.

metaquast -o quast_results \
          -r gsa \
          --threads 6 \
          -l 00,01,02 \
          --no-icarus \
          megahit_results/platanthera_mock_assembly/00/00.contigs.fa \
          megahit_results/platanthera_mock_assembly/01/01.contigs.fa \
          megahit_results/platanthera_mock_assembly/02/02.contigs.fa

metaQUAST

metaQUAST was released in the third version of QUAST in 2016, and to understand metaQUAST, a primer in QUAST is required.

QUAST is a software tool that evaluates genome assembly quality with or without references. The original version used Nucmer from MUMmer, but the latest version uses the popular tool bowtie2 to compare the assembly to a reference genome and generate metrics.

Importantly, QUAST can evaluate multiple assemblies against ONE REFERENCE

metaQUAST is a module that comes with QUAST, which is developed to evaluate metagenome assemblies.

Importantly, is can use MULTIPLE REFERENCES for alignment and can asses possible chimeric contigs.

The metaQUAST Pipeline

The four major steps are a combination of new implementations and using QUAST in an intuitive way.

All references are combined into one big reference file, which each assembly is aligned against. The original QUAST reported contigs that aligned to multiple locations (translocation) in the reference genome as ambigious alignments. In metaQUAST, these ambigious alignments can represent interspecies translocations by counting how the contigs align to multiple reference genomes.
All assembly contigs are then grouped based on where they mapped, contigs that mapped to multiple genomes goes into each genome group (unaligned contigs are a group in and of themselves).
The heavy lifting step where QUAST is run with each reference genome and the corresponding grouped contigs.
All QUAST results are then grouped and reports are generated.

Metrics

No. of contigs: The total number of contigs in the assembly
Largest contig: The length of the largest contig in the assembly
Total length: The total number of bases in the assembly.
Nx (0≤x≤100): N50 is Nx where x = 0.5. if I wanted N75 I would take sum of contigs * 0.75 and so on. It’s the measure of contiguity.
NGx, Genome Nx: Same as before, but instead of sum of contigs, it calculates based on the reference genome.
No. of misassemblies: “The number of misassemblies, using Plantagora’s definition. A misassembly breakpoint is a position in the assembled contig where the left flanking sequence aligns over 1kb away from the right flanking sequence on the reference, or they overlap by >1kb, or the flanking sequences align on opposite strands or different chromosomes”. Meaning, if QUAST notices that the assembled contig actually should be two different contigs or if two assembled contigs should actually be one contig, then it is defined as a misassembly.
No. of misassembled contigs: The number of contigs that contain misassembly breakpoints.
Misassembled contigs length: The total number fo bases contained in all contigs that have one or more misassemblies
No. of unaligned contigs: The number of contigs that have no alignment to the reference sequence.
No. of ambiguously mapped contigs: The number of contigs that have high-scoring reference alignments of equal quality in multiple locations on the reference genome.
Genome fraction (%): Or, reference coverage, measures how well a genome assembly matches up with a reference genome. It is calculated by taking the total number of aligned bases in the reference and dividing it by the size of the genome. Contigs from repeat regions may be counted multiple times as it aligns to multiple places.
Duplication ratio: The duplication ratio is a metric that compares the genome assembly to a reference genome and measures how much of the assembly is the same as the reference. If the ratio is greater than 1, it indicates that there are more repeated parts in the assembly than in the reference genome. This can happen if the assembler overestimates the number of repeat sequences or other similar regions in the genome. Duplication ratio = (Total number of aligned bases in the assembly) / (Total number of aligned bases in the reference genome)
GC (%): The GC content of the assembly.
No. of mismatches per 100kb: Average number of mismatches per 100 000 bases, but be aware it does not distinguish from single-nucleotide polymorphisms (SNPs) or SN errors (SNEs).
No. of indels per 100kb: Average number of indels per 100 000 aligned bases.
No. of interspecies translocations: Instead of having the location being on a different chromosome, MetaQUASAT can determine if assembly also algins on another reference genome.
No. of possibly misassembled contigs: refers to the number of contigs in the metagenomic assembly that may be incorrectly assembled due to the presence of both large aligned and unaligned fragments. These contigs may contain sequences from multiple species, leading to a potentially chimeric assembly. The term "chimeric" is relevant here because it refers to the presence of sequences from multiple sources in a single contig, which is a type of misassembly.

The CAMI2 Way

CAMISIM generates a gold standard assembly (GSA) when generating the NGS data by combining all reads that had at least 1 coverage on the reference genomes. With this GSA, a gold standard can be set. By running metaquast with the GSA as input against the reference genomes, a benchmark can be set of how the most optimal assembly would perform. In my case, i can use the GSA evaluation against the two different assemblies i created (k21, meta-sens) to see which one is the best.

The accuracy of the assembled contigs is typically evaluated by aligning them to a reference genome, which is a well-known sequence that closely matches the metagenomic sample. However, in metagenomic samples, the presence of ambiguous regions can make it difficult to accurately align the contigs to the reference genome.

The --unique-mapping flag is an option in MetaQUAST that addresses this issue by restricting each contig to map to only a single position in the reference genome. By default, MetaQUAST uses the --ambiguity-usage 'all' option, which allows each contig to map to all possible positions in the reference genome. However, in some cases, this can cause an increase in the number of mismatches, especially in repeat regions that are almost always inexact due to accumulated SNPs, indels, etc.

Using the --unique-mapping flag can limit the number of mismatches and thus improve the accuracy of the evaluation by restricting each contig to map to only one position in the reference genome. However, it should be used with caution because some contigs may still produce multiple distinct mappings to different references in step 2, especially in the case of closely-related strains. In such cases, using the --reuse-combined-alignments flag can exclude this effect, as it speeds up the computation and may change the step 2 results.

The flag, --no-icarus stops metaQUAST to generate an interactive viewer, which is a computational resourceful task that we are not interested in.

Command used

I found two different commands CAMI used.

In the Supplementary information from the main paper they used this.

metaquast --no-icarus \
    --reuse-combined-alignments \
    --unique-mapping \
    -t 28 \    
    -o ../strmgCAMI2_co_assembly_metaquast-5.1.0rc1 \
    -r `cat ../refs` \
    ../strmgCAMI2_pooled/GS_*

However, in the Tutorial: assessing metagenomics software with the CAMI benchmarking toolkit paper they used this.

metaquast --no-icarus \
    --unique-mapping \
    -t 24  \
    -o /path/to/output_dir \
    -r /path/to/set0-9/ref-genomes \
    ../contigs/

Where the use of --reuse-combined-alignments differed.

My benchmark as of 2023-04-14

This is the array job I ran to benchmark GSA and evaluate k21 and meta-sens. I did not run --reuse-combined-alignments as i recently learned about it.

#!/bin/bash

#SBATCH --job-name=metaQUAST_benchmark     # name that will show up in the queue
#SBATCH --array=1-3
#SBATCH --output=slurm-benchmark-%j.out             # filename of the output; the %j is equal to jobID
#SBATCH --error=slurm-benchmark-%j.err              #
#SBATCH --partition=cpuqueue              #
#SBATCH --ntasks=1                        # number of tasks (analyses) to run
#SBATCH --cpus-per-task=12              # the number of threads allocated to each task
#SBATCH --mem-per-cpu=16G                  # memory per cpu-core
#SBATCH --time=08:15:00                   # time for analysis (day-hour:min:sec)
#SBATCH --mail-type=ALL                   # send all type of email
#SBATCH --mail-user=andre.bourbonnais@sund.ku.dk

# I. Define directory names [DO NOT CHANGE]
# =========================================

# get the directories
submitdir=${SLURM_SUBMIT_DIR}
workdir=${TMPDIR}
jobid=${SLURM_ARRAY_TASK_ID}

# Information
echo "$(date)    Submitted from ${submitdir}"
echo "$(date)    Accessed ${workdir}"
echo "$(date)    ArrayID: ${jobid}"

# 1. Lock and load module and data
# ============================================
module load quast

ref_genomes=${submitdir}/bsc_thesis/data/references/reference_genomes
gsa=${submitdir}/bsc_thesis/gold_standards/gsa/assembly
k21=${submitdir}/bsc_thesis/processed/assembly/k21
meta_sens=${submitdir}/bsc_thesis/processed/assembly/meta_sens

# 2. Execute [MODIFY COMPLETELY TO YOUR NEEDS]
# ============================================

# Decides which assemblies to evaluate
if [ $jobid == 1 ]; then
  assemblies=$gsa
elif [ $jobid == 2 ]; then
  assemblies=$k21
else
  assemblies=$meta_sens
fi

# Different host-contamination level
suffixes=("gsa" "k21" "meta_sens")
# Define variable based on array id
suffix=${suffixes[$jobid-1]}

metaquast -o metaquast_${suffix} \
          -r ${ref_genomes}\
          --threads 12 \
          --no-icarus \
          --unique-mapping \
          ${assemblies}/*

The benchmark of GSA was the quickest while evaluation of meta-sens took ~3x more.

       JobID    JobName  Partition  AllocCPUS      State ExitCode    Elapsed
------------ ---------- ---------- ---------- ---------- -------- ----------
894680_1     metaQUAST+   cpuqueue         12  COMPLETED      0:0   01:49:41
894680_1.ba+      batch                    12  COMPLETED      0:0   01:49:41
894680_2     metaQUAST+   cpuqueue         12  COMPLETED      0:0   04:25:33
894680_2.ba+      batch                    12  COMPLETED      0:0   04:25:33
894680_3     metaQUAST+   cpuqueue         12  COMPLETED      0:0   06:07:57
894680_3.ba+      batch                    12  COMPLETED      0:0   06:07:57

EXPLORATORY

To get a better evaluation of which assembly parameter was best i am running this script that will compare each assembly with its corresponding GSA.

#!/bin/bash

#SBATCH --job-name=gsaQUAST_benchmark     # name that will show up in the queue
#SBATCH --array=1-11%5
#SBATCH --output=slurm-gsa-%j.out             # filename of the output; the %j is equal to jobID
#SBATCH --error=slurm-gsa-%j.err              #
#SBATCH --partition=cpuqueue              #
#SBATCH --ntasks=1                        # number of tasks (analyses) to run
#SBATCH --cpus-per-task=6              # the number of threads allocated to each task
#SBATCH --mem-per-cpu=8G                  # memory per cpu-core
#SBATCH --time=04:15:00                   # time for analysis (day-hour:min:sec)
#SBATCH --mail-type=ALL                   # send all type of email
#SBATCH --mail-user=andre.bourbonnais@sund.ku.dk

# I. Define directory names [DO NOT CHANGE]
# =========================================

# get the directories
submitdir=${SLURM_SUBMIT_DIR}
workdir=${TMPDIR}
jobid=${SLURM_ARRAY_TASK_ID}

# Information
echo "$(date)    Submitted from ${submitdir}"
echo "$(date)    Accessed ${workdir}"
echo "$(date)    ArrayID: ${jobid}"

# 1. Lock and load module and data
# ============================================
module load quast

gsa=${submitdir}/bsc_thesis/gold_standards/gsa/assembly
k21=${submitdir}/bsc_thesis/processed/assembly/k21
meta_sens=${submitdir}/bsc_thesis/processed/assembly/meta_sens

# 2. Execute [MODIFY COMPLETELY TO YOUR NEEDS]
# ============================================

# Different host-contamination level
hc_level=("00" "01" "02" "03" "04" "05" "06" "07" "08" "09" "095")

# Define prefix based on array id
hc_prefix=${hc_level[$jobid-1]}

quast -o quast_${hc_prefix} \
        -r ${gsa}/${hc_prefix}_* \
        --threads 6 \
        --no-icarus \
        ${k21}/${hc_prefix}_* \
        ${meta_sens}/${hc_prefix}_*

echo "$(date)    HC: ${hc_prefix}"

ndreey / ghost-magnet

Benchmark: metaQUAST #59