Open ndreey opened 1 year ago
metaQUAST was released in the third version of QUAST in 2016, and to understand metaQUAST, a primer in QUAST is required.
QUAST is a software tool that evaluates genome assembly quality with or without references. The original version used Nucmer from MUMmer, but the latest version uses the popular tool bowtie2 to compare the assembly to a reference genome and generate metrics.
metaQUAST is a module that comes with QUAST, which is developed to evaluate metagenome assemblies.
The four major steps are a combination of new implementations and using QUAST in an intuitive way.
CAMISIM
generates a gold standard assembly (GSA) when generating the NGS data by combining all reads that had at least 1 coverage on the reference genomes. With this GSA, a gold standard can be set. By running metaquast
with the GSA as input against the reference genomes, a benchmark can be set of how the most optimal assembly would perform. In my case, i can use the GSA evaluation against the two different assemblies i created (k21, meta-sens) to see which one is the best.
The accuracy of the assembled contigs is typically evaluated by aligning them to a reference genome, which is a well-known sequence that closely matches the metagenomic sample. However, in metagenomic samples, the presence of ambiguous regions can make it difficult to accurately align the contigs to the reference genome.
The --unique-mapping
flag is an option in MetaQUAST that addresses this issue by restricting each contig to map to only a single position in the reference genome. By default, MetaQUAST uses the --ambiguity-usage
'all' option, which allows each contig to map to all possible positions in the reference genome. However, in some cases, this can cause an increase in the number of mismatches, especially in repeat regions that are almost always inexact due to accumulated SNPs, indels, etc.
Using the --unique-mapping
flag can limit the number of mismatches and thus improve the accuracy of the evaluation by restricting each contig to map to only one position in the reference genome. However, it should be used with caution because some contigs may still produce multiple distinct mappings to different references in step 2, especially in the case of closely-related strains. In such cases, using the --reuse-combined-alignments
flag can exclude this effect, as it speeds up the computation and may change the step 2 results.
The flag, --no-icarus
stops metaQUAST to generate an interactive viewer, which is a computational resourceful task that we are not interested in.
I found two different commands CAMI used.
In the Supplementary information from the main paper they used this.
metaquast --no-icarus \
--reuse-combined-alignments \
--unique-mapping \
-t 28 \
-o ../strmgCAMI2_co_assembly_metaquast-5.1.0rc1 \
-r `cat ../refs` \
../strmgCAMI2_pooled/GS_*
However, in the Tutorial: assessing metagenomics software with the CAMI benchmarking toolkit paper they used this.
metaquast --no-icarus \
--unique-mapping \
-t 24 \
-o /path/to/output_dir \
-r /path/to/set0-9/ref-genomes \
../contigs/
Where the use of --reuse-combined-alignments
differed.
This is the array job I ran to benchmark GSA and evaluate k21
and meta-sens
.
I did not run --reuse-combined-alignments
as i recently learned about it.
#!/bin/bash
#SBATCH --job-name=metaQUAST_benchmark # name that will show up in the queue
#SBATCH --array=1-3
#SBATCH --output=slurm-benchmark-%j.out # filename of the output; the %j is equal to jobID
#SBATCH --error=slurm-benchmark-%j.err #
#SBATCH --partition=cpuqueue #
#SBATCH --ntasks=1 # number of tasks (analyses) to run
#SBATCH --cpus-per-task=12 # the number of threads allocated to each task
#SBATCH --mem-per-cpu=16G # memory per cpu-core
#SBATCH --time=08:15:00 # time for analysis (day-hour:min:sec)
#SBATCH --mail-type=ALL # send all type of email
#SBATCH --mail-user=andre.bourbonnais@sund.ku.dk
# I. Define directory names [DO NOT CHANGE]
# =========================================
# get the directories
submitdir=${SLURM_SUBMIT_DIR}
workdir=${TMPDIR}
jobid=${SLURM_ARRAY_TASK_ID}
# Information
echo "$(date) Submitted from ${submitdir}"
echo "$(date) Accessed ${workdir}"
echo "$(date) ArrayID: ${jobid}"
# 1. Lock and load module and data
# ============================================
module load quast
ref_genomes=${submitdir}/bsc_thesis/data/references/reference_genomes
gsa=${submitdir}/bsc_thesis/gold_standards/gsa/assembly
k21=${submitdir}/bsc_thesis/processed/assembly/k21
meta_sens=${submitdir}/bsc_thesis/processed/assembly/meta_sens
# 2. Execute [MODIFY COMPLETELY TO YOUR NEEDS]
# ============================================
# Decides which assemblies to evaluate
if [ $jobid == 1 ]; then
assemblies=$gsa
elif [ $jobid == 2 ]; then
assemblies=$k21
else
assemblies=$meta_sens
fi
# Different host-contamination level
suffixes=("gsa" "k21" "meta_sens")
# Define variable based on array id
suffix=${suffixes[$jobid-1]}
metaquast -o metaquast_${suffix} \
-r ${ref_genomes}\
--threads 12 \
--no-icarus \
--unique-mapping \
${assemblies}/*
The benchmark of GSA was the quickest while evaluation of meta-sens
took ~3x more.
JobID JobName Partition AllocCPUS State ExitCode Elapsed
------------ ---------- ---------- ---------- ---------- -------- ----------
894680_1 metaQUAST+ cpuqueue 12 COMPLETED 0:0 01:49:41
894680_1.ba+ batch 12 COMPLETED 0:0 01:49:41
894680_2 metaQUAST+ cpuqueue 12 COMPLETED 0:0 04:25:33
894680_2.ba+ batch 12 COMPLETED 0:0 04:25:33
894680_3 metaQUAST+ cpuqueue 12 COMPLETED 0:0 06:07:57
894680_3.ba+ batch 12 COMPLETED 0:0 06:07:57
To get a better evaluation of which assembly parameter was best i am running this script that will compare each assembly with its corresponding GSA.
#!/bin/bash
#SBATCH --job-name=gsaQUAST_benchmark # name that will show up in the queue
#SBATCH --array=1-11%5
#SBATCH --output=slurm-gsa-%j.out # filename of the output; the %j is equal to jobID
#SBATCH --error=slurm-gsa-%j.err #
#SBATCH --partition=cpuqueue #
#SBATCH --ntasks=1 # number of tasks (analyses) to run
#SBATCH --cpus-per-task=6 # the number of threads allocated to each task
#SBATCH --mem-per-cpu=8G # memory per cpu-core
#SBATCH --time=04:15:00 # time for analysis (day-hour:min:sec)
#SBATCH --mail-type=ALL # send all type of email
#SBATCH --mail-user=andre.bourbonnais@sund.ku.dk
# I. Define directory names [DO NOT CHANGE]
# =========================================
# get the directories
submitdir=${SLURM_SUBMIT_DIR}
workdir=${TMPDIR}
jobid=${SLURM_ARRAY_TASK_ID}
# Information
echo "$(date) Submitted from ${submitdir}"
echo "$(date) Accessed ${workdir}"
echo "$(date) ArrayID: ${jobid}"
# 1. Lock and load module and data
# ============================================
module load quast
gsa=${submitdir}/bsc_thesis/gold_standards/gsa/assembly
k21=${submitdir}/bsc_thesis/processed/assembly/k21
meta_sens=${submitdir}/bsc_thesis/processed/assembly/meta_sens
# 2. Execute [MODIFY COMPLETELY TO YOUR NEEDS]
# ============================================
# Different host-contamination level
hc_level=("00" "01" "02" "03" "04" "05" "06" "07" "08" "09" "095")
# Define prefix based on array id
hc_prefix=${hc_level[$jobid-1]}
quast -o quast_${hc_prefix} \
-r ${gsa}/${hc_prefix}_* \
--threads 6 \
--no-icarus \
${k21}/${hc_prefix}_* \
${meta_sens}/${hc_prefix}_*
echo "$(date) HC: ${hc_prefix}"
This is the code they used in
CAMI II
( edit 2023-04-14 updated version can be found below, this was indeed wrong) I am running this atm to get a feel for the program. So far, i have had no success with
--unique-mapping
. The program stalls out in theRunning Contig analyzer...
step on the 02 alignment.