ndreey / ghost-magnet

Molecular Bioinformatics BSc thesis project at University of Skövde
MIT License
1 stars 0 forks source link

Benchmark: metaQUAST #59

Open ndreey opened 1 year ago

ndreey commented 1 year ago

This is the code they used in CAMI II

metaquast -r /path/to/set0-9/ref-genomes \
    -t 24 --unique-mapping --no-icarus -o /path/to/output_dir \
    -l megahit-103-df,megahit-113-df,megahit113-ml,\
    megahit-113-ms,megahit-129-df,metaSPAdes \
    /path/to/megahit103-Sample0-9-default/final.contigs.fa \
    /path/to/megahit113-Sample0-9-default/final.contigs.fa \
    /path/to/megahit113-Sample0-9-meta-large/final.contigs.fa \
    /path/to/megahit113-Sample0-9-metasensitive/final.contigs.fa \
    /path/to/megahit129-Sample0-9-default/final.contigs.fa \
    /path/to/metaSPAdes3130-Sample0-9/contigs.fasta

( edit 2023-04-14 updated version can be found below, this was indeed wrong) I am running this atm to get a feel for the program. So far, i have had no success with --unique-mapping. The program stalls out in the Running Contig analyzer... step on the 02 alignment.

metaquast -o quast_results \
          -r gsa \
          --threads 6 \
          -l 00,01,02 \
          --no-icarus \
          megahit_results/platanthera_mock_assembly/00/00.contigs.fa \
          megahit_results/platanthera_mock_assembly/01/01.contigs.fa \
          megahit_results/platanthera_mock_assembly/02/02.contigs.fa
ndreey commented 1 year ago

metaQUAST

metaQUAST was released in the third version of QUAST in 2016, and to understand metaQUAST, a primer in QUAST is required.

QUAST is a software tool that evaluates genome assembly quality with or without references. The original version used Nucmer from MUMmer, but the latest version uses the popular tool bowtie2 to compare the assembly to a reference genome and generate metrics.

metaQUAST is a module that comes with QUAST, which is developed to evaluate metagenome assemblies.

The metaQUAST Pipeline

The four major steps are a combination of new implementations and using QUAST in an intuitive way.

  1. All references are combined into one big reference file, which each assembly is aligned against. The original QUAST reported contigs that aligned to multiple locations (translocation) in the reference genome as ambigious alignments. In metaQUAST, these ambigious alignments can represent interspecies translocations by counting how the contigs align to multiple reference genomes.
  2. All assembly contigs are then grouped based on where they mapped, contigs that mapped to multiple genomes goes into each genome group (unaligned contigs are a group in and of themselves).
  3. The heavy lifting step where QUAST is run with each reference genome and the corresponding grouped contigs.
  4. All QUAST results are then grouped and reports are generated.

Metrics

ndreey commented 1 year ago

The CAMI2 Way

CAMISIM generates a gold standard assembly (GSA) when generating the NGS data by combining all reads that had at least 1 coverage on the reference genomes. With this GSA, a gold standard can be set. By running metaquast with the GSA as input against the reference genomes, a benchmark can be set of how the most optimal assembly would perform. In my case, i can use the GSA evaluation against the two different assemblies i created (k21, meta-sens) to see which one is the best.

The accuracy of the assembled contigs is typically evaluated by aligning them to a reference genome, which is a well-known sequence that closely matches the metagenomic sample. However, in metagenomic samples, the presence of ambiguous regions can make it difficult to accurately align the contigs to the reference genome.

The --unique-mapping flag is an option in MetaQUAST that addresses this issue by restricting each contig to map to only a single position in the reference genome. By default, MetaQUAST uses the --ambiguity-usage 'all' option, which allows each contig to map to all possible positions in the reference genome. However, in some cases, this can cause an increase in the number of mismatches, especially in repeat regions that are almost always inexact due to accumulated SNPs, indels, etc.

Using the --unique-mapping flag can limit the number of mismatches and thus improve the accuracy of the evaluation by restricting each contig to map to only one position in the reference genome. However, it should be used with caution because some contigs may still produce multiple distinct mappings to different references in step 2, especially in the case of closely-related strains. In such cases, using the --reuse-combined-alignments flag can exclude this effect, as it speeds up the computation and may change the step 2 results.

The flag, --no-icarus stops metaQUAST to generate an interactive viewer, which is a computational resourceful task that we are not interested in.

Command used

I found two different commands CAMI used.

In the Supplementary information from the main paper they used this.

metaquast --no-icarus \
    --reuse-combined-alignments \
    --unique-mapping \
    -t 28 \    
    -o ../strmgCAMI2_co_assembly_metaquast-5.1.0rc1 \
    -r `cat ../refs` \
    ../strmgCAMI2_pooled/GS_*

However, in the Tutorial: assessing metagenomics software with the CAMI benchmarking toolkit paper they used this.

metaquast --no-icarus \
    --unique-mapping \
    -t 24  \
    -o /path/to/output_dir \
    -r /path/to/set0-9/ref-genomes \
    ../contigs/

Where the use of --reuse-combined-alignments differed.

ndreey commented 1 year ago

My benchmark as of 2023-04-14

This is the array job I ran to benchmark GSA and evaluate k21 and meta-sens. I did not run --reuse-combined-alignments as i recently learned about it.

#!/bin/bash

#SBATCH --job-name=metaQUAST_benchmark     # name that will show up in the queue
#SBATCH --array=1-3
#SBATCH --output=slurm-benchmark-%j.out             # filename of the output; the %j is equal to jobID
#SBATCH --error=slurm-benchmark-%j.err              #
#SBATCH --partition=cpuqueue              #
#SBATCH --ntasks=1                        # number of tasks (analyses) to run
#SBATCH --cpus-per-task=12              # the number of threads allocated to each task
#SBATCH --mem-per-cpu=16G                  # memory per cpu-core
#SBATCH --time=08:15:00                   # time for analysis (day-hour:min:sec)
#SBATCH --mail-type=ALL                   # send all type of email
#SBATCH --mail-user=andre.bourbonnais@sund.ku.dk

# I. Define directory names [DO NOT CHANGE]
# =========================================

# get the directories
submitdir=${SLURM_SUBMIT_DIR}
workdir=${TMPDIR}
jobid=${SLURM_ARRAY_TASK_ID}

# Information
echo "$(date)    Submitted from ${submitdir}"
echo "$(date)    Accessed ${workdir}"
echo "$(date)    ArrayID: ${jobid}"

# 1. Lock and load module and data
# ============================================
module load quast

ref_genomes=${submitdir}/bsc_thesis/data/references/reference_genomes
gsa=${submitdir}/bsc_thesis/gold_standards/gsa/assembly
k21=${submitdir}/bsc_thesis/processed/assembly/k21
meta_sens=${submitdir}/bsc_thesis/processed/assembly/meta_sens

# 2. Execute [MODIFY COMPLETELY TO YOUR NEEDS]
# ============================================

# Decides which assemblies to evaluate
if [ $jobid == 1 ]; then
  assemblies=$gsa
elif [ $jobid == 2 ]; then
  assemblies=$k21
else
  assemblies=$meta_sens
fi

# Different host-contamination level
suffixes=("gsa" "k21" "meta_sens")
# Define variable based on array id
suffix=${suffixes[$jobid-1]}

metaquast -o metaquast_${suffix} \
          -r ${ref_genomes}\
          --threads 12 \
          --no-icarus \
          --unique-mapping \
          ${assemblies}/*

The benchmark of GSA was the quickest while evaluation of meta-sens took ~3x more.

       JobID    JobName  Partition  AllocCPUS      State ExitCode    Elapsed
------------ ---------- ---------- ---------- ---------- -------- ----------
894680_1     metaQUAST+   cpuqueue         12  COMPLETED      0:0   01:49:41
894680_1.ba+      batch                    12  COMPLETED      0:0   01:49:41
894680_2     metaQUAST+   cpuqueue         12  COMPLETED      0:0   04:25:33
894680_2.ba+      batch                    12  COMPLETED      0:0   04:25:33
894680_3     metaQUAST+   cpuqueue         12  COMPLETED      0:0   06:07:57
894680_3.ba+      batch                    12  COMPLETED      0:0   06:07:57
ndreey commented 1 year ago

EXPLORATORY

To get a better evaluation of which assembly parameter was best i am running this script that will compare each assembly with its corresponding GSA.

#!/bin/bash

#SBATCH --job-name=gsaQUAST_benchmark     # name that will show up in the queue
#SBATCH --array=1-11%5
#SBATCH --output=slurm-gsa-%j.out             # filename of the output; the %j is equal to jobID
#SBATCH --error=slurm-gsa-%j.err              #
#SBATCH --partition=cpuqueue              #
#SBATCH --ntasks=1                        # number of tasks (analyses) to run
#SBATCH --cpus-per-task=6              # the number of threads allocated to each task
#SBATCH --mem-per-cpu=8G                  # memory per cpu-core
#SBATCH --time=04:15:00                   # time for analysis (day-hour:min:sec)
#SBATCH --mail-type=ALL                   # send all type of email
#SBATCH --mail-user=andre.bourbonnais@sund.ku.dk

# I. Define directory names [DO NOT CHANGE]
# =========================================

# get the directories
submitdir=${SLURM_SUBMIT_DIR}
workdir=${TMPDIR}
jobid=${SLURM_ARRAY_TASK_ID}

# Information
echo "$(date)    Submitted from ${submitdir}"
echo "$(date)    Accessed ${workdir}"
echo "$(date)    ArrayID: ${jobid}"

# 1. Lock and load module and data
# ============================================
module load quast

gsa=${submitdir}/bsc_thesis/gold_standards/gsa/assembly
k21=${submitdir}/bsc_thesis/processed/assembly/k21
meta_sens=${submitdir}/bsc_thesis/processed/assembly/meta_sens

# 2. Execute [MODIFY COMPLETELY TO YOUR NEEDS]
# ============================================

# Different host-contamination level
hc_level=("00" "01" "02" "03" "04" "05" "06" "07" "08" "09" "095")

# Define prefix based on array id
hc_prefix=${hc_level[$jobid-1]}

quast -o quast_${hc_prefix} \
        -r ${gsa}/${hc_prefix}_* \
        --threads 6 \
        --no-icarus \
        ${k21}/${hc_prefix}_* \
        ${meta_sens}/${hc_prefix}_*

echo "$(date)    HC: ${hc_prefix}"