ndreey / CONURA_WGS

Metagenomic analysis on whole genome sequencing data from Tephritis conura (IN PROGRESS)
0 stars 0 forks source link

Assembling the metagenome using SPAdes and MEGAHIT #25

Open ndreey opened 2 months ago

ndreey commented 2 months ago

Critical Assesment of Metagenome Interpretation (CAMI)

CAMI is an open challenge for meta-omic's programs where the developers of said programs test out their programs on multiple different datasets. These datasets are made from CAMISIM which simulates metagenomic data. CAMI used both old genomes as well as newly sequenced. All in all, they have the gold standards of assembly, binning, taxonomic classification and abundance. The results from this challenge was summarized to show a informative benchmark for the performance of the tools.

From the paper (CAMI2) we see that SPADES and MEGAHIT performed really well. The best (fastest) assembler is metaHipMer but it requires a ridiculous amount of memory. MEGAHIT was also incredibly fast and precise, where SPADES also got similar results but required more time and more memory than MEGAHIT. Importantly, SPADES has a hybrid mode which MEGAHIT does not have. Thus, we will assemble using MEGAHIT and SPADES to get a general idea of how the assemblies differs. As well which set of k-mer sizes are optimal for our data.

K-mer size selection

Small kmers are beneficial for filling gaps (insufficient sequencing coverage) in low-coverage regions, whereas large kmers is useful for resolving longer repeats (complex regions). Selecting optimal k-mer sizes becomes tricky in metagenomics as low abundance microbes will have less coverage and repetitive regions are found in most microbes and will thus be amplified.

Both SPAdes and MEGAHIT utilizes multiple k-mer sizes to find optimal assembly. However, reads less than 150bp k=21,33,55 has proven optimal for most assemblies tested according to Using SPAdes De Novo Assembler. Furthermore, SPAdes auto estimates a set of k-mers and it does it without using N50 statistics. N50 alone is not always good as the contigs might be long but can be horribly missassembled. MEGAHIT does not auto estimate the k-mer size range.

We start with assembling the CHST allopatric sample P12002_111.

ndreey commented 2 months ago

SPAdes

The first test run took about 14 hours, with first run running for 8.5h and last with 4.5h. However, we were using a k-mer list similar to MEGAHIT's sensitive preset which performed well in CAMI but is perhaps overkill for our study.

#!/bin/bash

#SBATCH --job-name metaSPAdes
#SBATCH -A naiss2023-22-412
#SBATCH -p node -n 1
#SBATCH -t 08:35:00
#SBATCH -C mem1TB
#SBATCH --output=SLURM-%j-spades-P12002_111.out
#SBATCH --error=SLURM-%j-spades-P12002_111.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL

# Start time and date
echo "$(date)       [Start]"

# Load in modules
module load bioinfo-tools
module load spades/3.15.5

# Create directory for trimmed reads if not existing
outdir="/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/03-ASSEMBLY/spades"
if [ ! -d "$outdir" ]; then
    mkdir -p "$outdir"
fi

# Assembling the metagenome
spades.py \
    --meta \
    -1 02-TRIM/P12002_111_R1-trim.fastq.gz \
    -2 02-TRIM/P12002_111_R2-trim.fastq.gz \
    --threads 20 \
    --memory 1000 \
    -k 21,31,41,51,61,71,81,91,121 \
    -o $outdir

# Restarting from checkpoint
#spades.py --continue -o $outdir

It generates a scaffolds.fasta and contigs.fasta, where contigs.fasta only includes the contigs in descending order.

contigs.fasta

A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.3219  0.1732  0.1732  0.3317  0.0000  0.0000  0.0000  0.3464  0.0707

Main genome scaffold total:             964196
Main genome contig total:               964196
Main genome scaffold sequence total:    601.312 MB
Main genome contig sequence total:      601.312 MB      0.000% gap
Main genome scaffold N/L50:             231058/771
Main genome contig N/L50:               231058/771
Main genome scaffold N/L90:             739665/313
Main genome contig N/L90:               739665/313
Max scaffold length:                    18.354 KB
Max contig length:                      18.354 KB
Number of scaffolds > 50 KB:            0
% main genome in scaffolds > 50 KB:     0.00%

Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig  
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
    All                964,196         964,196     601,311,979     601,311,979   100.00%
    100                964,196         964,196     601,311,979     601,311,979   100.00%
    250                911,348         911,348     591,623,037     591,623,037   100.00%
    500                419,143         419,143     417,207,291     417,207,291   100.00%
   1 KB                149,105         149,105     229,038,320     229,038,320   100.00%
 2.5 KB                  9,643           9,643      31,476,420      31,476,420   100.00%
   5 KB                    476             476       2,949,418       2,949,418   100.00%
  10 KB                     17              17         215,223         215,223   100.00%

scaffolds.fasta

A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.3219  0.1732  0.1732  0.3317  0.0004  0.0000  0.0000  0.3464  0.0708

Main genome scaffold total:             944766
Main genome contig total:               964197
Main genome scaffold sequence total:    601.573 MB
Main genome contig sequence total:      601.312 MB      0.043% gap
Main genome scaffold N/L50:             223177/799
Main genome contig N/L50:               231058/771
Main genome scaffold N/L90:             720678/315
Main genome contig N/L90:               739664/313
Max scaffold length:                    20.161 KB
Max contig length:                      18.354 KB
Number of scaffolds > 50 KB:            0
% main genome in scaffolds > 50 KB:     0.00%

Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig  
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
    All                944,766         964,197     601,572,983     601,311,582    99.96%
    100                944,766         964,197     601,572,983     601,311,582    99.96%
    250                892,385         911,816     591,979,937     591,718,536    99.96%
    500                418,989         438,375     424,298,496     424,037,940    99.94%
   1 KB                153,549         165,857     238,837,146     238,671,646    99.93%
 2.5 KB                 10,778          12,699      35,429,516      35,403,525    99.93%
   5 KB                    570             716       3,515,136       3,513,008    99.94%
  10 KB                     17              18         224,855         224,845   100.00%

For the next run we will utilize the auto estimation of k-mers and avoid error correction that SPAdes auto enables (as reads are already trimmed).

#!/bin/bash

#SBATCH --job-name metaSPAdes-auto
#SBATCH -A naiss2023-22-412
#SBATCH -p node -n 1
#SBATCH -t 28:35:00
#SBATCH -C mem1TB
#SBATCH --output=SLURM-%j-spades-P12002_111-auto.out
#SBATCH --error=SLURM-%j-spades-P12002_111-auto.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL

# Start time and date
echo "$(date)       [Start]"

# Load in modules
module load bioinfo-tools
module load spades/3.15.5

# Create directory for trimmed reads if not existing
outdir="/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/03-ASSEMBLY/spades-auto"
if [ ! -d "$outdir" ]; then
    mkdir -p "$outdir"
fi

# Assembling the metagenome
spades.py \
    --meta \
    --only-assembler \
    -k auto \
    --threads 20 \
    --memory 1000 \
    -1 02-TRIM/P12002_111_R1-trim.fastq.gz \
    -2 02-TRIM/P12002_111_R2-trim.fastq.gz \
    -o $outdir

# Restarting from checkpoint
#spades.py --continue -o $outdir

contigs.fasta

A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.3187  0.1719  0.1718  0.3376  0.0000  0.0000  0.0000  0.3437  0.1137

Main genome scaffold total:             2166603
Main genome contig total:               2166603
Main genome scaffold sequence total:    634.002 MB
Main genome contig sequence total:      634.002 MB      0.000% gap
Main genome scaffold N/L50:             302644/559
Main genome contig N/L50:               302644/559
Main genome scaffold N/L90:             1268092/110
Main genome contig N/L90:               1268092/110
Max scaffold length:                    15.893 KB
Max contig length:                      15.893 KB
Number of scaffolds > 50 KB:            0
% main genome in scaffolds > 50 KB:     0.00%

Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig  
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
    All              2,166,603       2,166,603     634,002,380     634,002,380   100.00%
     50              2,166,603       2,166,603     634,002,380     634,002,380   100.00%
    100              1,340,024       1,340,024     578,629,892     578,629,892   100.00%
    250                751,035         751,035     478,491,357     478,491,357   100.00%
    500                345,711         345,711     339,906,633     339,906,633   100.00%
   1 KB                119,786         119,786     181,831,153     181,831,153   100.00%
 2.5 KB                  7,010           7,010      22,401,386      22,401,386   100.00%
   5 KB                    268             268       1,614,541       1,614,541   100.00%
  10 KB                      6               6          72,050          72,050   100.00%
ndreey commented 2 months ago

MEGAHIT

MEGAHIT is a fast and well benchmarked tool that uses succinct de Bruijn Graphs (SdBG) (PAPER)

Noteworthy, MEGAHIT creates the output directory so it is important that the output directory specified does not exist beforehand. Furthermore, we used a fat node of 256GB and 16 cores. To not maximize the memory we set the memory limit to 0.95 * available memory.

Noteworthy, MEGAHIT generates checkpoints so if assembly suddenly terminates, prior computation is not lost. Hence, why the --continue flag is set in the script below, as there were multiple terminates due to time.

#!/bin/bash

#SBATCH --job-name megahit
#SBATCH -A naiss2023-22-412
#SBATCH -p node -n 1
#SBATCH -t 08:35:00
#SBATCH -C mem256GB
#SBATCH --output=SLURM-%j-megahit-P12002_111.out
#SBATCH --error=SLURM-%j-megahit-P12002_111.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL

# Start time and date
echo "$(date)       [Start]"

# Load in modules
module load bioinfo-tools
module load megahit/1.2.9

# Create directory for trimmed reads if not existing
outdir="/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/03-ASSEMBLY/"
if [ ! -d "$outdir" ]; then
    mkdir -p "$outdir"
fi

# Load in modules
module load bioinfo-tools
module load megahit/1.2.9

megahit -t 16 --presets meta-sensitive \
    -1 02-TRIM/P12002_111_R1-trim.fastq.gz \
    -2 02-TRIM/P12002_111_R2-trim.fastq.gz \
    --memory 0.95 \
    -o $outdir/megahit-sensitive \
    --out-prefix P12002_11 \
    --continue

Run statistics

In 4h and 35min it managed to solve k21 to k59. Next try same time managed k69-99 and finally, the last run succeded the remaining in 3h. All in all it took around 12h. k ~time
21 1h19min
29 33min
39 1h20min
49 44min
59 1h
69 1h
79 1h10min
89 1h10min
99 1h1min
109 50min
119 1h
129 50min
141 5min

Using bbmap's stats.sh we get an idea of the metagenome. Some papers cite using minimum contig length of 1Kb while Anvio recommends 2.5Kb contigs for down stream analysis. From the stats below we see that we have 228,802 configs >1Kb and 14,737 configs > 2.5Kb. All in all the preliminary metagenome is 1.76Gb with 445k contigs making up 50% of the assembly with the L50 value being 715. In conclusion, 87% of our contigs are less than 1Kb in length.... Now this is from a single sample, if populations were pooled we would have more coverage and perhaps get a better N50 and L50.

A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.3300  0.1765  0.1764  0.3171  0.0000  0.0000  0.0000  0.3529  0.0709

Main genome scaffold total:             1762416
Main genome contig total:               1762416
Main genome scaffold sequence total:    1062.070 MB
Main genome contig sequence total:      1062.070 MB     0.000% gap
Main genome scaffold N/L50:             445870/715
Main genome contig N/L50:               445870/715
Main genome scaffold N/L90:             1362899/318
Main genome contig N/L90:               1362899/318
Max scaffold length:                    20.522 KB
Max contig length:                      20.522 KB
Number of scaffolds > 50 KB:            0
% main genome in scaffolds > 50 KB:     0.00%

Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig  
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
    All              1,762,416       1,762,416   1,062,070,410   1,062,070,410   100.00%
    100              1,762,416       1,762,416   1,062,070,410   1,062,070,410   100.00%
    250              1,605,316       1,605,316   1,027,260,365   1,027,260,365   100.00%
    500                764,592         764,592     722,055,568     722,055,568   100.00%
   1 KB                228,802         228,802     349,677,244     349,677,244   100.00%
 2.5 KB                 14,737          14,737      48,569,631      48,569,631   100.00%
   5 KB                    865             865       5,417,808       5,417,808   100.00%
  10 KB                     23              23         275,304         275,304   100.00%

For the next run, we will utilize a smaller set of k-mers based on what SPAdes estimates (21,33,55).

#!/bin/bash

#SBATCH --job-name megahit-auto
#SBATCH -A naiss2023-22-412
#SBATCH -p node -n 1
#SBATCH -t 12:35:00
#SBATCH -C mem256GB
#SBATCH --output=SLURM-%j-megahit-P12002_111-auto.out
#SBATCH --error=SLURM-%j-megahit-P12002_111-auto.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL

# Start time and date
echo "$(date)       [Start]"

# Load in modules
module load bioinfo-tools
module load megahit/1.2.9

# Create directory for trimmed reads if not existing
outdir="/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/03-ASSEMBLY/"
if [ ! -d "$outdir" ]; then
    mkdir -p "$outdir"
fi

# Load in modules
module load bioinfo-tools
module load megahit/1.2.9

megahit \
    --k-list 21,33,55 \
    --num-cpu-threads 20 \
    --memory 0.95 \
    -1 02-TRIM/P12002_111_R1-trim.fastq.gz \
    -2 02-TRIM/P12002_111_R2-trim.fastq.gz \
    -o $outdir/megahit-auto \
    --out-prefix P12002_111_auto

Which gave these results! What we see that we have a 60% decrease in the number of contigs compared to the previous run. The N50 is similar to previous run in the sense of percentage from the total amount of contigs, 25%. Furthermore, L50 is 648 which is a bit smaller (which is OK, as we have probably discarded a lot of misassembled contigs). The longest contig is still 20Kb as previous.

What is very clear is the number of contigs above >1Kb and 2Kb is much less.

A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.3277  0.1777  0.1776  0.3170  0.0000  0.0000  0.0000  0.3553  0.0636

Main genome scaffold total:             705855
Main genome contig total:               705855
Main genome scaffold sequence total:    364.226 MB
Main genome contig sequence total:      364.226 MB      0.000% gap
Main genome scaffold N/L50:             175819/648
Main genome contig N/L50:               175819/648
Main genome scaffold N/L90:             544487/250
Main genome contig N/L90:               544487/250
Max scaffold length:                    20.015 KB
Max contig length:                      20.015 KB
Number of scaffolds > 50 KB:            0
% main genome in scaffolds > 50 KB:     0.00%

Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
    All                705,855         705,855     364,225,915     364,225,915   100.00%
    100                705,855         705,855     364,225,915     364,225,915   100.00%
    250                544,487         544,487     328,355,285     328,355,285   100.00%
    500                259,672         259,672     230,033,591     230,033,591   100.00%
   1 KB                 68,236          68,236      96,920,128      96,920,128   100.00%
 2.5 KB                  2,374           2,374       7,706,773       7,706,773   100.00%
   5 KB                    118             118         725,717         725,717   100.00%
  10 KB                      3               3          43,608          43,608   100.00%
ndreey commented 1 month ago

SPAdes short read assembly

Minimum       Number          Total           
Scaffold      of              Contig          
Length        Contigs         Length          
--------      --------------  --------------  
    All            2,166,603     634,002,380  
     50            2,166,603     634,002,380  
    100            1,340,024     578,629,892  
    250              751,035     478,491,357  
    500              345,711     339,906,633  
   1 KB              119,786     181,831,153  
 2.5 KB                7,010      22,401,386  
   5 KB                  268       1,614,541  
  10 KB                    6          72,050  

MEGAHIT short read assembly

Minimum      Number          Total          
Scaffold     of              Contig         
Length       Contigs         Length         
--------     --------------  -------------- 
    All             705,855     364,225,915 
    100             705,855     364,225,915 
    250             544,487     328,355,285 
    500             259,672     230,033,591 
   1 KB              68,236      96,920,128 
 2.5 KB               2,374       7,706,773 
   5 KB                 118         725,717 
  10 KB                   3          43,608