Open ndreey opened 2 months ago
The first test run took about 14 hours, with first run running for 8.5h and last with 4.5h. However, we were using a k-mer list similar to MEGAHIT's sensitive
preset which performed well in CAMI but is perhaps overkill for our study.
#!/bin/bash
#SBATCH --job-name metaSPAdes
#SBATCH -A naiss2023-22-412
#SBATCH -p node -n 1
#SBATCH -t 08:35:00
#SBATCH -C mem1TB
#SBATCH --output=SLURM-%j-spades-P12002_111.out
#SBATCH --error=SLURM-%j-spades-P12002_111.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL
# Start time and date
echo "$(date) [Start]"
# Load in modules
module load bioinfo-tools
module load spades/3.15.5
# Create directory for trimmed reads if not existing
outdir="/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/03-ASSEMBLY/spades"
if [ ! -d "$outdir" ]; then
mkdir -p "$outdir"
fi
# Assembling the metagenome
spades.py \
--meta \
-1 02-TRIM/P12002_111_R1-trim.fastq.gz \
-2 02-TRIM/P12002_111_R2-trim.fastq.gz \
--threads 20 \
--memory 1000 \
-k 21,31,41,51,61,71,81,91,121 \
-o $outdir
# Restarting from checkpoint
#spades.py --continue -o $outdir
It generates a scaffolds.fasta
and contigs.fasta
, where contigs.fasta
only includes the contigs in descending order.
contigs.fasta
A C G T N IUPAC Other GC GC_stdev
0.3219 0.1732 0.1732 0.3317 0.0000 0.0000 0.0000 0.3464 0.0707
Main genome scaffold total: 964196
Main genome contig total: 964196
Main genome scaffold sequence total: 601.312 MB
Main genome contig sequence total: 601.312 MB 0.000% gap
Main genome scaffold N/L50: 231058/771
Main genome contig N/L50: 231058/771
Main genome scaffold N/L90: 739665/313
Main genome contig N/L90: 739665/313
Max scaffold length: 18.354 KB
Max contig length: 18.354 KB
Number of scaffolds > 50 KB: 0
% main genome in scaffolds > 50 KB: 0.00%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 964,196 964,196 601,311,979 601,311,979 100.00%
100 964,196 964,196 601,311,979 601,311,979 100.00%
250 911,348 911,348 591,623,037 591,623,037 100.00%
500 419,143 419,143 417,207,291 417,207,291 100.00%
1 KB 149,105 149,105 229,038,320 229,038,320 100.00%
2.5 KB 9,643 9,643 31,476,420 31,476,420 100.00%
5 KB 476 476 2,949,418 2,949,418 100.00%
10 KB 17 17 215,223 215,223 100.00%
scaffolds.fasta
A C G T N IUPAC Other GC GC_stdev
0.3219 0.1732 0.1732 0.3317 0.0004 0.0000 0.0000 0.3464 0.0708
Main genome scaffold total: 944766
Main genome contig total: 964197
Main genome scaffold sequence total: 601.573 MB
Main genome contig sequence total: 601.312 MB 0.043% gap
Main genome scaffold N/L50: 223177/799
Main genome contig N/L50: 231058/771
Main genome scaffold N/L90: 720678/315
Main genome contig N/L90: 739664/313
Max scaffold length: 20.161 KB
Max contig length: 18.354 KB
Number of scaffolds > 50 KB: 0
% main genome in scaffolds > 50 KB: 0.00%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 944,766 964,197 601,572,983 601,311,582 99.96%
100 944,766 964,197 601,572,983 601,311,582 99.96%
250 892,385 911,816 591,979,937 591,718,536 99.96%
500 418,989 438,375 424,298,496 424,037,940 99.94%
1 KB 153,549 165,857 238,837,146 238,671,646 99.93%
2.5 KB 10,778 12,699 35,429,516 35,403,525 99.93%
5 KB 570 716 3,515,136 3,513,008 99.94%
10 KB 17 18 224,855 224,845 100.00%
For the next run we will utilize the auto estimation of k-mers and avoid error correction that SPAdes auto enables (as reads are already trimmed).
#!/bin/bash
#SBATCH --job-name metaSPAdes-auto
#SBATCH -A naiss2023-22-412
#SBATCH -p node -n 1
#SBATCH -t 28:35:00
#SBATCH -C mem1TB
#SBATCH --output=SLURM-%j-spades-P12002_111-auto.out
#SBATCH --error=SLURM-%j-spades-P12002_111-auto.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL
# Start time and date
echo "$(date) [Start]"
# Load in modules
module load bioinfo-tools
module load spades/3.15.5
# Create directory for trimmed reads if not existing
outdir="/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/03-ASSEMBLY/spades-auto"
if [ ! -d "$outdir" ]; then
mkdir -p "$outdir"
fi
# Assembling the metagenome
spades.py \
--meta \
--only-assembler \
-k auto \
--threads 20 \
--memory 1000 \
-1 02-TRIM/P12002_111_R1-trim.fastq.gz \
-2 02-TRIM/P12002_111_R2-trim.fastq.gz \
-o $outdir
# Restarting from checkpoint
#spades.py --continue -o $outdir
contigs.fasta
A C G T N IUPAC Other GC GC_stdev
0.3187 0.1719 0.1718 0.3376 0.0000 0.0000 0.0000 0.3437 0.1137
Main genome scaffold total: 2166603
Main genome contig total: 2166603
Main genome scaffold sequence total: 634.002 MB
Main genome contig sequence total: 634.002 MB 0.000% gap
Main genome scaffold N/L50: 302644/559
Main genome contig N/L50: 302644/559
Main genome scaffold N/L90: 1268092/110
Main genome contig N/L90: 1268092/110
Max scaffold length: 15.893 KB
Max contig length: 15.893 KB
Number of scaffolds > 50 KB: 0
% main genome in scaffolds > 50 KB: 0.00%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 2,166,603 2,166,603 634,002,380 634,002,380 100.00%
50 2,166,603 2,166,603 634,002,380 634,002,380 100.00%
100 1,340,024 1,340,024 578,629,892 578,629,892 100.00%
250 751,035 751,035 478,491,357 478,491,357 100.00%
500 345,711 345,711 339,906,633 339,906,633 100.00%
1 KB 119,786 119,786 181,831,153 181,831,153 100.00%
2.5 KB 7,010 7,010 22,401,386 22,401,386 100.00%
5 KB 268 268 1,614,541 1,614,541 100.00%
10 KB 6 6 72,050 72,050 100.00%
MEGAHIT is a fast and well benchmarked tool that uses succinct de Bruijn Graphs (SdBG) (PAPER)
Noteworthy, MEGAHIT creates the output directory so it is important that the output directory specified does not exist beforehand. Furthermore, we used a fat node of 256GB and 16 cores. To not maximize the memory we set the memory limit to 0.95 * available memory.
--meta-sensitive
: k list: 21,29,39,49,59,69,79,89,99,109,119,129,141 Noteworthy, MEGAHIT generates checkpoints so if assembly suddenly terminates, prior computation is not lost. Hence, why the --continue
flag is set in the script below, as there were multiple terminates due to time.
#!/bin/bash
#SBATCH --job-name megahit
#SBATCH -A naiss2023-22-412
#SBATCH -p node -n 1
#SBATCH -t 08:35:00
#SBATCH -C mem256GB
#SBATCH --output=SLURM-%j-megahit-P12002_111.out
#SBATCH --error=SLURM-%j-megahit-P12002_111.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL
# Start time and date
echo "$(date) [Start]"
# Load in modules
module load bioinfo-tools
module load megahit/1.2.9
# Create directory for trimmed reads if not existing
outdir="/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/03-ASSEMBLY/"
if [ ! -d "$outdir" ]; then
mkdir -p "$outdir"
fi
# Load in modules
module load bioinfo-tools
module load megahit/1.2.9
megahit -t 16 --presets meta-sensitive \
-1 02-TRIM/P12002_111_R1-trim.fastq.gz \
-2 02-TRIM/P12002_111_R2-trim.fastq.gz \
--memory 0.95 \
-o $outdir/megahit-sensitive \
--out-prefix P12002_11 \
--continue
In 4h and 35min it managed to solve k21 to k59. Next try same time managed k69-99 and finally, the last run succeded the remaining in 3h. All in all it took around 12h. | k | ~time |
---|---|---|
21 | 1h19min | |
29 | 33min | |
39 | 1h20min | |
49 | 44min | |
59 | 1h | |
69 | 1h | |
79 | 1h10min | |
89 | 1h10min | |
99 | 1h1min | |
109 | 50min | |
119 | 1h | |
129 | 50min | |
141 | 5min |
Using bbmap's stats.sh
we get an idea of the metagenome. Some papers cite using minimum contig length of 1Kb while Anvio recommends 2.5Kb contigs for down stream analysis. From the stats below we see that we have 228,802 configs >1Kb and 14,737 configs > 2.5Kb. All in all the preliminary metagenome is 1.76Gb with 445k contigs making up 50% of the assembly with the L50 value being 715. In conclusion, 87% of our contigs are less than 1Kb in length....
Now this is from a single sample, if populations were pooled we would have more coverage and perhaps get a better N50 and L50.
A C G T N IUPAC Other GC GC_stdev
0.3300 0.1765 0.1764 0.3171 0.0000 0.0000 0.0000 0.3529 0.0709
Main genome scaffold total: 1762416
Main genome contig total: 1762416
Main genome scaffold sequence total: 1062.070 MB
Main genome contig sequence total: 1062.070 MB 0.000% gap
Main genome scaffold N/L50: 445870/715
Main genome contig N/L50: 445870/715
Main genome scaffold N/L90: 1362899/318
Main genome contig N/L90: 1362899/318
Max scaffold length: 20.522 KB
Max contig length: 20.522 KB
Number of scaffolds > 50 KB: 0
% main genome in scaffolds > 50 KB: 0.00%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 1,762,416 1,762,416 1,062,070,410 1,062,070,410 100.00%
100 1,762,416 1,762,416 1,062,070,410 1,062,070,410 100.00%
250 1,605,316 1,605,316 1,027,260,365 1,027,260,365 100.00%
500 764,592 764,592 722,055,568 722,055,568 100.00%
1 KB 228,802 228,802 349,677,244 349,677,244 100.00%
2.5 KB 14,737 14,737 48,569,631 48,569,631 100.00%
5 KB 865 865 5,417,808 5,417,808 100.00%
10 KB 23 23 275,304 275,304 100.00%
For the next run, we will utilize a smaller set of k-mers based on what SPAdes estimates (21,33,55
).
#!/bin/bash
#SBATCH --job-name megahit-auto
#SBATCH -A naiss2023-22-412
#SBATCH -p node -n 1
#SBATCH -t 12:35:00
#SBATCH -C mem256GB
#SBATCH --output=SLURM-%j-megahit-P12002_111-auto.out
#SBATCH --error=SLURM-%j-megahit-P12002_111-auto.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL
# Start time and date
echo "$(date) [Start]"
# Load in modules
module load bioinfo-tools
module load megahit/1.2.9
# Create directory for trimmed reads if not existing
outdir="/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/03-ASSEMBLY/"
if [ ! -d "$outdir" ]; then
mkdir -p "$outdir"
fi
# Load in modules
module load bioinfo-tools
module load megahit/1.2.9
megahit \
--k-list 21,33,55 \
--num-cpu-threads 20 \
--memory 0.95 \
-1 02-TRIM/P12002_111_R1-trim.fastq.gz \
-2 02-TRIM/P12002_111_R2-trim.fastq.gz \
-o $outdir/megahit-auto \
--out-prefix P12002_111_auto
Which gave these results! What we see that we have a 60% decrease in the number of contigs compared to the previous run. The N50 is similar to previous run in the sense of percentage from the total amount of contigs, 25%. Furthermore, L50 is 648 which is a bit smaller (which is OK, as we have probably discarded a lot of misassembled contigs). The longest contig is still 20Kb as previous.
What is very clear is the number of contigs above >1Kb and 2Kb is much less.
A C G T N IUPAC Other GC GC_stdev
0.3277 0.1777 0.1776 0.3170 0.0000 0.0000 0.0000 0.3553 0.0636
Main genome scaffold total: 705855
Main genome contig total: 705855
Main genome scaffold sequence total: 364.226 MB
Main genome contig sequence total: 364.226 MB 0.000% gap
Main genome scaffold N/L50: 175819/648
Main genome contig N/L50: 175819/648
Main genome scaffold N/L90: 544487/250
Main genome contig N/L90: 544487/250
Max scaffold length: 20.015 KB
Max contig length: 20.015 KB
Number of scaffolds > 50 KB: 0
% main genome in scaffolds > 50 KB: 0.00%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 705,855 705,855 364,225,915 364,225,915 100.00%
100 705,855 705,855 364,225,915 364,225,915 100.00%
250 544,487 544,487 328,355,285 328,355,285 100.00%
500 259,672 259,672 230,033,591 230,033,591 100.00%
1 KB 68,236 68,236 96,920,128 96,920,128 100.00%
2.5 KB 2,374 2,374 7,706,773 7,706,773 100.00%
5 KB 118 118 725,717 725,717 100.00%
10 KB 3 3 43,608 43,608 100.00%
SPAdes short read assembly
Minimum Number Total
Scaffold of Contig
Length Contigs Length
-------- -------------- --------------
All 2,166,603 634,002,380
50 2,166,603 634,002,380
100 1,340,024 578,629,892
250 751,035 478,491,357
500 345,711 339,906,633
1 KB 119,786 181,831,153
2.5 KB 7,010 22,401,386
5 KB 268 1,614,541
10 KB 6 72,050
MEGAHIT short read assembly
Minimum Number Total
Scaffold of Contig
Length Contigs Length
-------- -------------- --------------
All 705,855 364,225,915
100 705,855 364,225,915
250 544,487 328,355,285
500 259,672 230,033,591
1 KB 68,236 96,920,128
2.5 KB 2,374 7,706,773
5 KB 118 725,717
10 KB 3 43,608
Critical Assesment of Metagenome Interpretation (CAMI)
CAMI is an open challenge for meta-omic's programs where the developers of said programs test out their programs on multiple different datasets. These datasets are made from CAMISIM which simulates metagenomic data. CAMI used both old genomes as well as newly sequenced. All in all, they have the gold standards of assembly, binning, taxonomic classification and abundance. The results from this challenge was summarized to show a informative benchmark for the performance of the tools.
From the paper (CAMI2) we see that SPADES and MEGAHIT performed really well. The best (fastest) assembler is metaHipMer but it requires a ridiculous amount of memory. MEGAHIT was also incredibly fast and precise, where SPADES also got similar results but required more time and more memory than MEGAHIT. Importantly, SPADES has a hybrid mode which MEGAHIT does not have. Thus, we will assemble using MEGAHIT and SPADES to get a general idea of how the assemblies differs. As well which set of k-mer sizes are optimal for our data.
K-mer size selection
Small kmers are beneficial for filling gaps (insufficient sequencing coverage) in low-coverage regions, whereas large kmers is useful for resolving longer repeats (complex regions). Selecting optimal k-mer sizes becomes tricky in metagenomics as low abundance microbes will have less coverage and repetitive regions are found in most microbes and will thus be amplified.
Both SPAdes and MEGAHIT utilizes multiple k-mer sizes to find optimal assembly. However, reads less than 150bp
k=21,33,55
has proven optimal for most assemblies tested according to Using SPAdes De Novo Assembler. Furthermore, SPAdes auto estimates a set of k-mers and it does it without using N50 statistics. N50 alone is not always good as the contigs might be long but can be horribly missassembled. MEGAHIT does not auto estimate the k-mer size range.We start with assembling the CHST allopatric sample P12002_111.