A metagenomic analysis is exhaustive and complicated, as it utilizes a plethora of tools. This documentation describes the steps taken to produce the Candidatus Stammerula tephritis metagenome-assembled genome (MAG).
There are multiple ways of getting from raw sequences to a MAG. During this research project, a multitude of tools, wrappers, programs, and workflow managers were tested; thus, the repository is filled with various scripts. As this research project has ended, my research thesis will start as a continuation of this work. Hence, the scripts are abundant and important, as I might require them in the future.
To understand my decision making, fallbacks, and wins, please look at the issues tab on this repository; it has been used as a log book and might come in handy, as might my project report.
Finally, before starting this journey, be aware that whole-metagenome analysis is quite a hefty mission. Basic bioinformatic fundamentals must already be understood before getting lost in the jungle of metagenomics.
This guide assumes you have these fundamentals, and therefore I will speed through the many steps. I would say there are three major steps:

1. Quality control and data wrangling: there are many ways to wrangle and structure the data, and it takes time to find the optimal one.
2. Assembling the metagenome and populating databases with contig and alignment stats: Anvio has so many tools...
3. Bin curation: binning, then populating the bins with various statistics, and then finding which one is best.
Hence, heed my warning: some scripts are hard coded, some require you to pass an argument or arguments, some work with sbatch (SLURM), some require interactive sessions, and some require manually editing script parameters.
Getting from raw reads to an assembled metagenome is quite straightforward.
Setting up the Anvio contig and profile databases, and binning this newly formatted metagenome, is a breeze and well structured.
At the moment, it is the refinement of bins into a MAG that is clunky and requires some hours of work, as it involves analyzing a lot of different results.
Nonetheless, here we go!
These are the tools, and their versions, that were used to perform this scientific study.
Noteworthy, metaWRAP v1.3.2, for example, includes the binners CONCOCT, MaxBin2, and MetaBAT2; depending on the version of metaWRAP, the versions of the binners might differ. One can approach this by doing all the steps oneself, but wrapper programs such as Anvio, Bakta, and metaWRAP make one's life easier.
| Tool | Version |
|---|---|
| fastp | 0.23.4 |
| Kraken2 | 2.1.2 |
| BWA-MEM | 0.7.17 |
| BEDTools | 2.31.1 |
| Minimap2 | 2.26 |
| SPAdes | 3.15.5 |
| metaSPAdes | 3.15.5 |
| hybridSPAdes | 3.15.5 |
| Anvio | 8.0 |
| Bowtie2 | tba |
| Prodigal | 2.6.3 |
| metaWRAP | 1.3.2 |
| CONCOCT | 1.0.0 |
| MaxBin2 | 2.2.7 |
| MetaBAT2 | 2.15 |
| CheckM | 1.0.18 |
| CheckM2 | 1.0.1 |
| BUSCO | 5.5.0 |
| GTDB-Tk | 2.4.0 |
| Bakta | 1.9.3 |
| BLAST+ | 2.15.0 |
| R script dotPlotly | N/A |
In brief, the workflow was:

- Reads were trimmed with `fastp`, removing adapter sequences, low-quality reads (Phred score < 20), and reads shorter than 36 bp.
- Host reads were removed by aligning against the host genome with `BWA-MEM` for short reads and `Minimap2` for long reads.
- Unmapped (non-host) reads were extracted with the `bamtofastq` module from `BEDTools`.
- Metagenomes were assembled with `SPAdes` for the different population samples, utilizing `metaSPAdes` for short reads and `hybridSPAdes` for mixed read types.
- Binning was performed with `metaWRAP`, integrating `CONCOCT`, `MaxBin2`, and `MetaBAT2` to refine bins based on completeness and contamination thresholds.
- Bin quality was assessed with `CheckM2`, `BUSCO`, and `Anvio`.
- Bins were classified with `GTDB-Tk` and annotated using `Bakta`.

Utilizing the metadata files in `doc/` and the scripts in `scripts/`, we will blast through the quality control.
The sample data is split across populations, across samples, and across lanes.
Therefore, run the script `fastp-trim-per-lane.sh`. It utilizes the `.lst` files, which hold the paths to the corresponding sample lanes.
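For illustration, a `.lst` file holds one path per line for a sample's lanes; a hypothetical example (the real paths live in the `doc/` metadata files):

```
/path/to/raw/CHES/P12002_144_S15_L001_R1.fastq.gz
/path/to/raw/CHES/P12002_144_S15_L001_R2.fastq.gz
```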
This script will put the `fastp` quality reports in `01-QC/` and the trimmed lane `.fastq` files in `02-TRIM/`. These were the settings used for `fastp`:
```bash
fastp \
    --in1 $R1_IN --in2 $R2_IN \
    --out1 $R1_OUT --out2 $R2_OUT \
    --html "${FASTP_DIR}/${R2_TRIMD}-fastp.html" \
    --json "${FASTP_DIR}/${R2_TRIMD}-fastp.json" \
    --average_qual 20 \
    --length_required 36 \
    --detect_adapter_for_pe \
    --trim_poly_g \
    --dedup \
    --thread 16 \
    --verbose
```
Run the `bwa` and `minimap2` scripts found here:
```
scripts/decontamination/
├── bwa-index-genome.sh
├── bwa-mem-decon.sh
├── kneaddata_run.sh
├── minimap2-decon.sh
└── minimap2-index.sh
```
They will create the corresponding `bwa` and `minimap2` indexes of the *T. conura* reference genome and place the clean reads in `04-CLEAN-FASTQ/`. The indexes and alignments will be stored in `03-HOST-ALIGNMENT-BAM/` for all populations, as seen below.
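The core of the decontamination is roughly the following (a minimal sketch of what `bwa-mem-decon.sh` does, assuming standard `bwa`, `samtools`, and `bedtools` usage; the actual script and file names may differ):

```bash
# Align lane reads to the host reference (index built by bwa-index-genome.sh)
bwa mem -t 16 tconura-ref.fa sample_R1.fastq.gz sample_R2.fastq.gz \
    | samtools view -b -o sample.bam -

# Keep only pairs where both mates are unmapped (non-host reads):
# -f 12 = read unmapped AND mate unmapped; -F 256 = drop secondary alignments
samtools view -b -f 12 -F 256 sample.bam > sample-unmapped-pairs.bam

# bedtools bamtofastq expects name-sorted input for paired output
samtools sort -n -o sample-sorted-unmapped-pairs.bam sample-unmapped-pairs.bam
bedtools bamtofastq -i sample-sorted-unmapped-pairs.bam \
    -fq sample_R1-clean.fastq -fq2 sample_R2-clean.fastq
```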
```
03-HOST-ALIGNMENT-BAM/
CHES CHFI CHSC CHSK CHST COES COGE COLI COSK CPSC hifi-pacbio
```

```
03-HOST-ALIGNMENT-BAM/
├── CHES
│   ├── P12002_144_S15_L001
│   │   ├── P12002_144_S15_L001.bam
│   │   ├── P12002_144_S15_L001-sorted-unmapped-pairs.bam
│   │   ├── P12002_144_S15_L001-unmapped-pairs.bam
│   │   └── P12002_144_S15_L001-unmapped-reads.bam
│   ├── P12002_144_S60_L003
│   │   ├── P12002_144_S60_L003.bam
│   │   ├── P12002_144_S60_L003-sorted-unmapped-pairs.bam
│   │   ├── P12002_144_S60_L003-unmapped-pairs.bam
│   │   └── P12002_144_S60_L003-unmapped-reads.bam
│   ├── P12002_144_S60_L006
...
```
The clean reads will be stored as follows.
```
04-CLEAN-FASTQ/
├── CHES
│   ├── P12002_144_S15_L001_R1-clean.fastq.gz
│   ├── P12002_144_S15_L001_R2-clean.fastq.gz
│   ├── P12002_144_S60_L003_R1-clean.fastq.gz
│   ├── P12002_144_S60_L003_R2-clean.fastq.gz
│   ├── P12002_144_S60_L006_R1-clean.fastq.gz
...
│   └── P12002_153_S63_L006_R2-clean.fastq.gz
├── CHFI
│   ├── P12002_122_S20_L002_R1-clean.fastq.gz
│   ├── P12002_122_S20_L002_R2-clean.fastq.gz
...
```
Merging is as simple as concatenating; however, the commands differ depending on how you want to merge. I merged all reads together, population-wise, and host-plant-wise, using these scripts (a sketch of the idea follows the listing):
```
scripts/merge-clean-reads/
├── merge-all-reads.sh
├── merge-clean-reads.sh
└── merge-hostplant-reads.sh
```
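The merging scripts boil down to concatenation; a minimal sketch for one population (assuming the `04-CLEAN-FASTQ/` layout above; the actual scripts may differ):

```bash
POP="CHES"
# gzip streams can be concatenated directly, so no decompression is needed
cat 04-CLEAN-FASTQ/${POP}/*_R1-clean.fastq.gz > 05-CLEAN-MERGED/${POP}_R1-clean.fq.gz
cat 04-CLEAN-FASTQ/${POP}/*_R2-clean.fastq.gz > 05-CLEAN-MERGED/${POP}_R2-clean.fq.gz
```

After running the merge scripts, you will have merged `.fastq` files as follows.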
```
05-CLEAN-MERGED/
├── all_R1-clean.fq.gz
├── all_R2-clean.fq.gz
├── CHES_R1-clean.fq.gz
├── CHES_R2-clean.fq.gz
├── CHFI_R1-clean.fq.gz
├── CHFI_R2-clean.fq.gz
├── CH_R1-clean.fq.gz
├── CH_R2-clean.fq.gz
├── CHSC_R1-clean.fq.gz
├── CHSC_R2-clean.fq.gz
├── CHSK_R1-clean.fq.gz
├── CHSK_R2-clean.fq.gz
├── CHST_R1-clean.fq.gz
├── CHST_R2-clean.fq.gz
├── COES_R1-clean.fq.gz
├── COES_R2-clean.fq.gz
├── COGE_R1-clean.fq.gz
├── COGE_R2-clean.fq.gz
├── COLI_R1-clean.fq.gz
├── COLI_R2-clean.fq.gz
├── CO_R1-clean.fq.gz
├── CO_R2-clean.fq.gz
├── COSK_R1-clean.fq.gz
├── COSK_R2-clean.fq.gz
├── CPSC_R1-clean.fq.gz
└── CPSC_R2-clean.fq.gz
```
Wonderful, done with data wrangling.
Here, you will need to utilize two scripts, `hybridspades-assembly.sh` and `metaspades-assembly.sh`, depending on the data. `CHST` and `all` utilize long-read data in a hybrid fashion; thus, these metagenomes were assembled with `hybridspades-assembly.sh`. If a population only consists of short reads, use `metaspades-assembly.sh`.
```
scripts/assembly
├── hybridspades-assembly.sh
├── megahit-metagenome-assembly.sh
├── metaspades-assembly.sh
├── spades-lr-metagenome-assembly.sh
├── spades-sr-CHST-metagenome-assembly.sh
└── spades-sr-metagenome-assembly.sh
```
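For the hybrid populations, the SPAdes call adds the long reads on top of the paired-end input. A sketch of what `hybridspades-assembly.sh` might run (the `--pacbio` flag and the long-read file name are assumptions based on the PacBio HiFi data; the actual script may pass the long reads differently):

```bash
POP="CHST"
# Hypothetical long-read file; adjust to wherever the clean HiFi reads live
LR="05-CLEAN-MERGED/${POP}-long-clean.fq.gz"
spades.py \
    --meta \
    --only-assembler \
    -k auto \
    --threads 20 \
    --memory 1000 \
    -1 05-CLEAN-MERGED/${POP}_R1-clean.fq.gz \
    -2 05-CLEAN-MERGED/${POP}_R2-clean.fq.gz \
    --pacbio "$LR" \
    -o 06-ASSEMBLY/${POP}
```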
Noteworthy, you will have to specify which population you want to assemble, as this script requires an argument; our clean and merged reads have standardized names with the `$POP` prefix. For example, if you want to assemble the COGE metagenome, run this:
```bash
sbatch /scripts/assembly/metaspades-assembly.sh COGE
```
The process is straight as an arrow! The script will store the metagenome under `06-ASSEMBLY/$POP`:
```bash
# Set variables
POP=$1
SR_DIR="05-CLEAN-MERGED"

# Create the assembly output directory if it does not exist
outdir="/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/06-ASSEMBLY/${POP}"
if [ ! -d "$outdir" ]; then
    mkdir -p "$outdir"
fi

# Assemble the metagenome
spades.py \
    --meta \
    --only-assembler \
    -k auto \
    --threads 20 \
    --memory 1000 \
    -1 ${SR_DIR}/${POP}_R1-clean.fq.gz \
    -2 ${SR_DIR}/${POP}_R2-clean.fq.gz \
    -o $outdir
```
If you need help installing Anvio, specifically on an HPC, then check my issue here: Installing Anvio-8, Pavlins approach.
If you want to understand the Anvio steps better, check the Anvi'o User Tutorial for Metagenomic Workflow.
To not clutter the readme, take a peek at my issue: REVAMP: Structured standard for MAG curation for all approaches.
It explains the dynamic scripts I created to make the Anvio and metaWRAP parts easier and well structured.
Following the previous steps, you should have this:

```
01-ALIGNMENT 02-CONTIG-DB 03-PROFILES 04-MERGED-PROFILES 05-metaWRAP
```

with a bunch of bin results from the different binners in `05-metaWRAP/$POP/`.
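For orientation, the databases behind `02-CONTIG-DB` and `03-PROFILES` are built with standard Anvio commands along these lines (a sketch of what `reformat-fasta.sh`, `make-contig-db.sh`, and `make-profile-db.sh` do; the names and flags in the actual scripts may differ):

```bash
POP="CHST"
# Reformat the assembly with simplified contig names (Anvio requires this)
anvi-script-reformat-fasta 06-ASSEMBLY/${POP}/contigs.fasta \
    -o ${POP}-fixed.fa --simplify-names

# Build the contigs database, then profile the sample's BAM against it
anvi-gen-contigs-database -f ${POP}-fixed.fa -o 02-CONTIG-DB/${POP}.db -n "${POP}"
anvi-profile -i 01-ALIGNMENT/${POP}.bam -c 02-CONTIG-DB/${POP}.db \
    -o 03-PROFILES/${POP}
```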
In the script folder `mag-curation`, there are a bunch of scripts.
```
scripts/mag-curation/
├── checkm2-check.sh
├── create-contigs2bin.sh
├── gtdb-taxo.sh
├── make-contig-db.sh
├── make-profile-db-FIX.sh
├── make-profile-db.sh
├── metawrap-bin.sh
├── metawrap-refine.sh
├── reformat-fasta-fix.sh
├── reformat-fasta.sh
└── summarize-gtdbtk.sh
```
We want to use `metawrap-refine.sh`. For my analysis I used the thresholds `COMP=50 CONT=10`. I arrived at these thresholds by first testing a range of different values, and I found this combination to be the optimal one.
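Under the hood this boils down to metaWRAP's `bin_refinement` module; a sketch with the thresholds above (the paths are illustrative; `metawrap-refine.sh` wraps something like this):

```bash
POP="CHST"
metawrap bin_refinement \
    -o 05-metaWRAP/${POP}/refined \
    -t 16 \
    -A 05-metaWRAP/${POP}/concoct_bins \
    -B 05-metaWRAP/${POP}/maxbin2_bins \
    -C 05-metaWRAP/${POP}/metabat2_bins \
    -c 50 -x 10
```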
Check out the awesome resources on metaWRAP's GitHub: git/metaWRAP. And don't forget to read the paper (https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-018-0541-1), specifically regarding the `bin_refinement` module.
With all the bins, and a cluster of aggregated bins from metaWRAP's `bin_refinement` module, it is time to gather summary statistics for each to determine which bin meets the criteria for a quality MAG.
metaWRAP uses CheckM1, an older version which I believe has even stopped being updated. CheckM2 is great, as its gradient-boosted algorithm has been shown to perform better at assessing novel microbes.
Use the `checkm2-check.sh` script in the `mag-curation` folder. It will assess all the bins binned by metaWRAP for the given population argument and summarize the information into a `.tsv` file. The output from CheckM2 is also stored in the corresponding population folder.
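The underlying call is CheckM2's `predict` command; roughly (a sketch with illustrative paths, assuming `.fa` bin extensions):

```bash
POP="CHST"
checkm2 predict \
    --threads 16 \
    --input 05-metaWRAP/${POP}/metawrap_50_10_bins \
    --output-directory 06-CHECKM2/${POP} \
    -x fa
```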
```
ls 06-CHECKM2/
all CHST COGE results

cat 06-CHECKM2/results/CHST-bin-report.tsv
Binner Name Completeness_General Contamination Coding_Density Contig_N50 Average_Gene_Length Genome_Size GC_Content Total_Coding_Sequences
concoct bin.0 6.14 0.0 0.508 3089 177.1764705882353 88268 0.39 85
concoct bin.1 99.02 0.0 0.733 770501 292.45879556259905 1506883 0.43 1262
concoct bin.10 6.38 0.0 0.96 5441 218.0 5441 0.45 8
```
Here we will use `gtdb-taxo.sh` and `summarize-gtdbtk.sh` to classify all bins and summarize the results neatly. The results will be in their own folder, as before with `06-CHECKM2`, but with the full information for each bin stored in its corresponding folder.
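Classification goes through GTDB-Tk's `classify_wf`; a sketch of what `gtdb-taxo.sh` likely runs (paths illustrative; GTDB-Tk 2.x also requires either `--mash_db` or `--skip_ani_screen`):

```bash
POP="CHST"
gtdbtk classify_wf \
    --genome_dir 05-metaWRAP/${POP}/metawrap_50_10_bins \
    --out_dir 07-GTDB-Tk/${POP}/metawrap_50_10_bins \
    --extension fa \
    --cpus 16 \
    --skip_ani_screen
```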
```
07-GTDB-Tk/results/
all-gtdb-report.tsv CHST-gtdb-report.tsv COGE-gtdb-report.tsv

ls 07-GTDB-Tk/CHST/metawrap_50_10_bins/
align classify gtdbtk.bac120.summary.tsv gtdbtk.json gtdbtk.log gtdbtk.warnings.log identify
```
Here we will use BLAST+. First create a database from all the bins you have created using `make-db.sh`, and then search the database with your query using `blastn-localdb.sh`. Here, many parameters are hard coded: if you want to change the query from `doc/Erwinia_Stammerula_Wolbachia-16SrRNA.fa`, you only need to change the path inside the script.
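The two steps map onto standard BLAST+ commands; roughly (a sketch with illustrative file names):

```bash
# Build a nucleotide database from all bins (the make-db.sh step)
cat bins/*.fa > all-bins.fa
makeblastdb -in all-bins.fa -dbtype nucl -out all-bins-db

# Search it with the 16S query (the blastn-localdb.sh step)
blastn -query doc/Erwinia_Stammerula_Wolbachia-16SrRNA.fa \
    -db all-bins-db \
    -outfmt 6 \
    -out 16S-hits.tsv
```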
```
../scripts/blast/
├── align-bin.sh
├── blastn-localdb.sh
├── make-dastool-db.sh
├── make-db.sh
├── make-metawrap-db.sh
└── make-nf-core-db.sh
```
Lastly, it is BUSCO time! You will find the script here:
```
../scripts/busco/
└── busco-time.sh
```
Here, you will utilize the path to the bin you want to check and the taxonomic information you gained from GTDB-Tk.
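A sketch of the BUSCO call behind `busco-time.sh` (the bin path and lineage are assumptions; pick the lineage dataset matching the GTDB-Tk classification, e.g. `enterobacterales_odb10` for the Stammerula bin):

```bash
busco \
    -i 05-metaWRAP/CHST/metawrap_50_10_bins/bin.1.fa \
    -l enterobacterales_odb10 \
    -m genome \
    -o busco-CHST-bin1 \
    -c 16
```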
Here, you first annotate with `bakta-annotate.sh` by setting the `$POP` argument as you run the script. Notably, the script will skip any bins that have less than 50% completeness by utilizing the CheckM2 information from the earlier steps.
```
../scripts/bakta/
├── bakta-annotate.sh
└── summarize-bakta.sh
```
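Per bin, the annotation is a standard Bakta call along these lines (a sketch; the database path and output folder are assumptions):

```bash
bakta \
    --db /path/to/bakta-db \
    --output 08-BAKTA/CHST/bin.1 \
    --prefix bin.1 \
    --threads 16 \
    05-metaWRAP/CHST/metawrap_50_10_bins/bin.1.fa
```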
Then you can summarize all the bins using the `$POP` argument again. It will return a `.csv` file holding information on the key genes that are required to achieve high-quality status under the MIMAG guidelines.
```
Binner,Bin,23S,15S,5S,rRNA,tRNA,ori
concoct,bin.1,1,6,5,12,43,0
concoct,bin.2,1,1,0,2,35,1
maxbin2,bin.0,1,7,5,13,45,5
maxbin2,bin.1,1,1,0,2,35,1
metabat2,bin.1,1,1,0,2,35,1
metabat2,bin.2,1,7,5,13,45,0
metawrap,bin.1,1,6,5,12,43,0
metawrap,bin.2,1,1,0,2,35,1
```
I used R: I loaded in the files and did some data wrangling.

```{r}
library(tidyverse)
library(flexplot)
# Load in data
all_bin <- read_delim("results/CHST-bin-report.tsv", delim = "\t")
all_tax <- read_delim("results/CHST-gtdb-report.tsv", delim = "\t")
all_mag <- read_delim("results/CHST-MAG-assembly-stats.csv", delim = ",")
all_anno <- read_delim("results/CHST-bakta-report.csv", delim = ",")
# Rename Name column in all_bin to Bin
all_bin <- all_bin %>% rename(Bin = Name)
all_tax <- all_tax %>% rename(Bin = user_genome)
all_mag <- all_mag %>% rename(Bin = bin, Binner = binner)
# Merge the dataframes (unwanted columns are discarded further down)
merged_data <- all_bin %>%
inner_join(all_tax, by = c("Binner", "Bin")) %>%
inner_join(all_mag, by = c("Binner", "Bin")) %>%
inner_join(all_anno, by = c("Binner","Bin"))
# Change column names
merged_data <- merged_data %>%
rename(Comp. = Completeness_General,
Cont. = Contamination,
CDS_Dens. = Coding_Density,
N50 = ctg_N50,
L50 = ctg_L50,
Avg.Gene.Len. = Average_Gene_Length,
GC = GC_Content,
Tot.CDS = Total_Coding_Sequences,
Taxa_tree = classification,
Closest_ref = closest_genome_reference,
ANI = closest_genome_ani,
AF = closest_genome_af,
Contigs = n_contigs,
Contig_max = ctg_max,
Size = Genome_Size)
merged_data <- merged_data %>%
select(-closest_genome_taxonomy, -Contig_N50) %>%
select(Binner, Bin, Comp., Cont., Contigs, N50, L50, Contig_max, GC, Size, CDS_Dens.,
Tot.CDS, Avg.Gene.Len., `23S`, `15S`, `5S`, rRNA, tRNA, ori, ANI,
AF, Closest_ref, everything())
# Separate Taxa_tree into individual columns
merged_data <- merged_data %>%
separate(Taxa_tree, into = c("Domain", "Phyla", "Clade", "Order", "Family",
"Genus", "Species"), sep = ";", fill = "right")
# Remove the prefixes (d__, p__, c__, o__, f__, g__, s__) from the new columns
merged_data <- merged_data %>%
mutate(across(c(Domain, Phyla, Clade, Order, Family, Genus, Species), ~ sub("^[a-z]__", "", .)))
# Replace missing Genus values with "unspecified"
merged_data <- merged_data %>%
mutate(Genus = ifelse(is.na(Genus) | Genus == "", "unspecified", Genus))
# Metawrap dataframe
metawrap_data <- merged_data %>%
select(Binner, Bin, Comp., Cont., Contigs, N50, L50, Contig_max, GC, Size,
CDS_Dens., Tot.CDS, Avg.Gene.Len., `23S`, `15S`, `5S`, rRNA, tRNA, ori,
ANI, AF, Closest_ref, Domain, Phyla, Clade, Order, Family, Genus,
Species) %>%
filter(Binner == "metawrap") %>%
select(-Binner)
# Convert all columns to character type
metawrap_data <- metawrap_data %>% mutate(across(everything(), as.character))
# Step 1: Convert to longer format, excluding Bin
wrap_long_df <- metawrap_data %>%
pivot_longer(cols = -Bin, names_to = "Metric", values_to = "value")
# Step 2: Pivot to wider format using Bin as the new headers
metawrap_data <- wrap_long_df %>%
pivot_wider(names_from = Bin, values_from = value)
# Save the metawrap_data as a csv file
write_csv(metawrap_data, "results/CHST-metawrap-MAG-stats.csv")
```
Here is an example of the resulting csv for CHST.
```
Metric,bin.1,bin.2
Comp.,99.03,97.63
Cont.,0,1.96
Contigs,11,99
N50,1,19
L50,770501,17351
Contig_max,770501,70370
GC,0.43,0.35
Size,1506883,1109950
CDS_Dens.,0.733,0.859
Tot.CDS,1262,1075
Avg.Gene.Len.,292.458795562599,296.435348837209
23S,1,1
15S,6,1
5S,5,0
rRNA,12,2
tRNA,43,35
ori,0,1
ANI,N/A,97.58
AF,N/A,0.862
Closest_ref,N/A,GCF_021378375.1
Domain,Bacteria,Bacteria
Phyla,Pseudomonadota,Pseudomonadota
Clade,Gammaproteobacteria,Alphaproteobacteria
Order,Enterobacterales,Rickettsiales
Family,Enterobacteriaceae,Anaplasmataceae
Genus,Erwinia,Wolbachia
Species,,Wolbachia pipientis
```