A MAG Circularization Method By Lauren Lui, Torben Nielsen, and Adam Arkin
This is a method to help circularize genomes from shotgun metagenomics data. We named this project Jorg after Jörgmungandr, aka the Midgard Serpent, from Norse mythology. It is an example of an ouroboros, a snake biting its own tail.
Contact Lauren with questions (lmlui at lbl dot gov).
Please see the manuscript "A method for achieving complete microbial genomes and better quality bins from metagenomics data" on bioRxiv and cite it if you use this method.
We thank Sean Jungbluth for helping extend the code for use in the DOE Systems Biology KnowledgeBase.
This method assumes that you already have a pipeline that you like to use for assembling your metagenomes and creating bins. We like to use BBtools for trimming and read filtering, metaSPAdes for assembly, and MetaBat 2 for binning. What we describe here is steps D-G in the above figure.
We recommend installing seqtk by either of these methods:
conda install -c bioconda seqtk
After installing seqtk make sure that it is in your path.
Information on BWA can be found here: http://bio-bwa.sourceforge.net/. You can install bwa by either of these methods
make
.conda install bwa
.BWA is used to determine the coverage of the bin. Most binning software should also output coverage values. However not all binning software will output the same values, especially if you are changing parameters! We recommend using BWA or checking the contig coverage values output by MetaBat 2 (<contigfile>.depth.txt
file).
LAST is designed to find regions of sequence similarity in large datasets. We build a LAST DB to help determine if a genome has been circularized.
conda install -c bioconda last
You can find the source and binaries for Pilon on the Pilon Github page by the Broad Institute. You can also install Pilon using conda. Although this is optional, it is useful for checking for any misasssemblies not corrected by the MIRA assembly.
Binaries and directions for installing Infernal can be found on the Infernal homepage.
One of the final checks is to find a full complement of ribosomal RNAs (16S, 23S, 5S), tRNAs (all amino acids represented) and RNase P RNA. PROKKA can be used to find tRNAs and rRNAs and tRNAscanSE can be used to find tRNAs, but you will need to use Infernal to find RNase P RNA. There are two types of bacterial RNase P RNA and two types of archaeal RNase P RNA. Each have different models in RFAM (https://rfam.org/family/RF00010).
We use BBtools for cleaning the reads, SPAdes for assembly, and MetaBat 2 for binning.
Criteria for picking a bin:
The Jorg script will do the following:
mirabait
to map reads to a binmira
to reassemble the readsRepeat steps 2 and 3 until reaching the number of iterations indicated
The jorg
script requires that you specify a kmer value for baiting the reads and a minimum coverage value for filtering contigs. Here are some guidelines for choosing these parameters:
mirabait requires a kmer value for baiting reads. We recommend starting with 31 or 33. Increasing the kmer value will make the read recruitment more strict if you are worried that you are picking up reads that do not belong in your bin.
During each iteration contigs that do not meet the minimum coverage are filtered out. We recommend starting with a minimum coverage of 75% of the top contig. Use BWA to map reads to get coverage values or see the totalAvgDepth
field in the <contigfile>.depth.txt
file in MetaBat 2 output.
manifest_template.conf
file is in the same directory.jorg -b bin.186.fa -r SRX3307784_clean.fastq.gz -k 33 -c 50 -i 5 --high_contig_num no
where 33 is the kmer value, bin.186.fa is the fasta file with contigs, SRX3307784_clean.fastq.gz are your interleaved sequencing reads that have been trimmed and quality checked, 50 is the minimum coverage value, and 5 is the number of iterations.
mirabait.log
- log file from mirabait
. This is extremely helpful if you are getting errors at this step.mira.log
- log file from mira
.iterations.txt
- contig stats for the assembly from each iterationIterations
with the assemblies from each iteration<binID>.out.fasta
- Assembly from the last iterationThe jorg
script will output a file called iterations.txt
with contig stats. Check this file to see if the contigs are getting longer. You may also want to remove contigs that appear to be contamination, e.g. those that are short and are not extending, before the next set of iterations. If you need to continue iterating, use the <binID>.out.fasta
file as input to the next round.
Note that sometimes the best assembly might not be the final iteration.
Stop when you reach one of these three outcomes:
You find a single contig with a significant - and exact - repeat at the ends. In addition, we required that the repeat be at least 100 nt in length, is longer than any other repeat in the contig, and does not match any of the other repeats.
See the script circle_check_using_last
for an automated look at the location and length of repeats in a genome. This script requires that LAST is installed.
You observe no change in the assembled contigs after a round of read pair extraction and reassembly with MIRA.
There are cases where a bin is shattered into a multitude of pieces. We are not certain as to the exact cause, but this result is likely due to misassemblies from the initial SPAdes assembly (discussed in more depth in the manuscript). Chaos appears strongly correlated with GC and tends to occur more often when the GC content is high. We believe that Chaos bins are caused by lack of read coherence in the contigs.
After circularizing a contig, we do final checks for misassemblies with Pilon. We use Pilon on the contig and then we rotate it by half the length to ensure that the ends were in the middle and apply Pilon again. See the script fasta_rotate
. We rotate the genomes because Pilon is not capable of covering the ends of a contig.
Pilon has a detailed Github wiki page that details how to use it and understand its output.
To look for tRNAs, rRNAs, and RNase P RNA, scan genomes with models from RFAM by using cmsearch
from Infernal. If your genome is missing any of these RNAs, it might be falsely circularized.