lmlui / Jorg

MAG Circularization Method
GNU General Public License v3.0
33 stars 5 forks source link

Jorg

A MAG Circularization Method By Lauren Lui, Torben Nielsen, and Adam Arkin

This is a method to help circularize genomes from shotgun metagenomics data. We named this project Jorg after Jörgmungandr, aka the Midgard Serpent, from Norse mythology. It is an example of an ouroboros, a snake biting its own tail.

Contact Lauren with questions (lmlui at lbl dot gov).

Please see the manuscript "A method for achieving complete microbial genomes and better quality bins from metagenomics data" on bioRxiv and cite it if you use this method.

We thank Sean Jungbluth for helping extend the code for use in the DOE Systems Biology KnowledgeBase.

Overview of Method

This method assumes that you already have a pipeline that you like to use for assembling your metagenomes and creating bins. We like to use BBtools for trimming and read filtering, metaSPAdes for assembly, and MetaBat 2 for binning. What we describe here is steps D-G in the above figure.

Installation

System Requirements

Dependencies

Optional for final checks after circularization

Installing MIRA

Installing seqtk

We recommend installing seqtk by either of these methods:

  1. Install using conda. conda install -c bioconda seqtk
  2. Install from source. Follow the directions from the seqtk Github site.

After installing seqtk make sure that it is in your path.

Installing BWA (optional)

Information on BWA can be found here: http://bio-bwa.sourceforge.net/. You can install bwa by either of these methods

  1. Download from SourceForge (https://sourceforge.net/projects/bio-bwa/). Unzip and go into the directory. Type make.
  2. Using conda. conda install bwa.

BWA is used to determine the coverage of the bin. Most binning software should also output coverage values. However not all binning software will output the same values, especially if you are changing parameters! We recommend using BWA or checking the contig coverage values output by MetaBat 2 (<contigfile>.depth.txt file).

Installing LAST (optional)

LAST is designed to find regions of sequence similarity in large datasets. We build a LAST DB to help determine if a genome has been circularized.

  1. Installation using conda. conda install -c bioconda last
  2. See the LAST website for more information and directions for manual installation.

Installing Pilon (optional)

You can find the source and binaries for Pilon on the Pilon Github page by the Broad Institute. You can also install Pilon using conda. Although this is optional, it is useful for checking for any misasssemblies not corrected by the MIRA assembly.

Installing Infernal (optional)

Binaries and directions for installing Infernal can be found on the Infernal homepage.

One of the final checks is to find a full complement of ribosomal RNAs (16S, 23S, 5S), tRNAs (all amino acids represented) and RNase P RNA. PROKKA can be used to find tRNAs and rRNAs and tRNAscanSE can be used to find tRNAs, but you will need to use Infernal to find RNase P RNA. There are two types of bacterial RNase P RNA and two types of archaeal RNase P RNA. Each have different models in RFAM (https://rfam.org/family/RF00010).

General guide

Assemble your metagenome and bin contigs

We use BBtools for cleaning the reads, SPAdes for assembly, and MetaBat 2 for binning.

Pick a bin to circularize

Criteria for picking a bin:

  1. A bin with <10 contigs. We recommend picking a bin that has fewer than 10 contigs to increase success, but you can pick any bin and likely it will be improved using this method. We made exceptions for bins that looked promising, such as a bin with many contigs, but with one or two large contigs that comprise most of the bin’s sequence length
  2. The bin has an average coverage >30X. Higher coverage is better for success. If you didn't get the coverage from your binning output, use bwa to map reads to the longest contig to help determine the coverage. Typically this is a good coverage estimate.

Use the Jorg script to iterate mapping reads to the bin with mirabait and reassemble with MIRA. Note: typically it takes at least several hours for each iteration of Jorg to run, so jobs using a high number of iterations may take multiple days.

Overview

The Jorg script will do the following:

  1. Create a manifest file for MIRA
  2. Use mirabait to map reads to a bin
  3. Use mira to reassemble the reads
  4. Repeat steps 2 and 3 until reaching the number of iterations indicated

    Parameters

    The jorg script requires that you specify a kmer value for baiting the reads and a minimum coverage value for filtering contigs. Here are some guidelines for choosing these parameters:

    Kmer value for baiting

    mirabait requires a kmer value for baiting reads. We recommend starting with 31 or 33. Increasing the kmer value will make the read recruitment more strict if you are worried that you are picking up reads that do not belong in your bin.

    Minimum coverage

    During each iteration contigs that do not meet the minimum coverage are filtered out. We recommend starting with a minimum coverage of 75% of the top contig. Use BWA to map reads to get coverage values or see the totalAvgDepth field in the <contigfile>.depth.txt file in MetaBat 2 output.

    Using the script

jorg -b bin.186.fa -r SRX3307784_clean.fastq.gz -k 33 -c 50 -i 5 --high_contig_num no 

where 33 is the kmer value, bin.186.fa is the fasta file with contigs, SRX3307784_clean.fastq.gz are your interleaved sequencing reads that have been trimmed and quality checked, 50 is the minimum coverage value, and 5 is the number of iterations.

Check assembly stats and repeat as necessary

Circularization

You find a single contig with a significant - and exact - repeat at the ends. In addition, we required that the repeat be at least 100 nt in length, is longer than any other repeat in the contig, and does not match any of the other repeats.

See the script circle_check_using_last for an automated look at the location and length of repeats in a genome. This script requires that LAST is installed.

Idempotence

You observe no change in the assembled contigs after a round of read pair extraction and reassembly with MIRA.

Chaos

There are cases where a bin is shattered into a multitude of pieces. We are not certain as to the exact cause, but this result is likely due to misassemblies from the initial SPAdes assembly (discussed in more depth in the manuscript). Chaos appears strongly correlated with GC and tends to occur more often when the GC content is high. We believe that Chaos bins are caused by lack of read coherence in the contigs.

Use Pilon as a final check for misasssemblies

Final checks for non-coding RNAs

To look for tRNAs, rRNAs, and RNase P RNA, scan genomes with models from RFAM by using cmsearch from Infernal. If your genome is missing any of these RNAs, it might be falsely circularized.