
Metagenomics Index Correction

This repository contains scripts used to prepare, compare and analyse metagenomic classifications made with custom index databases, based on either the default NCBI taxonomy or the GTDB taxonomy. These scripts were used in the preparation of the Méric, Wick et al. (2019) manuscript entitled Correcting index databases improves metagenomic studies.

Our custom index databases are hosted on figshare:
monash.figshare.com/projects/Metagenomics_Index_Correction/65534

Table of contents

  • Using custom indices
    • Centrifuge
    • Kraken2
  • Scripts for analysis/index creation used in the manuscript
    • Requirements
    • Custom classification indices using GTDB definitions
    • Taxonomic rank counts from Centrifuge output
    • Custom dereplication thresholds for GTDB assemblies
    • Counting N-heavy reads in FASTQ sequencing reads
    • Finding reads unclassified in one index but not another
  • License

Using custom indices

Centrifuge

We used Centrifuge (version 1.0.4) in our manuscript because it uses less RAM than alternative tools. Centrifuge indices consist of a few *.cf files.

Please consult the Centrifuge manual for full instructions on its usage, but in brief the steps are:

  1. Download a pre-compiled index (from our figshare project).
  2. Run Centrifuge on a paired-end Illumina read set:
    • centrifuge -p 16 -x index_dir -1 *_1.fq.gz -2 *_2.fq.gz -S classifications --report-file report.tsv
    • If you need a paired-end read set to try this out, here is the SRS015782 faecal metagenome from the Human Microbiome Project that we used in the study.
    • We found that Centrifuge took 10–45 minutes, depending on the size of the sample and the speed of the computer/cluster.
  3. Generate a Kraken-style summary report:
    • centrifuge-kreport -x index_dir classifications > kreport.txt
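
If you want to work with the report programmatically, here is a minimal Python sketch (not part of this repository) that pulls the species-level rows out of kreport.txt. It assumes the standard Kraken-style report columns: percentage, clade read count, direct read count, rank code, tax ID and indented name.

# list_species.py: print species-level rows from a Kraken-style report
with open('kreport.txt') as report:
    for line in report:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 6:
            continue  # skip malformed lines
        percent, clade_reads, _direct, rank, taxid, name = fields[:6]
        if rank == 'S':  # 'S' marks species-level rows
            print(f'{name.strip()}\t{taxid}\t{clade_reads}\t{percent}%')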

Kraken2

Kraken2 is an alternative metagenomic classification tool. It is faster than Centrifuge but may use more RAM. Its indices consist of a few *.k2d files. Please consult the Kraken2 manual for usage instructions. As part of this work, we also generated custom Kraken2 (and Kraken1) indices, which can be downloaded from our figshare project.
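
If you prefer to drive Kraken2 from a script, here is a rough sketch of a paired-end run in Python; the Kraken2 manual remains the authority on its options, and the index directory and read filenames below are placeholders.

# run_kraken2.py: a minimal sketch of a paired-end Kraken2 run
import subprocess

subprocess.run([
    'kraken2',
    '--db', 'kraken2_index_dir',    # directory holding the *.k2d files
    '--threads', '16',
    '--paired',
    '--report', 'report.txt',       # Kraken-style summary report
    '--output', 'classifications',  # per-read classifications
    'reads_1.fq.gz', 'reads_2.fq.gz',
], check=True)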

Scripts for analysis/index creation used in the manuscript

Requirements

To run these scripts you will need:

Custom classification indices using GTDB definitions

The tax_from_gtdb.py script generates NCBI-style taxonomy files using the GTDB definitions. It can also prepare the sequence files necessary for building a Centrifuge and/or Kraken2 database.

Usage – just make the taxonomy files:

tax_from_gtdb.py --gtdb taxonomy.tsv --nodes nodes.dmp --names names.dmp

Usage – prepare for building a Centrifuge database:

tax_from_gtdb.py --gtdb taxonomy.tsv --assemblies genomes --nodes ex.tree --names ex.name --conversion ex.conv --cat_fasta ex.fa

Usage – prepare for building a Kraken2 database:

tax_from_gtdb.py --gtdb taxonomy.tsv --assemblies genomes --nodes nodes.dmp --names names.dmp --kraken_dir kraken_genomes
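
To give a feel for what the conversion involves (an illustrative sketch, not the script's actual code), each GTDB taxonomy string is a semicolon-delimited list of rank-prefixed names that must become NCBI-style node and name records. The rank mapping here (e.g. d__ to superkingdom) is an assumption of the sketch.

# Toy GTDB-to-NCBI-style conversion: each taxon gets an integer ID,
# a parent ID and an NCBI-style rank.
RANKS = {'d': 'superkingdom', 'p': 'phylum', 'c': 'class', 'o': 'order',
         'f': 'family', 'g': 'genus', 's': 'species'}
ids, nodes = {'root': 1}, {1: (1, 'no rank', 'root')}

def add_lineage(gtdb_string):
    parent = 1  # start at the root
    for taxon in gtdb_string.split(';'):
        if taxon not in ids:
            ids[taxon] = len(ids) + 1
            # store parent ID, rank and bare name (minus the 'x__' prefix)
            nodes[ids[taxon]] = (parent, RANKS[taxon[0]], taxon[3:])
        parent = ids[taxon]

add_lineage('d__Bacteria;p__Proteobacteria')
for taxid, (parent, rank, name) in nodes.items():
    print(f'{taxid}\t|\t{parent}\t|\t{rank}\t|\t{name}')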

Input:

Output:

Taxonomic rank counts from Centrifuge output

The count_classifications.py script breaks down read classifications by taxonomic rank, starting from raw Centrifuge classification output.

For reads with multiple classifications to different tax IDs, this script finds the LCA of those tax IDs to give a final classification. It also categorises each read based on the taxonomic rank of its final classification (unclassified, root, domain, phylum, class, order, family, genus or species).

The NCBI taxonomy contains many additional, in-between ranks (e.g. subspecies, superfamily). If a read's final classification falls on one of these, the script gives it the rank of the first ancestor with a standard rank, e.g. subspecies -> species, superfamily -> order.
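
The LCA step itself is conceptually simple. Here is a minimal sketch, assuming a parent_of dict that maps each tax ID to its parent (as loaded from the nodes file, where the root is its own parent):

# Lowest common ancestor of two tax IDs via their ancestor paths.
def ancestors(taxid, parent_of):
    path = [taxid]
    while parent_of[taxid] != taxid:  # the root points to itself
        taxid = parent_of[taxid]
        path.append(taxid)
    return path

def lca(a, b, parent_of):
    a_ancestors = set(ancestors(a, parent_of))
    for taxid in ancestors(b, parent_of):
        if taxid in a_ancestors:
            return taxid  # first shared ancestor, walking up from b
    return 1  # fall back to the root

parent_of = {1: 1, 2: 1, 3: 2, 4: 2}  # toy tree: 3 and 4 share parent 2
print(lca(3, 4, parent_of))  # -> 2

For a read with more than two hits, the pairwise LCA can simply be folded across all of its tax IDs.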

Usage:

count_classifications.py --centrifuge classifications --tree nodes.file --prefix out_prefix 1> read_set_summary 2> per_read_summary

Input:

Output:

Custom dereplication thresholds for GTDB assemblies

"Dereplication" refers to the threshold-based selection of representative reference genomes for phylogenetically-similar clusters, and has been used in the GTDB study to provide a reduced, less-redundant list of 28941 bacterial and archaeal reference genomes that are representative of similarity clusters on a phylogenetic tree. By default, the "dereplicated" list proposed on the downloads section of the GTDB website contains (release 86; September 2018) 27372 bacterial and 1569 archaeal genomes. As explained in the GTDB publication, the dereplication was made according to various criteria, but mostly defining two genomes as being "replicates" when their Mash distance was ≤0.05 (∼ANI of 95%).

Here, our dereplicate_assemblies.py script uses Mash-based clustering to dereplicate GTDB assemblies at a user-defined Mash distance threshold. This reduces the number of assemblies (making for faster and smaller Centrifuge index databases) while retaining a user-controlled amount of genomic diversity.
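
The clustering idea is greedy: an assembly within the threshold of an already-chosen representative joins that representative's cluster, otherwise it starts a new one. Here is a minimal sketch, with a distance(a, b) function standing in for the Mash distance lookup (the real script may differ in ordering and tie-breaking):

# Greedy threshold dereplication: keep one representative per cluster.
def dereplicate(assemblies, distance, threshold=0.005):
    representatives = []
    for assembly in assemblies:
        if all(distance(assembly, rep) > threshold for rep in representatives):
            representatives.append(assembly)  # starts a new cluster
    return representatives

Lower thresholds keep more genomes (more diversity, larger index); higher thresholds collapse more clusters.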

Usage:

dereplicate_assemblies.py --threads 32 --threshold 0.005 all_assemblies_dir derep_assemblies_dir bac_and_arc_taxonomy_r86.tsv

Parameters:

Input:

Output:

Counting N-heavy reads in FASTQ sequencing reads

The simple read_set_n_count.py script was written because some HMP read sets contain many N-heavy reads. It takes one or more read files as input and outputs a table summarising how many reads in each file are mostly Ns.
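
The underlying check is essentially a per-read N count. Here is a stripped-down sketch of the idea (plain FASTQ only, and the 50% cut-off is illustrative rather than the script's actual definition of "mostly N"):

# Count reads whose bases are more than half N in a plain FASTQ file.
import sys

total = mostly_n = 0
with open(sys.argv[1]) as fastq:
    for i, line in enumerate(fastq):
        if i % 4 == 1:  # sequence lines sit at every 4th line, offset 1
            seq = line.strip()
            total += 1
            if seq.upper().count('N') > len(seq) / 2:
                mostly_n += 1
print(f'{total}\t{mostly_n}')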

Usage:

read_set_n_count.py file.fastq > counts

Input:

Output:

Finding reads unclassified in one index but not another

The find_unclassified.py script compares Centrifuge classifications made with two different indices and identifies reads that are unclassified with one index but not the other. To make sense, it must be run on the outputs of two classifications of the same sample with two different indices. In our analysis, this script was used to reclassify, against the nt database, reads that had remained unclassified with the GTDB_r86_46k index.
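
Here is a minimal sketch of the comparison, assuming standard per-read Centrifuge output (tab-separated with a header line, read ID in the first column and a seqID of "unclassified" for unclassified reads):

# Reads unclassified under the first index but classified under the second.
import sys

def unclassified_reads(filename):
    reads = set()
    with open(filename) as output:
        next(output)  # skip the header line
        for line in output:
            read_id, seq_id = line.split('\t')[:2]
            if seq_id == 'unclassified':
                reads.add(read_id)
    return reads

for read_id in sorted(unclassified_reads(sys.argv[1]) - unclassified_reads(sys.argv[2])):
    print(read_id)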

Usage:

find_unclassified.py index1_centrifuge_output index2_centrifuge_output | grep -v unclassified > output

Input:

Output:

License

GNU General Public License, version 3