kevinVervier / metagm

Team162 metagenome pipeline
GNU General Public License v3.0

metagm: the metagenomic platform from Team162

On this page, we provide examples illustrating the different options offered by the platform.

How to cite

metagm_build: "Quality control on reference genomes was done using checkm (ref), filtering out assemblies with more than 5% contamination or less than 90% completeness. High-quality assemblies were used to build a Kraken2 (ref) database. A Bracken (ref) database was also build."

metagm_classify: "Paired-end read files were classified using Kraken2 (ref) custom database (include description? n species?). Bracken (ref) was then applied to obtain species-level metagenomics profiles."

Before starting

# add the metagm library to your path
export PATH=/nfs/team162/kv4/github/metagm/metagm/wrapper:$PATH

Then, depending on the server you use (farm3 or farm5), activate the corresponding virtual environment.

Note: to leave the virtual environment, type deactivate

metagm_build module

This section describes the process of building databases for different software tools (Kraken, Bracken, Mash, ...) using a list of reference genomes. It also takes care of submitting all the jobs on the farm and managing job dependencies.

metagm_build_pipeline

Inputs (mandatory)

There are two mandatory positional arguments for this function:

Options

The script metagm_build.py also offers options for the database building process:

Outputs

The output folder contains a directory for each task that is performed:

Examples

The following examples illustrate various features of the metagm_build.py script. Depending on how busy the farm is, these examples can take some time to run.

Quality control + taxonomic assignment on a list of genomes

metagm_build.py /nfs/team162/kv4/bin/list_example_pipeline.txt ./test --QC --taxoAssign -m 150

The command applies:

  1. Quality control on 13 genomes.
    • according to ./merge_final/ValidatedGenomes.txt, 12 genomes passed QC
    • according to ./merge_final/FilteredGenomes.txt, 1 genome failed QC
    • according to ./merge_final/log.txt, the genome was filtered because of XYZ
  2. Taxonomic assignment on validated genomes only, using the GTDB taxonomy (default).
    • the taxonomic assignment can be found in ./genome_with_gtdb_taxid.txt
  3. Note that the taxonomic assignment step has a high memory requirement (~150Gb), hence the -m 150 flag.
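
If you want to check the QC outcome programmatically rather than by eye, a minimal sketch along these lines can be used (it assumes ValidatedGenomes.txt and FilteredGenomes.txt list one genome entry per line; adjust the parsing if the files are actually tab-delimited):

#in python3
#minimal sketch: count and list the genomes that passed/failed QC
with open('./merge_final/ValidatedGenomes.txt') as f:
    validated = [line.strip() for line in f if line.strip()]
with open('./merge_final/FilteredGenomes.txt') as f:
    filtered = [line.strip() for line in f if line.strip()]
print('%d genomes passed QC, %d failed' % (len(validated), len(filtered)))
for entry in filtered:
    print('failed QC:', entry)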

Quality control + taxonomic assignment on a list of genomes (faster)

This example achieves the same task as the previous one, except that it takes advantage of job parallelization. Instead of running jobs on 10 genomes at a time (2 jobs in the previous example), we now process batches of 2 genomes (-b flag), therefore submitting 7 smaller jobs.

#delete previous example folder
rm -rf ./test
# run the faster command
metagm_build.py /nfs/team162/kv4/bin/list_example_pipeline.txt ./test --QC --taxoAssign -m 150 -b 2

Kraken/Bracken database on a list of validated genomes

It is possible to skip the quality control and directly create the Kraken database if the user has already checked the genomes. Additionally, the taxonomic assignment step can be skipped if the user provides a taxid for each genome (third column in the input list); see the sketch at the end of this example. Here, we use the curated genome list generated in the previous example.

#Run on a queue with 'hugemem' (e.g., farm3)
metagm_build.py ./genome_with_gtdb_taxid.txt ./test --KrakenDB --BrackenDB

The command builds:

  1. Kraken2 database on XYZ genomes using GTDB taxonomy.
  2. Bracken files.
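
As mentioned above, the taxid is expected in the third column of the input list. Below is a minimal sketch to sanity-check such a list before building; it assumes a tab-delimited file with the genome path in the first column and the taxid in the third, so verify the actual layout of genome_with_gtdb_taxid.txt before relying on it:

#in python3
import csv
import os

#check that every genome file exists and that a numeric taxid is present in the third column
with open('./genome_with_gtdb_taxid.txt') as tsvfile:
    reader = csv.reader(tsvfile, delimiter='\t')
    for i, row in enumerate(reader):
        if not row:
            continue
        if not os.path.isfile(row[0]):
            print('missing genome file on line %d: %s' % (i + 1, row[0]))
        if len(row) < 3 or not row[2].isdigit():
            print('missing or invalid taxid on line %d' % (i + 1))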

Kraken/Bracken database on a list of non-QC genomes

It is possible to run all the previous steps in a single command: first, genomes are QC'ed, then validated genomes are assigned a taxonomy, and finally a Kraken/Bracken database is built.

#delete previous example folder
rm -rf ./test
#Run on a queue with 'hugemem' (e.g., farm3)
metagm_build.py /nfs/team162/kv4/bin/list_example_pipeline.txt ./test --QC --KrakenDB --BrackenDB -m 100 -b 2

The command runs:

  1. Quality control on 13 genomes.
    • according to ./merge_final/ValidatedGenomes.txt, 12 genomes passed QC
    • according to ./merge_final/FilteredGenomes.txt, 1 genome failed QC
    • according to ./merge_final/log.txt, the genome was filtered because of XYZ
  2. Taxonomic assignment on validated genomes only, using the GTDB taxonomy (default).
    • the taxonomic assignment can be found in ./genome_with_gtdb_taxid.txt
  3. Kraken2 database on XYZ genomes using GTDB taxonomy.
  4. Bracken files.

Comments

metagm_classify module

This section describes the process of classifying metagenomic sequencing reads to get both taxonomic and functional profiles. It also takes care of submitting all the jobs on the farm and managing job dependencies.

metagm_classify_pipeline

Inputs (mandatory)

There are two mandatory positional arguments for this function:

Options

The script metagm_classify.py offers software options for classification:

Outputs

The output folder contains a directory for each task that is performed:

Comments

Available Kraken/Bracken databases:

* __current__ Kraken1019: uses the MGnify curated collection (31,555 genomes) of bacteria and archaea. The taxon IDs have been assigned using GTDB.
* __current__ Kraken1019_with_nonBact_nonArch: same as Kraken1019, but also contains 131 gut-associated fungi, small eukaryotes and other parasites.
    * Kraken0419_kraken2_taxo_resolved: built with Kraken2, using the HBC + BGI collections and ~300 public genomes; the taxon IDs have been manually curated using GTDB rather than the default NCBI.
    * InternalKraken_OCT2017: HBC + public genomes (@Sam), relied on the old QC for reference genomes
    * RefSeq94n.msh: Mash sketch of the entire RefSeq (version 94)
    * taxonomy/ : custom taxonomy built from GTDB for Bacteria and Archaea, and saved in NCBI format for Kraken. It also contains a few viruses and eukaryotes.

Examples

The following examples illustrate various features of the metagm_classify.py script. Depending on how busy the farm is, these examples can take some time to run.

Bracken prediction for a list of metagenomes

metagm_classify.py --BrackenDB /nfs/pathogen005/team162/Kraken1019 /nfs/team162/kv4/bin/test_list_metagenomes_bangladesh.txt ./test_classify_pipeline -b 2

The command returns:

  1. in ./test_classify_pipeline/Bracken/*S.bracken, Bracken files on 6 metagenomes at species level (default).
    • also contains the merged table with all samples: ./test_classify_pipeline/Bracken/merged_S.bracken
  2. in ./test_classify_pipeline/Kraken/*.out, Kraken2 files on 6 metagenomes.
    • also contains the two merged tables with all samples from the Kraken output:
      • the raw read counts: ./test_classify_pipeline/Kraken/merged_raw.txt
      • the normalized abundances: ./test_classify_pipeline/Kraken/merged_norm.txt
  3. in ./test_classify_pipeline/Kraken/*bracken.out, Kraken2 files on 6 metagenomes, without unclassified reads.
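
For downstream analysis in Python, a minimal sketch for loading the merged species-level table is shown below; it assumes merged_S.bracken is tab-delimited with one row per taxon and one column per sample, so check the actual layout before relying on it:

#in python3
import pandas as pd  # assumes pandas is installed in your environment

#load the merged Bracken table (assumed tab-delimited, taxa as rows, samples as columns)
bracken = pd.read_csv('./test_classify_pipeline/Bracken/merged_S.bracken', sep='\t')
print(bracken.shape)   # number of taxa x number of columns
print(bracken.head())  # quick look at the first rows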

Bracken prediction for a list of metagenomes at genus level

A very similar command to the previous example, except that we introduce the --BrackenRank option to specify the rank Bracken should use:

metagm_classify.py --BrackenDB /nfs/pathogen005/team162/Kraken1019 /nfs/team162/kv4/bin/test_list_metagenomes_bangladesh.txt ./test_classify_pipeline_genus -b 2 --BrackenRank G

The command returns:

  1. in ./test_classify_pipeline_genus/Kraken/*.out, Kraken2 files on 6 metagenomes.
    • also contains the two merged tables with all samples from the Kraken output: the raw read counts (./test_classify_pipeline_genus/Kraken/merged_raw.txt) and the normalized abundances (./test_classify_pipeline_genus/Kraken/merged_norm.txt)
    • please note that these files should be exactly the same as in ./test_classify_pipeline/Kraken/, as the Kraken step considers all ranks
  2. in ./test_classify_pipeline_genus/Bracken/*G.bracken, Bracken files on 6 metagenomes at genus level.
    • also contains the merged table with all samples: ./test_classify_pipeline_genus/Bracken/merged_G.bracken

Detecting genomes in a list of metagenomes with Mash

For this example, RefSeq genomes are detected in metagenome samples. Please note that this exercise is not standard and still experimental, so be careful about how you use the results. The detected genomes are also likely to come from abundant organisms.

metagm_classify.py --MashDB /nfs/pathogen005/team162/RefSeq94n.msh /nfs/team162/kv4/bin/test_list_metagenomes_bangladesh.txt ./test_classify_pipeline_mash -b 2

The command returns:

  1. in ./test_classify_pipeline_mash/Mash/*_screen.tab, Mash screen results on 6 metagenomes.
  2. in ./test_classify_pipeline_mash/Mash/merged_mash.txt, the merged best hits detected with some confidence in the metagenomes
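
To go beyond the merged summary, here is a minimal sketch for filtering one per-sample Mash output; it assumes the standard mash screen column order (identity, shared-hashes, median-multiplicity, p-value, query-ID, query-comment), and the sample file name is hypothetical, so adapt both to your files:

#in python3
import csv

#keep only hits with high identity from one sample's screen output (hypothetical sample name)
hits = []
with open('./test_classify_pipeline_mash/Mash/sample1_screen.tab') as tsvfile:
    reader = csv.reader(tsvfile, delimiter='\t')
    for row in reader:
        identity = float(row[0])
        if identity >= 0.95:
            hits.append((row[4], identity))
#print the retained hits, best first
for query, identity in sorted(hits, key=lambda x: -x[1]):
    print(query, identity)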

Functional characterization of a list of metagenomes with Humann2

This pipeline feature relies on Humann2.

metagm_classify.py --functional /nfs/team162/kv4/bin/test_list_metagenomes_bangladesh.txt ./test_classify_pipeline_functional -b 2

The command returns:

Taxonomy

Taxonomic assignment

How to build a tree in NCBI format using gtdb metadata

The Kraken software relies on taxonomic information presented in the 'NCBI format' (a pair of nodes.dmp and names.dmp files). Unfortunately, GTDB does not (yet?) provide its taxonomy in such a format. Therefore, we need to build a GTDB-based taxonomy that can be saved in the NCBI format.

  1. download all the GTDB archaeal metadata and bacterial metadata, containing the taxonomic paths for every genome found in the database.
  2. extract the unique taxonomic paths from the metadata files to make the following steps faster
    # merge bacterial and archaeal metadata
    cat ar122_metadata.tsv bac120_metadata.tsv > bac_ar_metadata.tsv
    # get unique taxonomic paths
    cut -f17 bac_ar_metadata.tsv | sort -u > tmp.txt
    # remove last line that contains header
    head -n -1 tmp.txt > taxo_path_gtdb_bac120_ar122_unique.txt
    #remove temporary file
    rm tmp.txt
  3. create a Taxonomy tree in Python
#import library
import pickle # save binary object, like trees
import sys  # system command library
sys.path.append('/nfs/team162/kv4/github/metagm') # add the metagm library to your session
from metagm.phylogeny.TaxonomyTree import TaxonomyTree # class to generate trees
# this class only needs a list of taxonomic paths and will generate a taxonomy in NCBI format
tree = TaxonomyTree('taxo_path_gtdb_bac120_ar122_unique.txt')
# save the tree in NCBI format at the given location
tree.saveNCBIFormat('taxonomy_bac_ar')
# also save the tree in a Pickle format (Python binary) for later
tree.savePickleFormat('/nfs/team162/kv4/bin/gtdb_metadata/taxonomy_bac_ar/taxo.pyc')

How to add nodes to an existing taxonomic tree

If a tree needs to be updated by adding new nodes, it is not necessary to rebuild it from scratch. The user can provide a Pickle tree and new taxonomic paths:

#import library
import pickle  # load binary object, like trees
import csv # read csv files
import sys  # system command library
sys.path.append('/nfs/team162/kv4/github/metagm') # add the metagm library to your session
from metagm.phylogeny.TaxonomyTree import TaxonomyTree # class to generate trees

#load existing tree to add nodes
with open('taxonomy_bac_ar/taxo.pyc', "rb") as input_file:
    tree = pickle.load(input_file)

#Here we want to add non-bacterial and non-archaeal nodes (fungi, viruses and other eukaryotes)
extraNodes = '/nfs/team162/kv4/Kraken1019/nonBacterial_taxopath.txt'
with open(extraNodes) as tsvfile:
    reader = csv.reader(tsvfile, delimiter='\t')
    for i, row in enumerate(reader):
        print(row[0])
        # add a node if it is new
        tree.add_node(row[0])

# save the updated tree in NCBI format at the given location
tree.saveNCBIFormat('taxonomy')
# also save the tree in a Pickle format (Python binary) for later
tree.savePickleFormat('taxonomy/taxo.pyc')

Statistical analysis

This section provides some R snippets for analysing the metagm_classify output files.

More details are given in the R vignette '.Rmd', also found in this repository.

MySQL Knowledge database

Connection Setup

(screenshot: dbviz_new_1)

  1. Select Use Wizard, then choose a name for the connection (example: microbiota_db), then press Next

(screenshot: dbviz_new_2)

  2. Select MySQL as the Database driver, then press Next

(screenshot: dbviz_new_3)

  3. Fill in all the information as shown in the screenshot (password not included), then press Finish

(screenshot: dbviz_new_4)

  4. The user can then explore the different tables in the database (example: isolates)

(screenshot: dbviz_new_6)
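
If you prefer to query the database from Python rather than through the GUI, a minimal sketch along these lines can be used; the host, port, user and database names below are placeholders rather than the real connection details, and the mysql-connector-python package is assumed to be installed:

#in python3
import mysql.connector  # assumes the mysql-connector-python package is installed

#placeholder connection details: replace with the real values used in the wizard above
cnx = mysql.connector.connect(host='db-host.example.org', port=3306,
                              user='team162_reader', password='********',
                              database='microbiota_db')
cursor = cnx.cursor()
#print the first few rows of the isolates table
cursor.execute('SELECT * FROM isolates LIMIT 5')
for row in cursor.fetchall():
    print(row)
cursor.close()
cnx.close()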

Genomes table

All the genome assemblies can be found in /lustre/scratch118/infgen/team162/kv4/working_list_genomes_kraken.

Miscellaneous

Other utility functions/scripts

Quality control

This section describes the quality control applied to genome assemblies in the metagm_build module. Here are the different steps with their corresponding default values:

Functional analysis of metagenomes

Strain-level analysis using PanPhlan

If you want to run the PanPhlan tool for a specific species and get strain-level resolution, you might want to check if they provide a database for this species here. If they do, we already have them downloaded in /lustre/scratch118/infgen/team162/ys4/panphlan/panphlan_db. If not, you could build your own following this tutorial.

Extract 16S for a list of assemblies

There is a wrapper script for rnammer found in /nfs/team162/kv4/bin/run_rnammer.sh. It takes a list of paths to assemblies in a text file as well as an output directory to store the results. Example:

sh /nfs/team162/kv4/bin/run_rnammer.sh /nfs/team162/kv4/bin/test_rnammer.txt ./16S_extract/
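
Once the script has finished, a minimal sketch for inspecting the extracted sequences is shown below; it assumes rnammer writes FASTA files into the output directory, and the *.fasta naming pattern is an assumption, so adapt the glob to what you actually find in ./16S_extract/:

#in python3
import glob
from Bio import SeqIO  # assumes Biopython is installed

#loop over the FASTA files produced in the output directory (naming pattern is an assumption)
for fasta in glob.glob('./16S_extract/*.fasta'):
    for record in SeqIO.parse(fasta, 'fasta'):
        print(fasta, record.id, len(record.seq))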

Python library classes

This section describes the built-in classes of the metagm library. They can be used to support further Python script development or to add features in an efficient manner.

BacterialGenome() class:

This class creates a genome object from a sequence file and stores various information about it:

Examples

#in python3

#load libraries
import sys
sys.path.append('/nfs/team162/kv4/github/metagm')
from metagm.sequences.BacterialGenome import BacterialGenome

# create a genome using its sequence
g = BacterialGenome(genomefilepath='/lustre/scratch118/infgen/team162/kv4/working_list_genomes_kraken/GCA_900129655.1_IMG-taxon_2695420967_annotated_assembly_genomic.fna',  genomename='Bacteroides clarus GCF_900129655', taxid='626929')

# it will run QC on it first (default)
# qc=False can be used in the previous command to avoid running QC

# then, asking if the genome passed QC
g.is_valid()
# True means the genome passed QC

# then, extract the 16S sequence of this genome (rnammer)
g.get_16s_sequence()
g.seq16s # it is a SeqRecord object
len(g.seq16s) # one 16S copy
#get the associated 16S fasta sequence
print(g.seq16s["rRNA_FQWK01000014.1_260-1774_DIR+"].format("fasta"))

GenomeList() class:

This class creates an object made of multiple BacterialGenome objects and makes it easy to serialize analyses. Its components are:

Examples

#in python3.6

#load libraries
import csv
import sys
sys.path.append('/nfs/team162/kv4/github/metagm')
from metagm.sequences.BacterialGenome import BacterialGenome
from metagm.sequences.GenomeList import GenomeList

# create list of genomes from an input file
print('Loading list of genomes...')
glist = GenomeList(inputfile='/nfs/team162/kv4/bin/test_list_genomes.txt')
with open(glist.inputfile) as tsvfile:
    reader = csv.reader(tsvfile, delimiter='\t')
    for i, row in enumerate(reader):
        g = BacterialGenome(genomefilepath=row[0], genomename=row[1], qc=False)
        # append to existing list
        glist.append(g, checkvalid=False)
#get list length -->10
len(glist)

#get name of first genome in list
glist.genomelist[0].genomename

#get path to the fourth genome in the list (index 3)
glist.genomelist[3].genomefilepath

#remove the first genome in the list using its name
glist.remove(genomeidentifier=glist.genomelist[0].genomename)
#get list length --> 9
len(glist)

#get name of first genome in list --> different than previously
glist.genomelist[0].genomename

#get gtdb taxid for a list of genomes (submit 1 job on long queue, might take some time)
glist.list_assign_taxo_gtdb()
# create a 'myTaxids.txt' file with the results and store the results in the list also
glist.taxidfile
# get the list of taxids attached to the genomes
glist.taxidlist

Metagenome() class:

This class stores a metagenome sample. Its components are:

Examples

#in python3.6

#load libraries
import sys
sys.path.append('/nfs/team162/kv4/github/metagm')
from metagm.sequences.Metagenome import Metagenome

# create a metagenome object from paired fastq
mg = Metagenome(metagenomename='ERR2835563', metafastqfile1='/lustre/scratch118/infgen/pathogen/pathpipe/pathogen_prok_external/seq-pipelines/human/gut_metagenome/TRACKING/235/PS.170.S4.5.metagenomic/SLX/ERX2842313/ERR2835563/ERR2835563_1.fastq.gz',metafastqfile2='/lustre/scratch118/infgen/pathogen/pathpipe/pathogen_prok_external/seq-pipelines/human/gut_metagenome/TRACKING/235/PS.170.S4.5.metagenomic/SLX/ERX2842313/ERR2835563/ERR2835563_2.fastq.gz')

# the builder automatically detects whether the data are paired-end
mg.pairedFlag
# the builder automatically detects whether the data are zipped
mg.zippedFlag

MetagenomeList() class:

This class creates an object made of multiple Metagenome objects and makes it easy to serialize analyses. Its components are:

Examples

#in python3.6

#load libraries
import csv
import sys
import os
import re
sys.path.append('/nfs/team162/kv4/github/metagm')
from metagm.sequences.Metagenome import Metagenome
from metagm.sequences.MetagenomeList import MetagenomeList

# create list of metagenomes from an input file
mglist = MetagenomeList()
with open('/nfs/team162/kv4/bin/test_list_metagenomes_bangladesh.txt') as tsvfile:
    reader = csv.reader(tsvfile, delimiter='\t')
    for i, row in enumerate(reader):
        # retrieve the paired-end fastq from folder path only
        files = [f for f in os.listdir(os.path.dirname(row[0])) if re.match(os.path.basename(row[0])+'_[12].fastq.gz$', f)]
        mg=Metagenome(metagenomename=os.path.basename(row[0]),metafastqfile1=os.path.dirname(row[0])+'/'+files[0], metafastqfile2=os.path.dirname(row[0])+'/'+files[1])
        # append
        mglist.append(mg)

#get list length --> 5
len(mglist)

#classify these samples with Kraken (submit jobs only and return the job IDs)
mglist.classify_kraken('/nfs/pathogen005/team162/Kraken0419_kraken2_taxo_resolved')

#results are stored in './tmp'

TaxonomyTree() class:

This class stores a tree structure using the ETE3 toolkit. Its components are:

Examples

#in python3.6

#load libraries
import pickle
import csv
import sys
sys.path.append('/nfs/team162/kv4/github/metagm')
from metagm.phylogeny.TaxonomyTree import TaxonomyTree

# create a tree object from a list of 10 paths
tree = TaxonomyTree('/nfs/team162/kv4/bin/test_list_taxopaths.txt')

tree.maxtaxid
#51 nodes were created

#print the tree in the terminal (ASCII format)
tree.printAscii()

#save tree in NCBI format
tree.saveNCBIFormat('./')

#save tree in Pickle format (can be loaded in any Python session)
tree.savePickleFormat('./taxo.pyc')

TODO

This section lists features that are either under development or not yet implemented: