kevinVervier / metagm

Team162 metagenome pipeline
GNU General Public License v3.0

metagm: the metagenomic platform from Team162

On this page, we provide examples illustrating the different options offered by the platform.

How to cite

metagm_build: "Quality control on reference genomes was done using checkm (ref), filtering out assemblies with more than 5% contamination or less than 90% completeness. High-quality assemblies were used to build a Kraken2 (ref) database. A Bracken (ref) database was also build."

metagm_classify: "Paired-end read files were classified using Kraken2 (ref) custom database (include description? n species?). Bracken (ref) was then applied to obtain species-level metagenomics profiles."

Before starting

# add the metagm library to your path
export PATH=/nfs/team162/kv4/github/metagm/metagm/wrapper:$PATH

Then, depending on the server you use (farm3 or farm5), activate the corresponding virtual environment.

Note: to leave the virtual environment, type deactivate

metagm_build module

This section describes the process of building databases for different software tools (Kraken, Bracken, Mash, ...) using a list of reference genomes. It also takes care of submitting all the jobs on the farm and managing job dependencies.

metagm_build_pipeline

Inputs (mandatory)

There are two mandatory positional arguments for this function:

Options

The script metagm_build.py also offers options for the database building process:

Outputs

The output folder contains a directory for each task that is performed:

Examples

The following examples illustrate various features of the metagm_build.py script. Depending on how busy the farm is, these examples can take some time to run.

Quality control + taxonomic assignment on a list of genomes

metagm_build.py /nfs/team162/kv4/bin/list_example_pipeline.txt ./test --QC --taxoAssign -m 150

The command applies:

  1. Quality control on 13 genomes.
    • according to ./merge_final/ValidatedGenomes.txt, 12 genomes passed QC
    • according to ./merge_final/FilteredGenomes.txt, 1 genome failed QC
    • according to ./merge_final/log.txt, the genome was filtered because of XYZ
  2. Taxonomic assignment on validated genomes only, using the GTDB taxonomy (default).
    • the taxonomic assignment can be found in ./genome_with_gtdb_taxid.txt
  3. Note that the taxonomic assignment step has a high memory requirement (~150Gb), hence the -m 150 flag.
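
If you want to check the QC outcome programmatically rather than by eye, a minimal sketch along these lines can be used (it assumes ValidatedGenomes.txt and FilteredGenomes.txt list one genome entry per line; adjust the parsing if the files are actually tab-delimited):

#in python3
#minimal sketch: count and list the genomes that passed/failed QC
with open('./merge_final/ValidatedGenomes.txt') as f:
    validated = [line.strip() for line in f if line.strip()]
with open('./merge_final/FilteredGenomes.txt') as f:
    filtered = [line.strip() for line in f if line.strip()]
print('%d genomes passed QC, %d failed' % (len(validated), len(filtered)))
for entry in filtered:
    print('failed QC:', entry)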

Quality control + taxonomic assignment on a list of genomes (faster)

This example achieves the same task as the previous one, except that it takes advantage of job parallelization. Instead of running jobs on 10 genomes at a time (2 jobs in the previous example), we now process batches of 2 genomes (-b flag), therefore submitting 7 smaller jobs.

#delete previous example folder
rm -rf ./test
# run the faster command
metagm_build.py /nfs/team162/kv4/bin/list_example_pipeline.txt ./test --QC --taxoAssign -m 150 -b 2

Kraken/Bracken database on a list of validated genomes

It is possible to skip the quality control and directly create the Kraken database if the user has already checked the genomes. Additionally, the taxonomic assignment step can be skipped if the user provides a taxid for each genome (third column in the input list); see the sketch at the end of this example. Here, we use the curated genome list generated in the previous example.

#Run on a queue with 'hugemem' (e.g., farm3)
metagm_build.py ./genome_with_gtdb_taxid.txt ./test --KrakenDB --BrackenDB

The command builds:

  1. Kraken2 database on XYZ genomes using GTDB taxonomy.
  2. Bracken files.
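
As mentioned above, the taxid is expected in the third column of the input list. Below is a minimal sketch to sanity-check such a list before building; it assumes a tab-delimited file with the genome path in the first column and the taxid in the third, so verify the actual layout of genome_with_gtdb_taxid.txt before relying on it:

#in python3
import csv
import os

#check that every genome file exists and that a numeric taxid is present in the third column
with open('./genome_with_gtdb_taxid.txt') as tsvfile:
    reader = csv.reader(tsvfile, delimiter='\t')
    for i, row in enumerate(reader):
        if not row:
            continue
        if not os.path.isfile(row[0]):
            print('missing genome file on line %d: %s' % (i + 1, row[0]))
        if len(row) < 3 or not row[2].isdigit():
            print('missing or invalid taxid on line %d' % (i + 1))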

Kraken/Bracken database on a list of non-QC genomes

It is possible to run all the previous steps in a single command: first, genomes are QC'ed, then validated genomes are assigned a taxonomy, and finally a Kraken/Bracken database is built.

#delete previous example folder
rm -rf ./test
#Run on a queue with 'hugemem' (e.g., farm3)
metagm_build.py /nfs/team162/kv4/bin/list_example_pipeline.txt ./test --QC --KrakenDB --BrackenDB -m 100 -b 2

The command runs:

  1. Quality control on 13 genomes.
    • according to ./merge_final/ValidatedGenomes.txt, 12 genomes passed QC
    • according to ./merge_final/FilteredGenomes.txt, 1 genome failed QC
    • according to ./merge_final/log.txt, the genome was filtered because of XYZ
  2. Taxonomic assignment on validated genomes only, using the GTDB taxonomy (default).
    • the taxonomic assignment can be found in ./genome_with_gtdb_taxid.txt
  3. Kraken2 database on XYZ genomes using GTDB taxonomy.
  4. Bracken files.

Comments

metagm_classify module

This section describes the process of classifying metagenomic sequencing reads to get both taxonomic and functional profiles. It also takes care of submitting all the jobs on the farm and managing job dependencies.

metagm_classify_pipeline

Inputs (mandatory)

There are two mandatory positional arguments for this function:

Options

The script metagm_classify.py offers software options for classification:

Outputs

The output folder contains a directory for each task that is performed:

Comments

Available Kraken/Bracken databases:

* __current__ Kraken1019: uses the MGnify curated collection (31,555 genomes) of bacteria and archaea. The taxon IDs have been assigned using GTDB.
* __current__ Kraken1019_with_nonBact_nonArch: same as Kraken1019, but also contains 131 gut-associated fungi, small eukaryotes and other parasites.
    * Kraken0419_kraken2_taxo_resolved: built with Kraken2, using the HBC + BGI collections and ~300 public genomes; the taxon IDs have been manually curated using GTDB rather than the default NCBI.
    * InternalKraken_OCT2017: HBC + public genomes (@Sam), relied on the old QC for reference genomes
    * RefSeq94n.msh: Mash sketch of the entire RefSeq (version 94)
    * taxonomy/ : custom taxonomy built from GTDB for Bacteria and Archaea, and saved in NCBI format for Kraken. It also contains a few viruses and eukaryotes.

Examples

The following examples illustrate various features of the metagm_classify.py script. Depending on how busy the farm is, these examples can take some time to run.

Bracken prediction for a list of metagenomes

metagm_classify.py --BrackenDB /nfs/pathogen005/team162/Kraken1019 /nfs/team162/kv4/bin/test_list_metagenomes_bangladesh.txt ./test_classify_pipeline -b 2

The command returns:

  1. in ./test_classify_pipeline/Bracken/*S.bracken, Bracken files on 6 metagenomes at species level (default).
    • also contains the merged table with all samples: ./test_classify_pipeline/Bracken/merged_S.bracken
  2. in ./test_classify_pipeline/Kraken/*.out, Kraken2 files on 6 metagenomes.
    • also contains the two merged tables with all samples from the Kraken output:
      • the raw read counts: ./test_classify_pipeline/Kraken/merged_raw.txt
      • the normalized abundances: ./test_classify_pipeline/Kraken/merged_norm.txt
  3. in ./test_classify_pipeline/Kraken/*bracken.out, Kraken2 files on 6 metagenomes, without unclassified reads.
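
For downstream analysis in Python, a minimal sketch for loading the merged species-level table is shown below; it assumes merged_S.bracken is tab-delimited with one row per taxon and one column per sample, so check the actual layout before relying on it:

#in python3
import pandas as pd  # assumes pandas is installed in your environment

#load the merged Bracken table (assumed tab-delimited, taxa as rows, samples as columns)
bracken = pd.read_csv('./test_classify_pipeline/Bracken/merged_S.bracken', sep='\t')
print(bracken.shape)   # number of taxa x number of columns
print(bracken.head())  # quick look at the first rows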

Bracken prediction for a list of metagenomes at genus level

A very similar command to the previous example, except that we introduce the --BrackenRank option to specify the rank Bracken should use:

metagm_classify.py --BrackenDB /nfs/pathogen005/team162/Kraken1019 /nfs/team162/kv4/bin/test_list_metagenomes_bangladesh.txt ./test_classify_pipeline_genus -b 2 --BrackenRank G

The command returns:

  1. in ./test_classify_pipeline_genus/Kraken/*.out, Kraken2 files on 6 metagenomes.
    • also contains the two merged tables with all samples from the Kraken output: the raw read counts (./test_classify_pipeline_genus/Kraken/merged_raw.txt) and the normalized abundances (./test_classify_pipeline_genus/Kraken/merged_norm.txt)
    • please note that these files should be exactly the same as in ./test_classify_pipeline/Kraken/, as the Kraken step considers all ranks
  2. in ./test_classify_pipeline_genus/Bracken/*G.bracken, Bracken files on 6 metagenomes at genus level.
    • also contains the merged table with all samples: ./test_classify_pipeline_genus/Bracken/merged_G.bracken

Detecting genomes in a list of metagenomes with Mash

For this example, RefSeq genomes are detected in metagenome samples. Please note that this exercise is not standard and still experimental, so be careful about how you use the results. The detected genomes are also likely to come from abundant organisms.

metagm_classify.py --MashDB /nfs/pathogen005/team162/RefSeq94n.msh /nfs/team162/kv4/bin/test_list_metagenomes_bangladesh.txt ./test_classify_pipeline_mash -b 2

The command returns:

  1. in ./test_classify_pipeline_mash/Mash/*_screen.tab, Mash screen results on 6 metagenomes.
  2. in ./test_classify_pipeline_mash/Mash/merged_mash.txt, the merged best hits detected with some confidence in the metagenomes
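
To go beyond the merged summary, here is a minimal sketch for filtering one per-sample Mash output; it assumes the standard mash screen column order (identity, shared-hashes, median-multiplicity, p-value, query-ID, query-comment), and the sample file name is hypothetical, so adapt both to your files:

#in python3
import csv

#keep only hits with high identity from one sample's screen output (hypothetical sample name)
hits = []
with open('./test_classify_pipeline_mash/Mash/sample1_screen.tab') as tsvfile:
    reader = csv.reader(tsvfile, delimiter='\t')
    for row in reader:
        identity = float(row[0])
        if identity >= 0.95:
            hits.append((row[4], identity))
#print the retained hits, best first
for query, identity in sorted(hits, key=lambda x: -x[1]):
    print(query, identity)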

Functional characterization of a list of metagenomes with Humann2

This pipeline feature relies on Humann2.

metagm_classify.py --functional /nfs/team162/kv4/bin/test_list_metagenomes_bangladesh.txt ./test_classify_pipeline_functional -b 2

The command returns:

Taxonomy

Taxonomic assignment

How to build a tree in NCBI format using gtdb metadata

The Kraken software relies on taxonomic information presented in the 'NCBI format' (a pair of nodes.dmp and names.dmp files). Unfortunately, GTDB does not (yet?) provide its taxonomy in such a format. Therefore, we need to build a GTDB-based taxonomy that can be saved in the NCBI format.

  1. download all the GTDB archaeal metadata and bacterial metadata, containing the taxonomic paths for every genome found in the database.
  2. extract the unique taxonomic paths from the metadata files to make the following steps faster
    # merge bacterial and archaeal metadata
    cat ar122_metadata.tsv bac120_metadata.tsv > bac_ar_metadata.tsv
    # get unique taxonomic paths
    cut -f17 bac_ar_metadata.tsv | sort -u > tmp.txt
    # remove last line that contains header
    head -n -1 tmp.txt > taxo_path_gtdb_bac120_ar122_unique.txt
    #remove temporary file
    rm tmp.txt
  3. create a Taxonomy tree in Python
#import library
import pickle # save binary object, like trees
import sys  # system command library
sys.path.append('/nfs/team162/kv4/github/metagm') # add the metagm library to your session
from metagm.phylogeny.TaxonomyTree import TaxonomyTree # class to generate trees
# this class only needs a list of taxonomic paths and will generate a taxonomy in NCBI format
tree = TaxonomyTree('taxo_path_gtdb_bac120_ar122_unique.txt')
# save the tree in NCBI format at the given location
tree.saveNCBIFormat('taxonomy_bac_ar')
# also save the tree in a Pickle format (Python binary) for later
tree.savePickleFormat('/nfs/team162/kv4/bin/gtdb_metadata/taxonomy_bac_ar/taxo.pyc')

How to add nodes to an existing taxonomic tree

If a tree needs to be updated by adding new nodes, it is not necessary to rebuild it from scratch. The user can provide a Pickle tree and new taxonomic paths:

#import library
import pickle  # load binary object, like trees
import csv # read csv files
import sys  # system command library
sys.path.append('/nfs/team162/kv4/github/metagm') # add the metagm library to your session
from metagm.phylogeny.TaxonomyTree import TaxonomyTree # class to generate trees

#load existing tree to add nodes
with open('taxonomy_bac_ar/taxo.pyc', "rb") as input_file:
    tree = pickle.load(input_file)

#Here we want to add non-bacterial and non-archaeal nodes (fungi, viruses and other eukaryotes)
extraNodes = '/nfs/team162/kv4/Kraken1019/nonBacterial_taxopath.txt'
with open(extraNodes) as tsvfile:
    reader = csv.reader(tsvfile, delimiter='\t')
    for i, row in enumerate(reader):
        print(row[0])
        # add a node if it is new
        tree.add_node(row[0])

# save the updated tree in NCBI format at the given location
tree.saveNCBIFormat('taxonomy')
# also save the tree in a Pickle format (Python binary) for later
tree.savePickleFormat('taxonomy/taxo.pyc')

Statistical analysis

This section provides some R snippets for analysing the metagm_classify output files.

More details are given in the R vignette '.Rmd', also found in this repository.

MySQL Knowledge database

Connection Setup

(screenshot: dbviz_new_1)

  1. Select Use Wizard, then choose a name for the connection (example: microbiota_db), then press Next

(screenshot: dbviz_new_2)

  2. Select MySQL as the Database driver, then press Next

(screenshot: dbviz_new_3)

  3. Fill in all the information as shown in the screenshot (password not included), then press Finish

(screenshot: dbviz_new_4)

  4. The user can then explore the different tables in the database (example: isolates)

(screenshot: dbviz_new_6)
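
If you prefer to query the database from Python rather than through the GUI, a minimal sketch along these lines can be used; the host, port, user and database names below are placeholders rather than the real connection details, and the mysql-connector-python package is assumed to be installed:

#in python3
import mysql.connector  # assumes the mysql-connector-python package is installed

#placeholder connection details: replace with the real values used in the wizard above
cnx = mysql.connector.connect(host='db-host.example.org', port=3306,
                              user='team162_reader', password='********',
                              database='microbiota_db')
cursor = cnx.cursor()
#print the first few rows of the isolates table
cursor.execute('SELECT * FROM isolates LIMIT 5')
for row in cursor.fetchall():
    print(row)
cursor.close()
cnx.close()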

Genomes table

All the genome assemblies can be found in /lustre/scratch118/infgen/team162/kv4/working_list_genomes_kraken.

Miscellaneous

Other utility functions/scripts

Quality control

This section describes the quality control applied to genome assemblies in the metagm_build module. Here are the different steps with their corresponding default values:

Functional analysis of metagenomes

Strain-level analysis using PanPhlan

If you want to run the PanPhlan tool for a specific species and get strain-level resolution, you might want to check if they provide a database for this species here. If they do, we already have them downloaded in /lustre/scratch118/infgen/team162/ys4/panphlan/panphlan_db. If not, you could build your own following this tutorial.

Extract 16S for a list of assemblies

There is a wrapper script for rnammer found in /nfs/team162/kv4/bin/run_rnammer.sh. It takes a list of paths to assemblies in a text file as well as an output directory to store the results. Example:

sh /nfs/team162/kv4/bin/run_rnammer.sh /nfs/team162/kv4/bin/test_rnammer.txt ./16S_extract/
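
Once the script has finished, a minimal sketch for inspecting the extracted sequences is shown below; it assumes rnammer writes FASTA files into the output directory, and the *.fasta naming pattern is an assumption, so adapt the glob to what you actually find in ./16S_extract/:

#in python3
import glob
from Bio import SeqIO  # assumes Biopython is installed

#loop over the FASTA files produced in the output directory (naming pattern is an assumption)
for fasta in glob.glob('./16S_extract/*.fasta'):
    for record in SeqIO.parse(fasta, 'fasta'):
        print(fasta, record.id, len(record.seq))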

Python library classes

This section describes the built-in classes of the metagm library. They can be used to support further Python script development or to add features in an efficient manner.

BacterialGenome() class:

This class creates a genome object from a sequence file and stores various information about it:

Examples

#in python3

#load libraries
import sys
sys.path.append('/nfs/team162/kv4/github/metagm')
from metagm.sequences.BacterialGenome import BacterialGenome

# create a genome using its sequence
g = BacterialGenome(genomefilepath='/lustre/scratch118/infgen/team162/kv4/working_list_genomes_kraken/GCA_900129655.1_IMG-taxon_2695420967_annotated_assembly_genomic.fna',  genomename='Bacteroides clarus GCF_900129655', taxid='626929')

# it will run QC on it first (default)
# qc=False can be used in the previous command to avoid running QC

# then, asking if the genome passed QC
g.is_valid()
# True means the genome passed QC

# then, extract the 16S sequence of this genome (rnammer)
g.get_16s_sequence()
g.seq16s # it is a SeqRecord object
len(g.seq16s) # one 16S copy
#get the associated 16S fasta sequence
print(g.seq16s["rRNA_FQWK01000014.1_260-1774_DIR+"].format("fasta"))

GenomeList() class:

This class creates an object made of multiple BacterialGenome objects and makes it easy to serialize analyses. Its components are:

Examples

#in python3.6

#load libraries
import csv
import sys
sys.path.append('/nfs/team162/kv4/github/metagm')
from metagm.sequences.BacterialGenome import BacterialGenome
from metagm.sequences.GenomeList import GenomeList

# create list of genomes from an input file
print('Loading list of genomes...')
glist = GenomeList(inputfile='/nfs/team162/kv4/bin/test_list_genomes.txt')
with open(glist.inputfile) as tsvfile:
    reader = csv.reader(tsvfile, delimiter='\t')
    for i, row in enumerate(reader):
        g = BacterialGenome(genomefilepath=row[0], genomename=row[1], qc=False)
        # append to existing list
        glist.append(g, checkvalid=False)
#get list length -->10
len(glist)

#get name of first genome in list
glist.genomelist[0].genomename

#get path to the fourth genome in the list (index 3)
glist.genomelist[3].genomefilepath

#remove the first genome in the list using its name
glist.remove(genomeidentifier=glist.genomelist[0].genomename)
#get list length --> 9
len(glist)

#get name of first genome in list --> different than previously
glist.genomelist[0].genomename

#get gtdb taxid for a list of genomes (submit 1 job on long queue, might take some time)
glist.list_assign_taxo_gtdb()
# create a 'myTaxids.txt' file with the results and store the results in the list also
glist.taxidfile
# get the list of taxids attached to the genomes
glist.taxidlist

Metagenome() class:

This class stores a metagenome sample. Its components are:

Examples

#in python3.6

#load libraries
import sys
sys.path.append('/nfs/team162/kv4/github/metagm')
from metagm.sequences.Metagenome import Metagenome

# create a metagenome object from paired fastq
mg = Metagenome(metagenomename='ERR2835563', metafastqfile1='/lustre/scratch118/infgen/pathogen/pathpipe/pathogen_prok_external/seq-pipelines/human/gut_metagenome/TRACKING/235/PS.170.S4.5.metagenomic/SLX/ERX2842313/ERR2835563/ERR2835563_1.fastq.gz',metafastqfile2='/lustre/scratch118/infgen/pathogen/pathpipe/pathogen_prok_external/seq-pipelines/human/gut_metagenome/TRACKING/235/PS.170.S4.5.metagenomic/SLX/ERX2842313/ERR2835563/ERR2835563_2.fastq.gz')

# the builder automatically detects whether the data are paired-end
mg.pairedFlag
# the builder automatically detects whether the data are zipped
mg.zippedFlag

MetagenomeList() class:

This class creates an object made of multiple Metagenome objects and makes it easy to serialize analyses. Its components are:

Examples

#in python3.6

#load libraries
import csv
import sys
import os
import re
sys.path.append('/nfs/team162/kv4/github/metagm')
from metagm.sequences.Metagenome import Metagenome
from metagm.sequences.MetagenomeList import MetagenomeList

# create list of metagenomes from an input file
mglist = MetagenomeList()
with open('/nfs/team162/kv4/bin/test_list_metagenomes_bangladesh.txt') as tsvfile:
    reader = csv.reader(tsvfile, delimiter='\t')
    for i, row in enumerate(reader):
        # retrieve the paired-end fastq from folder path only
        files = [f for f in os.listdir(os.path.dirname(row[0])) if re.match(os.path.basename(row[0])+'_[12].fastq.gz$', f)]
        mg=Metagenome(metagenomename=os.path.basename(row[0]),metafastqfile1=os.path.dirname(row[0])+'/'+files[0], metafastqfile2=os.path.dirname(row[0])+'/'+files[1])
        # append
        mglist.append(mg)

#get list length --> 5
len(mglist)

#classify these samples with Kraken (submit jobs only and return the job IDs)
mglist.classify_kraken('/nfs/pathogen005/team162/Kraken0419_kraken2_taxo_resolved')

#results are stored in './tmp'

TaxonomyTree() class:

This class stores a tree structure using the ETE3 toolkit. Its components are:

Examples

#in python3.6

#load libraries
import pickle
import csv
import sys
sys.path.append('/nfs/team162/kv4/github/metagm')
from metagm.phylogeny.TaxonomyTree import TaxonomyTree

# create a tree object from a list of 10 paths
tree = TaxonomyTree('/nfs/team162/kv4/bin/test_list_taxopaths.txt')

tree.maxtaxid
#51 nodes were created

#print the tree in the terminal (ASCII format)
tree.printAscii()

#save tree in NCBI format
tree.saveNCBIFormat('./')

#save tree in Pickle format (can be loaded in any Python session)
tree.savePickleFormat('./taxo.pyc')

TODO

This section lists features that are either under development or not yet implemented: