medema-group / BiG-MAP

Other
25 stars 7 forks source link

This is the Github repository for the Biosynthetic Gene cluster Meta'omics abundance Profiler (BiG-MAP). Metabolic Gene Clusters (MGCs) are responsible of the synthesis of small molecules, that are known to have a major impact on the host. To evaluate their contribution to a given host phenotype, it is important to assess the MGC abundance and expression profile in a given dataset. Thus, we have built BiG-MAP, a bioinformatic tool to profile abundance and expression levels of gene clusters across metagenomic and metatranscriptomic data and evaluate their differential abundance and expression between different conditions. BiG-MAP is composed of 4 different modules:

For information on how to install and run BiG-MAP, check this tutorial. Please, take into account that this tool has been only tested on Ubuntu.

+BEGIN_EXAMPLE

~$ git clone https://github.com/medema-group/BiG-MAP.git

+END_EXAMPLE

Then install all the dependencies from the BiG-MAP.yml file with:

+BEGIN_EXAMPLE

For BiG-MAP.download.py, BiG-MAP.family.py and BiG-MAP.map.py

~$ conda env create -f BiG-MAP_process.yml BiG-MAP_process ~$ conda activate BiG-MAP_process

For BiG-MAP.analyse.py

~$ conda env create -f BiG-MAP_analyse.yml BiG-MAP_analyse ~$ conda activate BiG-MAP_analyse

+END_EXAMPLE

This step is optional but to make use of the second redundancy filtering step in the BiG-MAP.family module, download BiG-SCAPE using:

+BEGIN_EXAMPLE

~$ git clone https://github.com/medema-group/BiG-SCAPE/

+END_EXAMPLE

Further below you will find more information on how to install BiG-SCAPE and its dependencies, but you can also check the wiki page for more information: https://github.com/medema-group/BiG-SCAPE/wiki.

The four modules are described more in detail below where you will also find the commands to run them.

** 1) BiG-MAP.download.py This script is created to easily download the metagenomic and/or metatranscriptomic samples available in the SRA repository (https://www.ncbi.nlm.nih.gov/sra). First, the samples are downloaded in /.SRA/ format, and then they are converted into /.fastq/ using /fastq-dump/.

+BEGIN_EXAMPLE

conda activate BiG-MAP_process python3 BiG-MAP.download.py -h python3 BiG-MAP.download.py [Options]* -A [accession_list_file] -O [path_to_outdir]

+END_EXAMPLE

To download the samples, go to the [[https://www.ncbi.nlm.nih.gov/Traces/study/][SRA run selector]] and use the BioProject record of interest. Next, from the resulting page, download the Accession list. This is a file that contains a list of sample accessions and it is used as input for the BiG-MAP.download module. In this tutorial, the IBD-cohort of Schirmer et al. (2018) is used, thus the Accession list of the BioProject PRJNA389280 was downloaded. With the aim of simplifying the tutorial and speeding up the analysis, 8 metagenomic samples were chosen: 4 samples from patients suffering Crohn Disease (CD) and 4 from Ulcerative Colitis (UC). Create a file with the following 8 samples' accessions and run BiG-MAP.download command to get the fastq files:

+BEGIN_EXAMPLE

Acc_list.txt: SRR5947837 SRR5947861 SRR5947824 SRR5947881 SRR5947836 SRR5947862 SRR5947855 SRR5947841

python3 BiG-MAP.download.py -A Acc_list.txt -O /usr001/fastq/schirmer/

+END_EXAMPLE

** 2) BiG-MAP.family.py The main purpose of this module is to group gene clusters into GCFs using sequence similarity. The first redundancy filtering step is performed by MASH, that by default uses 0.8 sequence similarity cut-off but can be changed as desired. Additionally, a second round of redundancy filtering can be performed by using BiG-SCAPE. We strongly recommend using BiG-SCAPE for a more accurate redundancy filtering. For that, look at the BiG-SCAPE wiki on how to install it: https://git.wageningenur.nl/medema-group/BiG-SCAPE/-/wikis/installation. To run BiG-SCAPE, you will also need to have the latest (processed) Pfam database /Pfam-A.hmm.gz/ available from the Pfam FTP website (https://pfam.xfam.org/). Once the /Pfam-A.hmm.gz/ file is downloaded, uncompress it and process it using the /hmmpress/ command from the HMMER suit (http://hmmer.org/).

BiG-MAP.family takes as input the output directories of any anti- or gutSMASH run. Given a set of genomes, gutSMASH/antiSMASH can predict multiple gene clusters, thus the output folders containing the predicted gene clusters for each genome are the ones used as input for this module. Please, take into account that it needs to be run beforehand. To follow this tutorial without previously running anti- or gutSMASH, you can find 10 exemplary gutSMASH output folders in here: [[https://github.com/medema-group/BiG-MAP/tree/master/example_data][example data]] folder.

To make use of the tutorial gutSMASH folders, first extract the files using:

+BEGIN_EXAMPLE

tar -xf BiG-MAP_tutorial_genomes.tar.gz

+END_EXAMPLE

The general usage of BiG-MAP.family is:

+BEGIN_EXAMPLE

conda activate BiG-MAP_process python3 BiG-MAP.family.py -h python3 BiG-MAP.family.py [Options]* -D [input dir(s)] -O [output dir]

+END_EXAMPLE

Check the command below to see how to run this module with the tutorial samples and BiG-SCAPE:

+BEGIN_EXAMPLE

python3 BiG-MAP.family.py -D /usr001/BiG-MAP_tutorial_genomes/ -b /usr001/BiG-SCAPE_location/ -pf /usr001/pfam_files_location/ -O /usr001/results_family/

This yields: BiG-MAP.GCF.bed = Bedfile to extract core regions in BiG-MAP.map.py BiG-MAP.GCF.fna = Reference file to map the WGS reads to BiG-MAP.GCs.json = Dictionary that contains the GCFs BiG-MAP.GCF.json = Dictionary that contains the BiG-SCAPE GCFs

+END_EXAMPLE

In general, the anti- or gutSMASH-output folder should contain the results of at least several runs. Optional flags to run this module include:

-tg: Fraction between 0 and 1; the similarity threshold that determines when the protein sequences of the gene clusters can be considered similar. If the threshold is set to zero, all gene clusters will form their own gene cluster families, whereas a threshold of one will result in one large family containing all gene clusters. Default = 0.8.

-th: Fraction between 0 and 1; the similarity threshold that determines when the protein sequences of the housekeeping genes can be considered similar. Default = 0.1

-f: Specify here the number of genes that are flanking the core genes of the gene cluster. 0 --> only the core, n --> n genes included that flank the core. Default = 0

-g: Output whole genome fasta files for the MASH filtered gene clusters as well. This uses more disk space in the output directory. 'True' | 'False'. Default = False

-p: Number of used parallel threads in the BiG-SCAPE filtering step. Default = 6

NOTE: the number of predicted MGCs may exceed the maximum number of sequences that MASH (“sketch” and “dist” functions) is able to compare, leading to an error. In this scenario, the family module can be run in batches or the code can be slightly modified to manually run the MASH analysis and MASH “paste” function (more information is available in their documentation at https://mash.readthedocs.io/en/latest/) and pick up the analysis again from that step onwards.

** 3) BiG-MAP.map.py This module is designed to align the WGS (paired or unpaired) reads to the reference representatives of each GCF and HGF using /bowtie2/. The following will be computed: RPKM, coverage, core coverage. The coverage is calculated using /Bedtools/, and the read count values using /Samtools/. The general usage is:

+BEGIN_EXAMPLE

conda activate BiG-MAP_process python3 BiG-MAP.map.py -h python3 BiG-MAP.map.py {-I1 [mate-1s] -I2 [mate-2s] | -U [samples]} {-F [family] | -P [pickled file]} -O [outdir] -b [metadata] [Options*]

+END_EXAMPLE

To map the 8 samples from Schirmer et al. (2018) to the GCF reference representatives, and correct for the BiG-SCAPE GCF size, run:

NOTE: It is important for downstream analysis to also use the /-b/ flag. Also, if it is prefered to use the averaged number of reads mapped per GCF (instead of summed), the flag /-a or --average/ needs to be included in the command below

+BEGIN_EXAMPLE

python3 BiG-MAP.map.py -b /usr001/results/schirmer_metadata.txt -I1 /usr001/fastq/schirmer/pass_1 -I2 /usr001/fastq/schirmer/pass_2 -O /usr001/results_mapping/ -F /usr001/results_family/

the schirmer_metadata.txt is set up as follows (tab-delimited):

run.ID host.ID SampleType DiseaseStatus

SRR5947837 M2026C2_MGX METAGENOMIC UC SRR5947861 M2026C3_MGX METAGENOMIC UC SRR5947824 M2026C4_MGX METAGENOMIC UC SRR5947881 M2026C7_MGX METAGENOMIC UC SRR5947836 M2027C1_MGX METAGENOMIC CD SRR5947862 M2027C2_MGX METAGENOMIC CD SRR5947855 M2027C3_MGX METAGENOMIC CD SRR5947841 M2027C5_MGX METAGENOMIC CD

note the '#' to denote the header row!!!

+END_EXAMPLE

** 4) BiG-MAP.analyse.py This module performs a statistical analysis on the metagenomic/metatranscriptomic samples. First, the script normalizes and filters the data. Whereafter, the best covered gene clusters can be observed using the /--explore/ flag. Next, the Kruskal Wallis and fitZIG model will be used to compute differentially abundant/expressed gene clusters and Benjamini-Hochberg FDR compensates for multiple hypothesis testing. The output of the script are several heatmaps in pdf format.

To run the script, the BiG-MAP_analyse conda environment should be activated. The general usage is:

+BEGIN_EXAMPLE

conda activate BiG-MAP_analyse python3 BiG-MAP.analyse.py -h python3 BiG-MAP.analyse.py --explore --compare -B [biom_file] -T [metagenomic/metatranscriptomic] -M [metagroup] -O [outdir] [Options*]

Example command for the explore heatmap: python3 BiG-MAP.analyse.py --explore -B /usr001/results_mapping/biom-results/BiG-MAP.map.metacore.dec.biom -T metagenomic -M DiseaseStatus -O /usr001/results_analysis

Example command for the compare heatmap: python3 BiG-MAP.analyse.py --compare -B /usr001/results_mapping/biom-results/BiG-MAP.map.metacore.dec.biom -T metagenomic -M DiseaseStatus -g UC CD -O /usr001/results_analysis

Example command including both the explore and the compare heatmap: python3 BiG-MAP.analyse.py --explore --compare -B /usr001/results_mapping/biom-results/BiG-MAP.map.metacore.dec.biom -T metagenomic -M DiseaseStatus -g UC CD -O /usr001/results_analysis

Note: You can either choose between the BiG-MAP.map.metacore.dec.biom or the BiG-MAP.mapcore.metacore.dec.biom as -B flag input file, depending if you are interested on plotting the results for the whole gene clusters or only the core genomic region of the gene clusters respectively.

Output: explore_heatmap.pdf & explore_heatmap.eps -> contains the top 20 best covered gene clusters UCvsCD_fz.pdf & UCvsCD.eps -> comparison between UC and CD using the fitZIG model UCvsCD_kw.pdf & UCvsCD_kw.eps -> comparison between UC and CD using the Kruskal Wallis model tsv-results -> directory containing tsv files with the raw data

+END_EXAMPLE

+BEGIN_EXAMPLE

~$ git clone https://github.com/medema-group/BiG-MAP.git

+END_EXAMPLE

Install Snakemake with the following command:

+BEGIN_EXAMPLE

~$ conda create -n snakemake -c conda-forge -c bioconda snakemake=6.0.2 ~$ conda activate snakemake

+END_EXAMPLE

Next, copy the BiG-MAP_snakemake folder to the preferred output location

+BEGIN_EXAMPLE

~$ cp -r BiG-MAP/BiG-MAP_snakemake/ /path/to/output/location/

+END_EXAMPLE

Navigate to the BiG-MAP_snakemake folder and adjust the config.yaml file. In this file, the locations of files and folders should be included based on the wanted BiG-MAP run settings. After adjusting the config file, use the following command to start the BiG-MAP run:

+BEGIN_EXAMPLE

~$ snakemake --use-conda --cores 10

+END_EXAMPLE

NOTE: It's recommended to use a conda-prefix location to a folder in which the BiG-MAP conda environments can be installed (--conda-prefix /path/to/snakemake/conda/envs/). Additionally, the number of cores used by BiG-MAP can be adjusted with the --cores flag.

** Software:

** Packages: *** Python