[[https://github.com/medema-group/BiG-MAP/blob/master/Pipeline_overview.png]]
This is the Github repository for the Biosynthetic Gene cluster Meta'omics abundance Profiler (BiG-MAP). Metabolic Gene Clusters (MGCs) are responsible of the synthesis of small molecules, that are known to have a major impact on the host. To evaluate their contribution to a given host phenotype, it is important to assess the MGC abundance and expression profile in a given dataset. Thus, we have built BiG-MAP, a bioinformatic tool to profile abundance and expression levels of gene clusters across metagenomic and metatranscriptomic data and evaluate their differential abundance and expression between different conditions. BiG-MAP is composed of 4 different modules:
For information on how to install and run BiG-MAP, check this tutorial. Please, take into account that this tool has been only tested on Ubuntu.
~$ git clone https://github.com/medema-group/BiG-MAP.git
Then install all the dependencies from the BiG-MAP.yml file with:
~$ conda env create -f BiG-MAP_process.yml BiG-MAP_process ~$ conda activate BiG-MAP_process
~$ conda env create -f BiG-MAP_analyse.yml BiG-MAP_analyse ~$ conda activate BiG-MAP_analyse
This step is optional but to make use of the second redundancy filtering step in the BiG-MAP.family module, download BiG-SCAPE using:
~$ git clone https://github.com/medema-group/BiG-SCAPE/
Further below you will find more information on how to install BiG-SCAPE and its dependencies, but you can also check the wiki page for more information: https://github.com/medema-group/BiG-SCAPE/wiki.
The four modules are described more in detail below where you will also find the commands to run them.
** 1) BiG-MAP.download.py This script is created to easily download the metagenomic and/or metatranscriptomic samples available in the SRA repository (https://www.ncbi.nlm.nih.gov/sra). First, the samples are downloaded in /.SRA/ format, and then they are converted into /.fastq/ using /fastq-dump/.
conda activate BiG-MAP_process python3 BiG-MAP.download.py -h python3 BiG-MAP.download.py [Options]* -A [accession_list_file] -O [path_to_outdir]
To download the samples, go to the [[https://www.ncbi.nlm.nih.gov/Traces/study/][SRA run selector]] and use the BioProject record of interest. Next, from the resulting page, download the Accession list. This is a file that contains a list of sample accessions and it is used as input for the BiG-MAP.download module. In this tutorial, the IBD-cohort of Schirmer et al. (2018) is used, thus the Accession list of the BioProject PRJNA389280 was downloaded. With the aim of simplifying the tutorial and speeding up the analysis, 8 metagenomic samples were chosen: 4 samples from patients suffering Crohn Disease (CD) and 4 from Ulcerative Colitis (UC). Create a file with the following 8 samples' accessions and run BiG-MAP.download command to get the fastq files:
Acc_list.txt: SRR5947837 SRR5947861 SRR5947824 SRR5947881 SRR5947836 SRR5947862 SRR5947855 SRR5947841
python3 BiG-MAP.download.py -A Acc_list.txt -O /usr001/fastq/schirmer/
** 2) BiG-MAP.family.py The main purpose of this module is to group gene clusters into GCFs using sequence similarity. The first redundancy filtering step is performed by MASH, that by default uses 0.8 sequence similarity cut-off but can be changed as desired. Additionally, a second round of redundancy filtering can be performed by using BiG-SCAPE. We strongly recommend using BiG-SCAPE for a more accurate redundancy filtering. For that, look at the BiG-SCAPE wiki on how to install it: https://git.wageningenur.nl/medema-group/BiG-SCAPE/-/wikis/installation. To run BiG-SCAPE, you will also need to have the latest (processed) Pfam database /Pfam-A.hmm.gz/ available from the Pfam FTP website (https://pfam.xfam.org/). Once the /Pfam-A.hmm.gz/ file is downloaded, uncompress it and process it using the /hmmpress/ command from the HMMER suit (http://hmmer.org/).
BiG-MAP.family takes as input the output directories of any anti- or gutSMASH run. Given a set of genomes, gutSMASH/antiSMASH can predict multiple gene clusters, thus the output folders containing the predicted gene clusters for each genome are the ones used as input for this module. Please, take into account that it needs to be run beforehand. To follow this tutorial without previously running anti- or gutSMASH, you can find 10 exemplary gutSMASH output folders in here: [[https://github.com/medema-group/BiG-MAP/tree/master/example_data][example data]] folder.
To make use of the tutorial gutSMASH folders, first extract the files using:
tar -xf BiG-MAP_tutorial_genomes.tar.gz
The general usage of BiG-MAP.family is:
conda activate BiG-MAP_process python3 BiG-MAP.family.py -h python3 BiG-MAP.family.py [Options]* -D [input dir(s)] -O [output dir]
Check the command below to see how to run this module with the tutorial samples and BiG-SCAPE:
python3 BiG-MAP.family.py -D /usr001/BiG-MAP_tutorial_genomes/ -b /usr001/BiG-SCAPE_location/ -pf /usr001/pfam_files_location/ -O /usr001/results_family/
This yields: BiG-MAP.GCF.bed = Bedfile to extract core regions in BiG-MAP.map.py BiG-MAP.GCF.fna = Reference file to map the WGS reads to BiG-MAP.GCs.json = Dictionary that contains the GCFs BiG-MAP.GCF.json = Dictionary that contains the BiG-SCAPE GCFs
In general, the anti- or gutSMASH-output folder should contain the results of at least several runs. Optional flags to run this module include:
-tg: Fraction between 0 and 1; the similarity threshold that determines when the protein sequences of the gene clusters can be considered similar. If the threshold is set to zero, all gene clusters will form their own gene cluster families, whereas a threshold of one will result in one large family containing all gene clusters. Default = 0.8.
-th: Fraction between 0 and 1; the similarity threshold that determines when the protein sequences of the housekeeping genes can be considered similar. Default = 0.1
-f: Specify here the number of genes that are flanking the core genes of the gene cluster. 0 --> only the core, n --> n genes included that flank the core. Default = 0
-g: Output whole genome fasta files for the MASH filtered gene clusters as well. This uses more disk space in the output directory. 'True' | 'False'. Default = False
-p: Number of used parallel threads in the BiG-SCAPE filtering step. Default = 6
NOTE: the number of predicted MGCs may exceed the maximum number of sequences that MASH (“sketch” and “dist” functions) is able to compare, leading to an error. In this scenario, the family module can be run in batches or the code can be slightly modified to manually run the MASH analysis and MASH “paste” function (more information is available in their documentation at https://mash.readthedocs.io/en/latest/) and pick up the analysis again from that step onwards.
** 3) BiG-MAP.map.py This module is designed to align the WGS (paired or unpaired) reads to the reference representatives of each GCF and HGF using /bowtie2/. The following will be computed: RPKM, coverage, core coverage. The coverage is calculated using /Bedtools/, and the read count values using /Samtools/. The general usage is:
conda activate BiG-MAP_process python3 BiG-MAP.map.py -h python3 BiG-MAP.map.py {-I1 [mate-1s] -I2 [mate-2s] | -U [samples]} {-F [family] | -P [pickled file]} -O [outdir] -b [metadata] [Options*]
To map the 8 samples from Schirmer et al. (2018) to the GCF reference representatives, and correct for the BiG-SCAPE GCF size, run:
NOTE: It is important for downstream analysis to also use the /-b/ flag. Also, if it is prefered to use the averaged number of reads mapped per GCF (instead of summed), the flag /-a or --average/ needs to be included in the command below
python3 BiG-MAP.map.py -b /usr001/results/schirmer_metadata.txt -I1 /usr001/fastq/schirmer/pass_1 -I2 /usr001/fastq/schirmer/pass_2 -O /usr001/results_mapping/ -F /usr001/results_family/
the schirmer_metadata.txt is set up as follows (tab-delimited):
SRR5947837 M2026C2_MGX METAGENOMIC UC SRR5947861 M2026C3_MGX METAGENOMIC UC SRR5947824 M2026C4_MGX METAGENOMIC UC SRR5947881 M2026C7_MGX METAGENOMIC UC SRR5947836 M2027C1_MGX METAGENOMIC CD SRR5947862 M2027C2_MGX METAGENOMIC CD SRR5947855 M2027C3_MGX METAGENOMIC CD SRR5947841 M2027C5_MGX METAGENOMIC CD
note the '#' to denote the header row!!!
** 4) BiG-MAP.analyse.py This module performs a statistical analysis on the metagenomic/metatranscriptomic samples. First, the script normalizes and filters the data. Whereafter, the best covered gene clusters can be observed using the /--explore/ flag. Next, the Kruskal Wallis and fitZIG model will be used to compute differentially abundant/expressed gene clusters and Benjamini-Hochberg FDR compensates for multiple hypothesis testing. The output of the script are several heatmaps in pdf format.
To run the script, the BiG-MAP_analyse conda environment should be activated. The general usage is:
conda activate BiG-MAP_analyse python3 BiG-MAP.analyse.py -h python3 BiG-MAP.analyse.py --explore --compare -B [biom_file] -T [metagenomic/metatranscriptomic] -M [metagroup] -O [outdir] [Options*]
Example command for the explore heatmap: python3 BiG-MAP.analyse.py --explore -B /usr001/results_mapping/biom-results/BiG-MAP.map.metacore.dec.biom -T metagenomic -M DiseaseStatus -O /usr001/results_analysis
Example command for the compare heatmap: python3 BiG-MAP.analyse.py --compare -B /usr001/results_mapping/biom-results/BiG-MAP.map.metacore.dec.biom -T metagenomic -M DiseaseStatus -g UC CD -O /usr001/results_analysis
Example command including both the explore and the compare heatmap: python3 BiG-MAP.analyse.py --explore --compare -B /usr001/results_mapping/biom-results/BiG-MAP.map.metacore.dec.biom -T metagenomic -M DiseaseStatus -g UC CD -O /usr001/results_analysis
Note: You can either choose between the BiG-MAP.map.metacore.dec.biom or the BiG-MAP.mapcore.metacore.dec.biom as -B flag input file, depending if you are interested on plotting the results for the whole gene clusters or only the core genomic region of the gene clusters respectively.
Output: explore_heatmap.pdf & explore_heatmap.eps -> contains the top 20 best covered gene clusters UCvsCD_fz.pdf & UCvsCD.eps -> comparison between UC and CD using the fitZIG model UCvsCD_kw.pdf & UCvsCD_kw.eps -> comparison between UC and CD using the Kruskal Wallis model tsv-results -> directory containing tsv files with the raw data
Snakemake workflow This Snakemake workflow allows for a more automated and streamlined running of the separated BiG-MAP modules. For more information on how to install and run the Snakemake version of BiG-MAP, check the instructions below.
Installation and run overview Install BiG-MAP dependencies using conda. Conda can be installed from [[https://docs.conda.io/en/latest/miniconda.html][miniconda_link]]. First pull the BiG-MAP repository from github:
~$ git clone https://github.com/medema-group/BiG-MAP.git
Install Snakemake with the following command:
~$ conda create -n snakemake -c conda-forge -c bioconda snakemake=6.0.2 ~$ conda activate snakemake
Next, copy the BiG-MAP_snakemake folder to the preferred output location
~$ cp -r BiG-MAP/BiG-MAP_snakemake/ /path/to/output/location/
Navigate to the BiG-MAP_snakemake folder and adjust the config.yaml file. In this file, the locations of files and folders should be included based on the wanted BiG-MAP run settings. After adjusting the config file, use the following command to start the BiG-MAP run:
~$ snakemake --use-conda --cores 10
NOTE: It's recommended to use a conda-prefix location to a folder in which the BiG-MAP conda environments can be installed (--conda-prefix /path/to/snakemake/conda/envs/). Additionally, the number of cores used by BiG-MAP can be adjusted with the --cores flag.
** Software:
** Packages: *** Python