DGINN is a pipeline dedicated to the detection of genetic innovations, starting from a nucleotide sequence.
It automates all the necessary preliminary steps for evolutionary analyses, including retrieval of homologs, assignment to orthology groups, codon alignment and reconstruction of gene phylogeny.
Once the alignments and corresponding phylogenies are obtained, three major genetic innovations are detected: duplication events, recombination events, and signatures of positive selection.
DGINN was validated on nineteen primate genes with known evolutionary histories, and results can be consulted in the associated paper (doi: https://doi.org/10.1093/nar/gkaa680). Results from the validation are available in the corresponding repository. The version of DGINN used in the paper corresponds to commit 5db0253 and can be fetched through:
git init
git remote add origin https://github.com/leapicard/DGINN
git fetch --depth 1 origin 5db02532408afcafad50a0b70dcf247ab4800492
git checkout FETCH_HEAD
A Docker image is available for both the paper version and the current version of DGINN.
Any questions or suggestions about the program can be addressed to lea.picard [at] ens-lyon.fr, laurent.gueguen [at] univ-lyon1.fr or lucie.etienne [at] ens-lyon.fr.
The pipeline is organized using snakemake.
The simplest way to use DGINN is through a conda environment or a container, which spares the user from installing all of DGINN's dependencies and should make cross-platform usage possible (Linux/Mac OS/Windows).
We provide a conda environment specification for DGINN in the file environment.yml. To run locally, first create and activate a conda environment from environment.yml:
conda env create --file=environment.yml
conda activate dginn
then, in the working directory
snakemake -s path_to_Snakefile --cores 1 --configfile=configuration_file
To allow more than one core for the analysis, set the --cores option accordingly. If the number is omitted (i.e., only --cores is given), snakemake uses all the CPU cores available on the machine.
Users can run either of the container images that we provide, through Docker or Apptainer, so the only software to install is the chosen container system.
Please be aware that, because Docker requires root access, the Docker container is not appropriate for cluster environments, though it is appropriate for cloud computing (tutorial to come) and local usage. The Apptainer (formerly Singularity) container should be usable in every environment.
To use docker you will have to clone this repository first, and then build the docker image with:
docker build . -t dginn
After that, you will be able to run DGINN with:
docker run --rm -u $(id -u $USER) -v $(pwd):/local dginn --cores 1 --configfile config_example.yaml
The command should be run as is, and should work on both Mac and Linux systems, provided the user belongs to the 'docker' group (please refer to the Docker Documentation for help with adding the user to this group on Linux).
We unfortunately cannot guarantee that the Docker container is usable on Windows. In case the container doesn't work, we advise the user to try the Apptainer container.
To use an Apptainer or Singularity container you will have to clone this repository. You can then build the image by running the following at its root:
apptainer build dginn.sif Apptainer
To use the container, you can run the following from the root of the repository:
apptainer run dginn.sif --cores 1 --configfile config_example.yaml
If you want to run DGINN from another folder, you can specify the path to the Snakefile file in the cloned repository:
apptainer run /path/to/dginn.sif -s /path/to/Snakefile --cores 1 --configfile config_example.yaml
DGINN uses a parameter file to pass all the necessary arguments for launching the pipeline. Two example files are provided in the examples directory:
This is the recommended usage for DGINN, so that analyses for positive selection can be parallelized over all alignments instead of doing them sequentially.
Please be aware that fasta sequence name and queryName must follow the format speSpe_GENE_Id for matching (ex: homSap_MX1_CCDS13673, macMul_APOBEC3G_NM_001198693).
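The naming rule above can be checked before launching an analysis. The following is a minimal sketch, not part of DGINN: the regular expression is an assumption derived from the two examples given (a species code in speSpe or speSpeSpe form, a gene name without underscores, then an identifier that may itself contain underscores), and the function names are hypothetical.

```python
import re

# Assumed pattern, inferred from the examples in the text:
#   species code: speSpe or speSpeSpe (e.g. homSap, gorGorGor)
#   gene name:    no underscore, no whitespace
#   identifier:   anything without whitespace (may contain underscores)
NAME_PATTERN = re.compile(r"^[a-z]{3}(?:[A-Z][a-z]{2}){1,2}_[^_\s]+_\S+$")


def is_valid_name(name):
    """Return True if name looks like speSpe_GENE_Id."""
    return NAME_PATTERN.match(name) is not None


def split_name(name):
    """Split a valid name into (species, gene, identifier)."""
    if not is_valid_name(name):
        raise ValueError(f"not in speSpe_GENE_Id format: {name!r}")
    # Split on the first two underscores only, so that identifiers such as
    # NM_001198693 stay in one piece.
    species, gene, identifier = name.split("_", 2)
    return species, gene, identifier
```

For instance, `split_name("macMul_APOBEC3G_NM_001198693")` keeps the accession `NM_001198693` intact as the identifier.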
# Path or list of paths (absolute or relative) to the files needed to start the pipeline
# Please refer to **3/ Entry steps** for necessary files
infile:
# Output directory for all results
# Automatically created if not specified
outdir:
# Path to a file where progress of the pipeline will be logged
# Automatically created if not specified
logfile:
##################################
### STEP
##################################
# Step at which to enter the pipeline (default: blast)
# Please refer to 3/ Entry steps for names and necessary files
step:
##################################
### BLAST
##################################
# NCBI database on which the blast is to be performed (ex: nr)
# Future implementations will include the possibility to perform the search on local databases
blastdb:
# E-value for Blast (default: 10⁻⁴)
evalue:
# Coverage for Blast (default: 50)
mincov:
# Percentage of identity for Blast (default: 70)
percID:
#################################
### QUERY
#################################
# Option for eliminating overly long sequences (default: cutoff(3))
# IQR or cutoff, factor can be put after in parenthesis
# cutoff will delete all sequences longer than (factor) times the median of the distribution
# IQR will delete all sequences longer than the third quartile plus (factor) times the InterQuartile Range
maxLen:
# Can be used to limit the search on NCBI databases to certain set of species, to exclude others, etc.
# https://www.ncbi.nlm.nih.gov/books/NBK3837/#EntrezHelp.Entrez_Searching_Options
entryQuery:
# Identifier of the reference sequence for steps outside of blast and positiveSelection
queryName:
# Determines if Blast is performed against NCBI databases (default: True)
remote:
# NCBI API key to increase Blast speed, obtainable from the NCBI
APIKey:
##################################################
###### ALIGNMENT
##################################################
# Choice of codon aligner: prank or macse (default)
aligner:
##################################################
###### TREE
##################################################
# Choice of tree builder: iqtree or phyml (default)
builder:
# Options for running PhyML
# Write the command as you would to run PhyML yourself, in the following form: phyml -i ALN [the rest of your options]
# For example, to run PhyML with a GTR model with 100 bootstraps, the option would be phymlOpt:phyml -i ALN -m GTR -b 100
# Please be aware that PhyML will run even if your options are wrong, but with its own default parameters
phymlOpt:
##################################################
###### ORTHOLOGS
##################################################
# Path to the species tree for the detection of duplication events and ortholog group assignment
# Species names must be formatted as speSpe or speSpeSpe (ex: homSap, gorGorGor)
sptree:
# Option for the identification of duplication events (default: False)
duplication:
###############################################
##### CLEANING
###############################################
# Option for Long Branch separation (default: cutoff(50))
# IQR or cutoff, factor can be put after in parenthesis (ex: cutoff(50))
# EXPERIMENTAL
LBopt:
# Minimum number of species for assignment to an ortholog group (default: 8)
nbspecies:
##################################################
###### RECOMBINATION
##################################################
# Option for the detection of recombination events (default: False)
recombination:
##################################################
###### POSITIVE SELECTION
##################################################
# Option for the detection of positive selection (default: False)
positiveSelection:
# P-value for Hyphy methods (BUSTED/MEME) (Pond *et al.*, 2005) (default: 0.1)
hyphySeuil:
# Option for using the Hyphy method BUSTED (Murrell *et al.*, 2015) (default: False)
busted:
# Option for using the Hyphy method MEME (Murrell *et al.*, 2012) (default: False)
meme:
# Models to be computed by BIO++ (Gueguen *et al.*, 2013) and/or PAML (Yang, 2007)
# Implemented models: M0, M1, M2, M7, M8, M8a, DFP07, DFP07_0
# Must be comma separated (ex: M0,M1,M2)
#
# Rate distributions are either Constant or Gamma(n=4)
# Default is Gamma, and explicit rate distributions are available through "_C" or "_G" suffixes to model names (ex: M0_C, M0_G)
models: M0, M1, M2
# Option for using paml for the detection of sites under positive selection (default: False)
paml:
# Option for using BIO++ for the detection of sites under positive selection
# If True, parameter file will be automatically generated
# Can be used to indicate the path to a BIO++ parameter file
bppml:
# Same as for bppml, but for extracting the results computed by bppml
mixedlikelihood:
# Option for using BIO++ for the detection of branches under positive selection
# If True, parameter file will be automatically generated
# Positive selection on each branch is assessed through an LRT of the M2 vs M1 models in BIO++.
# Parameters different from omega are shared between all branches.
opb:
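The two maxLen rules described in the parameter file (cutoff and IQR) can be illustrated as follows. This is a sketch of the rules as worded in the comments, not DGINN's own code: the function names are hypothetical, the quartiles come from Python's `statistics.quantiles` (DGINN may compute them differently), and the factor used when none is given in parentheses is an assumption.

```python
import re
import statistics


def parse_option(option, default_factor=3.0):
    """Parse 'cutoff(3)' or 'IQR(1.5)' into (method, factor).

    default_factor is an assumption for when no factor is given.
    """
    match = re.fullmatch(r"\s*(cutoff|IQR)\s*(?:\(\s*(\d+(?:\.\d+)?)\s*\))?\s*", option)
    if match is None:
        raise ValueError(f"unrecognised option: {option!r}")
    method, factor = match.groups()
    return method, float(factor) if factor else default_factor


def max_length(lengths, option="cutoff(3)"):
    """Return the length above which sequences would be removed."""
    method, factor = parse_option(option)
    if method == "cutoff":
        # cutoff: (factor) times the median of the length distribution
        return factor * statistics.median(lengths)
    # IQR: third quartile plus (factor) times the interquartile range
    q1, _, q3 = statistics.quantiles(lengths, n=4)
    return q3 + factor * (q3 - q1)


def filter_long(lengths, option="cutoff(3)"):
    """Keep only the lengths that do not exceed the threshold."""
    limit = max_length(lengths, option)
    return [n for n in lengths if n <= limit]
```

With lengths 100, 110, 120, 130 and 1000, the median is 120, so `cutoff(3)` sets the threshold at 360 and only the 1000-long sequence is dropped.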
Step | Necessary file(s) | Format |
---|---|---|
blast | CDS of the gene of interest | Fasta |
accessions | List of blast results | NCBI tabulated format (tsv) |
fasta | List of accession identifiers (one/line) | Txt |
orf | mRNA sequences of orthologs | Fasta |
alignment | CDS sequences of orthologs | Fasta |
tree | (codon) alignment of orthologs | Fasta |
duplication | (codon) alignment, gene tree | Fasta, newick |
recombination | (codon) alignment | Fasta |
positiveSelection | codon alignment, gene tree | Fasta, newick |
File order must be respected and follow the one indicated in this table.
Though codon alignments are not technically necessary for the tree, duplication and recombination steps, they are for positiveSelection. Thus, starting at steps upstream of positiveSelection with non-codon alignments will probably lead to failure at the positiveSelection step.
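The table above can be transcribed into a small lookup that checks the number and order of input files before a run. This is an illustrative sketch only: `STEP_INPUTS` and `check_infiles` are hypothetical names, and the check does not inspect file contents.

```python
# Expected inputs per entry step, in the order given by the table.
STEP_INPUTS = {
    "blast": ["CDS of the gene of interest (fasta)"],
    "accessions": ["blast results (tsv)"],
    "fasta": ["accession identifiers (txt)"],
    "orf": ["mRNA sequences of orthologs (fasta)"],
    "alignment": ["CDS sequences of orthologs (fasta)"],
    "tree": ["(codon) alignment (fasta)"],
    "duplication": ["(codon) alignment (fasta)", "gene tree (newick)"],
    "recombination": ["(codon) alignment (fasta)"],
    "positiveSelection": ["codon alignment (fasta)", "gene tree (newick)"],
}


def check_infiles(step, infiles):
    """Pair each given file with the input expected at that position."""
    if step not in STEP_INPUTS:
        raise ValueError(f"unknown step {step!r}; expected one of {list(STEP_INPUTS)}")
    expected = STEP_INPUTS[step]
    if len(infiles) != len(expected):
        raise ValueError(
            f"step {step!r} needs {len(expected)} file(s) ({', '.join(expected)}), "
            f"got {len(infiles)}"
        )
    # Order matters: the first file is taken as the first expected input.
    return list(zip(expected, infiles))
```

Calling `check_infiles("positiveSelection", ["ex_aln.fasta", "ex_genetree.tree"])` pairs the alignment with the first slot and the gene tree with the second, which mirrors the file order the table requires.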
DGINN includes several software tools to check for positive selection: the Hyphy methods BUSTED and MEME, PAML, and BIO++ (site models and the one-per-branch model).
The first three methods are automatically parameterized in DGINN.
For BIO++, the parameter files can be automatically generated by DGINN, but the user can also provide their own parameter files if they wish to tweak the parameters further. The OPB option can also be used for different analyses using Bio++ as its results do not influence any subsequent step. Example parameter files for bppml and bppmixedlikelihoods (for site models) are provided in examples/, as well as a parameter file for running a one-per-branch model.
Users wishing to do the fastest check possible on their genes of interest are encouraged to run only BIO++ site models, as our validation results indicate that they provide the best compromise between solid results and short running times.
In the examples folder, two parameter files are provided.
NB: these files should be updated with the paths to the files referred to instead of just their name when using DGINN through the command line and not through the docker.
python3 DGINN.py -p parameters.txt
Launches DGINN steps 1-7 on ex_CCDS.fasta.
python3 DGINN.py -p parameters_possel.txt
Launches DGINN step 8 on ex_aln.fasta and ex_genetree.tree.
DGINN was validated on nineteen primate genes with known evolutionary histories, and results can be consulted on BioRxiv (doi: https://doi.org/10.1101/2020.02.25.964155). Results from the validation are available in the corresponding repository.
Several utility scripts for use upstream and downstream of DGINN are provided in the etc folder.
multi_dginn.py allows the user to run several DGINN containers in parallel, from a file of several inputs. A maximum number of running containers can be entered (default is 4) and processes are run up to this maximum. Later (when some runs are completed) the same script can be run on the same file of inputs, and successive analyses will be run. So, by calling this script repeatedly, the user will easily complete as many analyses as wanted.
python3 DGINN/etc/multi_dginn.py dataname -p parameters [-i image][-v][-j jobs]
where:
dataname: name of the file where the input names are stored per
line (aka used as argument of --infile option of
DGINN.py).
-p parameters: name of the DGINN parameters file (aka used with
option -p in DGINN.py).
optional arguments:
-h show this help message and exit
-i image: name of the docker image used (default lpicard/dginn).
-j jobs : number of jobs used in parallel (default 4)
-v verbosity of the commands. (default False)
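The idea of capping the number of simultaneous runs can be sketched with a bounded worker pool. This is not how multi_dginn.py is necessarily implemented: `run_one` and `run_bounded` are hypothetical helpers, and each command is assumed to be a ready-made argument list (for example a docker run invocation for one input).

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_one(command):
    """Run one command (a list of arguments) and return its exit code."""
    return subprocess.run(command, check=False).returncode


def run_bounded(commands, jobs=4, runner=run_one):
    """Run all commands with at most `jobs` of them active at once.

    Exit codes are returned in the same order as the commands.
    """
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # pool.map starts a new command as soon as a worker is free,
        # so no more than `jobs` commands are running at any time.
        return list(pool.map(runner, commands))
```

Threads are sufficient here because each worker only waits for an external process; the `runner` argument makes it possible to substitute a dry-run function when testing.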
CCDSQuery.py allows the user to download the CCDS sequences of human genes, by providing the properly formatted file obtained through HGNC. This file should at least contain a column titled "Approved symbol" and another titled "CCDS accession".
python3 DGINN/etc/CCDSQuery.py -h
usage: DGINN/etc/CCDSQuery.py [-h] -in <filename>
This program get sequences' genes from HGNC Biomart.
optional arguments:
-h, --help show this help message and exit
Mandatory input infos for running:
-in <filename>, --inFile <filename>
Table of HGNC approved symbols (one per line) and
corresponding CCDS accessions for the genes of
interest, obtained from HGNC Biomart
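Before running the script, the input table can be checked for the two required column titles. The sketch below is independent of CCDSQuery.py; it assumes a tab-separated export, which is an assumption about the HGNC file, and the function names are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ("Approved symbol", "CCDS accession")


def missing_columns(table_text, delimiter="\t"):
    """Return the required column titles absent from the header line."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter=delimiter)
    header = reader.fieldnames or []
    return [column for column in REQUIRED_COLUMNS if column not in header]


def read_gene_accessions(table_text, delimiter="\t"):
    """Return (symbol, accession) pairs, skipping genes without accession."""
    missing = missing_columns(table_text, delimiter)
    if missing:
        raise ValueError(f"missing column(s): {', '.join(missing)}")
    reader = csv.DictReader(io.StringIO(table_text), delimiter=delimiter)
    return [
        (row["Approved symbol"], row["CCDS accession"])
        for row in reader
        if row["CCDS accession"]
    ]
```

Extra columns in the export are ignored, so the table does not need to be trimmed down to the two required ones.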
recup_to_parse produces a file that will be used by parseResults (see below) to parse the output of DGINN analyses.
recup_to_parse is to be run in the directory from which all DGINN analyses have been run, and where the output files "tag_DGINN_date.log" are written. The script scans all those files and, in case there have been several analyses, keeps the most recent successful one for each tag.
For each tag, the script prints whether the analysis seems to have worked well (+tag), reached positive selection but did not exit cleanly (~tag), or did not work at all (-tag). Any gene with a + or ~ sign is written to the parsing file.
python3 recup_to_parse.py [-o outfile]
where:
-o outfile: output file that will be parsed by parseResults.py
parseResults parses a file describing where the results of DGINN are to be found for several genes, and outputs a summary of the analyses in a single file.
The input file is composed of two tab-separated columns: the first one indicates the full path to the directories containing the positive selection results (the directory containing the subdirectories busted, bpp_site, paml_site, etc.), the second one the full path to the alignments on which those analyses were performed.
Ex: /PATH/TO/GENENAME_sequences_filtered_longestORFs_mafft_mincov_prank_results_TIMESTAMP1/positive_selection_results_TIMESTAMP2 /PATH/TO/GENENAME_sequences_filtered_longestORFs_mafft_mincov_prank.best.fas
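When the input file is written by hand rather than produced by recup_to_parse, the two-column layout can be generated as follows. This is a minimal sketch with a hypothetical helper name; it only formats the text and does not check that the paths exist.

```python
def format_parse_input(pairs):
    """Format (results_directory, alignment_path) pairs as two
    tab-separated columns, one gene per line."""
    lines = []
    for results_dir, alignment in pairs:
        # A tab inside a path would break the two-column layout.
        if "\t" in results_dir or "\t" in alignment:
            raise ValueError("paths must not contain tab characters")
        lines.append(f"{results_dir}\t{alignment}\n")
    return "".join(lines)
```

The returned string can be written to a file and passed to parseResults.py through the -in option.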
The script will output 2 different files:
The output files obtained with this script can be used to generate figures similar to those shown in the DGINN paper through the Shiny app, whose documentation can be found in the corresponding repository.
python3 DGINN/etc/parseResults.py -h
usage: DGINN/etc/parseResults.py [-h] [-v] -in <filename> [-o <path/to/directory>] [-pr <value>] [-pm <value>]
This program outputs a summary of the results obtained through running DGINN on a list of genes of interest.
optional arguments:
-h, --help show this help message and exit
-v, --version display DGINN/etc/parseResults.py version number and exit
Mandatory input infos for running:
-in <filename>, --inFile <filename>
List of all the directories containing the results from DGINN analyses on different genes, and their corresponding alignments.
Optional input infos (default values):
-o <path/to/directory>, --outdir <path/to/directory>
folder for analysis results (path - by default output file will be saved in the incoming directory)
-pr <value>, --postrate <value>
Threshold posterior probability of omega>1 to admit positive selected sites.
-pm <value>, --pvmeme <value>
Maximum p-value of PS site significance for MEME method.
In case of usage of DGINN, please cite: Lea Picard, Quentin Ganivet, Omran Allatif, Andrea Cimarelli, Laurent Guéguen, Lucie Etienne, DGINN, an automated and highly-flexible pipeline for the detection of genetic innovations on protein-coding genes, Nucleic Acids Research, Volume 48, Issue 18, 09 October 2020, Page e103, https://doi.org/10.1093/nar/gkaa680