This workflow generates codon- and gene-based analyses from mitochondrial ribosome profiling data produced using the protocols described in "Monitoring mitochondrial translation with ribosome profiling" (Li et al., Nature Protocols, 2020).
Dependencies are installed using Bioconda. The workflow is written using Snakemake.
This workflow is designed to take FASTQ files from an Illumina sequencer and map each mitoribosome footprint to its occupied A-site. This mapping allows detailed monitoring of mitoribosome translation dynamics.
Starting with FASTQ files, the workflow is divided into three main parts: QC and metagene analysis, A-site assignment and codon count table generation, and downstream codon occupancy analysis. In the first part, the workflow runs QC on the input FASTQ files, aligns all of the reads to the genome (nuclear and mitochondrial genomes together), and performs a metagene analysis. The metagene analysis helps the user determine the offset from the 3' end of each read to the ribosomal A-site. Once the offset is determined, all reads are assigned to their A-sites at the nucleotide level, and the counts for each codon in each gene are arranged in a table to facilitate downstream analysis. From the codon count table, two common analyses, presented in Figure 7 of the manuscript, are provided to visualize mitoribosome distribution on mitochondria-encoded genes: codon occupancy analysis and cumulative mitoribosome footprints along the transcripts.
The workflow writes its results to the following directories:
mapped
- BAM alignment files
metagene
- plastid metagene count output, used to determine the A-site (using the stop codon) and P-site (using the start codon) offsets
codon_count
- codon count tables of all samples, both per sample and combined into one file
phasing_analysis
- plastid phase_by_size output to estimate sub-codon phasing, stratified by read length
bedgraph
- plastid make_wiggle output: genome browser tracks from read alignments, using mapping rules to extract ribosomal A-sites from the alignments
qc
- the quality control analysis, including read depth and coverage
figures
- all the figures
tables
- cumulative codon counts for each gene and codon occupancy analysis results
logs
- logs from each of the workflow steps, used in troubleshooting
trimmed
- FASTQ files that have been trimmed of adapter and low-quality sequences
Two plastid analyses are included:
metagene
- to provide a profile of counts relative to start and stop codons, used to determine the A-site and P-site offset for the experiment
phase_by_size
- to estimate sub-codon phasing, stratified by read length
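Once a run has finished, a quick sanity check is to confirm that each of these directories was populated, for example with a small shell loop over the directory names listed above (a sketch; it assumes the directories sit in the current working directory):
# Count the files written into each output directory
for d in trimmed mapped metagene phasing_analysis bedgraph codon_count qc figures tables logs; do
    printf '%-20s %s files\n' "$d" "$(find "$d" -type f 2>/dev/null | wc -l)"
done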
Install conda
Enable the Bioconda channel (requires 64-bit Linux or macOS; Windows is not supported)
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
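You can confirm the resulting channel order with:
conda config --show channels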
Clone workflow into working directory
git clone https://github.com/sophiahjli/MitoRiboSeq.git
cd MitoRiboSeq
NOTE: Do not install into a path that includes spaces (' '). Spaces can cause issues with some conda packages.
Input data
FASTQ files - the FASTQ data from the sequencer should be stored in data/fastq in fastq.gz format, one file per sample.
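For example, with two hypothetical conditions and two replicates each (the sample names here are illustrative only), the input directory would contain:
data/fastq/wildtype_rep1.fastq.gz
data/fastq/wildtype_rep2.fastq.gz
data/fastq/mutant_rep1.fastq.gz
data/fastq/mutant_rep2.fastq.gz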
Edit configuration files as needed
cp code/mito_config.defaults.yml code/mito_config.yml
nano code/mito_config.yml
# Only if running on a cluster
cp cluster_config.yml mycluster_config.yml
nano mycluster_config.yml
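As an illustration only, a minimal mycluster_config.yml might define default per-job resources under the keys referenced by the cluster examples further below ({cluster.n}, {cluster.memory}, {cluster.time}, {cluster.qos}); the authoritative keys and defaults are those in the cluster_config.yml template shipped with the workflow:
# Hypothetical defaults applied to every rule unless overridden per rule
__default__:
    n: 4
    memory: "8G"
    time: "04:00:00"
    qos: "normal"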
Install dependencies into an isolated environment
conda env create --file code/mitoriboseq_environment.yml
Note: If you are updating the workflow, you may need to update the conda environment
conda env update --file code/mitoriboseq_environment.yml
Activate the environment
source activate MitoRiboSeq
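On newer conda installations, the equivalent command is:
conda activate MitoRiboSeq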
Execute the trimming, mapping, read phasing, and metagene workflow
snakemake \
-s code/mito_readphasing_metagene.snakefile \
--configfile "code/mito_config.yml" \
--use-conda \
--cores 4
If --configfile is not specified, the defaults are used.
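To preview the planned jobs without executing anything, the same command can be run with --dryrun added (see the option descriptions below):
snakemake \
-s code/mito_readphasing_metagene.snakefile \
--configfile "code/mito_config.yml" \
--dryrun \
--cores 4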
Execute the codon occupancy workflow
snakemake \
-s code/mito_codontable.snakefile \
--configfile "code/mito_config.yml" \
--use-conda \
--cores 4
--cores N
- Use at most N CPU cores/jobs in parallel. If N is omitted or 'all', the limit is set to the number of available CPU cores. Required.
--configfile "myconfig.yml"
- Override defaults using the configuration found in myconfig.yml.
--dryrun
- Do not execute anything, and display what would be done. If you have a very large workflow, use --dryrun --quiet to just print a summary of the DAG of jobs.
--use-conda
- Use conda to create an environment for each rule, installing and using the exact version of the software required (recommended).
--cluster
- Execute snakemake rules with the given submit command, e.g. qsub. Snakemake compiles jobs into scripts that are submitted to the cluster with the given command, once all input files for a particular job are present. The submit command can be decorated to make it aware of certain job properties (input, output, params, wildcards, log, threads, and dependencies; see the argument below), e.g.: $ snakemake --cluster 'qsub -pe threaded {threads}'.
--cluster-config
- A JSON or YAML file that defines the wildcards used in --cluster for specific rules, instead of having them specified in the Snakefile. For example, for rule job you may define: { 'job' : { 'time' : '24:00:00' } } to specify the time for rule job. You can specify more than one file; the configuration files are merged, with later values overriding earlier ones.
--set-threads [RULE=THREADS [RULE=THREADS ...]]
- Overwrite the thread usage of rules. This allows fine-tuning of workflow parallelization, for example to target certain cluster nodes by shifting a rule to use more, or fewer, threads than defined in the workflow. THREADS must be a positive integer, and RULE must be the name of the rule. (default: None)
--drmaa
- Execute snakemake on a cluster accessed via DRMAA. Snakemake compiles jobs into scripts that are submitted to the cluster, once all input files for a particular job are present. ARGS can be used to specify options of the underlying cluster system, using the job properties input, output, params, wildcards, log, threads, and dependencies, e.g.: --drmaa ' -pe threaded {threads}'. Note that ARGS must be given in quotes and with a leading whitespace.
See the Snakemake documentation for a list of all options.
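For example, to run a workflow locally using a custom configuration: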
snakemake --configfile "myconfig.yml" --use-conda --cores 4
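To run on a SLURM cluster, submitting each job with sbatch and taking per-rule resources from a cluster configuration file: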
snakemake \
--configfile "myconfig.yml" \
--cluster-config "mycluster_config.yml" \
--cluster "sbatch --cpus-per-task={cluster.n} --mem={cluster.memory} --time={cluster.time}" \
--use-conda \
--cores 100
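Or, submitting jobs through DRMAA with a cluster configuration that also specifies a QOS: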
snakemake \
--configfile "myconfig.yml" \
--cluster-config "cetus_cluster.yml" \
--drmaa " --cpus-per-task={cluster.n} --mem={cluster.memory} --qos={cluster.qos} --time={cluster.time}" \
--use-conda \
--cores 1000 \
--output-wait 60