ulelab / peka

Find motifs enriched around prominent crosslinks
GNU General Public License v3.0
5 stars 2 forks source link
bioinformatics bioinformatics-analysis bioinformatics-scripts

PEKA

DOI

Positionally-enriched k-mer analysis (PEKA) is a software package for identifying enriched protein-RNA binding motifs from CLIP datasets. PEKA compares k-mer enrichment in proximity of high-confidence crosslink sites (tXn - thresholded crosslinks), located within crosslinking peaks and having a high cDNA count, relative to low-count crosslink sites located outside of peaks (oXn - outside crosslinks). This approach reduces the effects of technical biases, such as uridine-preference of UV crosslinking. Each k-mer is assigned a PEKA score, which is used to rank the k-mers from the most to the least enriched. Additionally, PEKA provides comprehensive visualisations of motif enrichment profiles around the high-confidence crosslink sites and clusters the motifs that display similar profiles. PEKA also enables motif discovery within specific transcriptomic regions, including or excluding repetitive elements.

To interactively explore PEKA applied to all ENCODE eCLIP data, visit flow.bio. For a more detailed description of PEKA method please refer to our paper:

Kuret, K., Amalietti, A.G., Jones, D.M. et al. Positional motif analysis reveals the extent of specificity of protein-RNA interactions observed by CLIP. Genome Biol 23, 191 (2022). https://doi.org/10.1186/s13059-022-02755-2.

Authors of PEKA source code: aram.amalietti@gmail.com, klara.kuret@ki.si

Dependencies:

python
matplotlib
numpy
pybedtools
scipy
seaborn
plumbum
scikit-learn
pandas
textdistance

Set up

We recommend running PEKA in a conda environment so all the dependencies are managed for you, to set this up run download the environment file and follow the steps below:

# Create the environmnet
conda env create -f environment.yml

# Activate the environment
conda activate peka

# Install the latest version of PEKA from the main branch
pip install git+https://github.com/ulelab/peka@main

Usage

usage: peka.py [-h] -i INPUTPEAKS -x INPUTXLSITES -g GENOMEFASTA -gi GENOMEINDEX -r REGIONS \
               [-k [{3,4,5,6,7}]] \
               [-o [OUTPUTPATH]] \
               [-w [WINDOW]] \
               [-dw [DISTALWINDOW]] \
               [-t [TOPN]] \
               [-p [PERCENTILE]] \
               [-c [CLUSTERS]] \
               [-s [SMOOTHING]] \
               [-re [{remove_repeats,masked,unmasked,repeats_only}]] \
               [-pos [RELEVANT_POS_THRESHOLD] | -relax {True,False}] \
               [-a {True,False}] \
               [-sr {genome,whole_gene,intron,UTR3,other_exon,ncRNA,intergenic}] \
               [-sub {True,False}] \
               [-seed {True,False}]

You can check your installation by going to the TestData directory and running the run_peka.sh script within the activated peka environment. This script uses a small test dataset found in TestData/inputs.

🔴 IMPORTANT NOTE! When you run PEKA on your data, make sure all the required inputs, i.e. bed files, genome in fasta format, genome index and regions file, follow the same naming convention for chromosome names. Either all files must use the UCSC (GENCODE) naming convention, which prefixes chromosome names with "chr" ("chr1", ..., "chrM") or all files should use Ensembl naming convention ("1", ..., "MT").

required arguments:
  -i INPUTPEAKS, --inputpeaks INPUTPEAKS
                        CLIP peaks (intervals of crosslinks) in BED file
                        format
  -x INPUTXLSITES, --inputxlsites INPUTXLSITES
                        CLIP crosslinks in BED file format
  -g GENOMEFASTA, --genomefasta GENOMEFASTA
                        genome fasta file, ideally the same as was used for
                        read alignment
  -gi GENOMEINDEX, --genomeindex GENOMEINDEX
                        genome fasta index file (.fai)
  -r REGIONS, --regions REGIONS
                        genome segmentation file produced as output of "iCount
                        segment" function

optional arguments:
  -h, --help            show this help message and exit
  -k [{3,4,5,6,7}], --kmerlength [{3,4,5,6,7}]
                        kmer length [DEFAULT 5]
  -o [OUTPUTPATH], --outputpath [OUTPUTPATH]
                        output folder [DEFAULT current directory]
  -w [WINDOW], --window [WINDOW]
                        window around thresholded crosslinks for finding
                        enriched kmers [DEFAULT 20]
  -dw [DISTALWINDOW], --distalwindow [DISTALWINDOW]
                        window around enriched kmers to calculate distribution
                        [DEFAULT 150]
  -t [TOPN], --topn [TOPN]
                        number of kmers ranked by z-score in descending order
                        for clustering and plotting [DEFAULT 20]
  -p [PERCENTILE], --percentile [PERCENTILE]
                        Percentile for considering thresholded crosslinks.
                        Accepts a value between 0 and 1 [0, 1). Percentile 0.7
                        means that a cDNA count threshold is determined at
                        which >=70 percent of the crosslink sites within the
                        region have a cDNA count equal or below the threshold.
                        Thresholded crosslinks have cDNA count above the
                        threshold. [DEFAULT 0.7]
  -c [CLUSTERS], --clusters [CLUSTERS]
                        how many enriched kmers to cluster and plot [DEFAULT
                        5]
  -s [SMOOTHING], --smoothing [SMOOTHING]
                        window used for smoothing kmer positional distribution
                        curves [DEFAULT 6]
  -re [{remove_repeats,masked,unmasked,repeats_only}], --repeats [{remove_repeats,masked,unmasked,repeats_only}]
                        how to treat repeating regions within genome (options:
                        "masked", "unmasked", "repeats_only",
                        "remove_repeats"). When applying any of the options
                        with the exception of repeats == "unmasked", a genome
                        with soft-masked repeat sequences should be used for
                        input, ie. repeats in lowercase letters. [DEFAULT
                        "unmasked"]
  -pos [RELEVANT_POS_THRESHOLD], --relevant_pos_threshold [RELEVANT_POS_THRESHOLD]
                        Percentile to set as threshold for relevant positions.
                        Accepted values are floats between 0 and 99 [0, 99].
                        If threshold is set to 0 then all positions within the
                        set window (-w, default 20 nt) will be considered for
                        enrichment calculation. If threshold is not zero, it
                        will be used to determine relevant positions for
                        enrichment calculation for each k-mer. If the -pos
                        option is not set, then the threshold will be
                        automatically assigned based on k-mer lengthand number
                        of crosslinks in region.
  -relax {True,False}, --relaxed_prtxn {True,False}
                        Whether to relax automatically calculated prtxn
                        threshold or not. Can be 'True' or 'False', default is
                        'True'. If 'True', more positions will be included for
                        PEKA-score calculation across k-mers. Using relaxed
                        threshold is recommended, unless you have data of very
                        high complexity and are using k-mer length <=5. This
                        argument can't be used together with -pos argument,
                        which sets a user-defined threshold for relevant
                        positions. [DEFAULT True]
  -a {True,False}, --alloutputs {True,False}
                        controls the number of outputs, can be True or False
                        [DEFAULT False]
  -sr {genome,whole_gene,intron,UTR3,other_exon,ncRNA,intergenic}, --specificregion {genome,whole_gene,intron,UTR3,other_exon,ncRNA,intergenic}
                        choose to run PEKA on a specific region only, to
                        specify multiple regions enter them space separated
                        [DEFAULT None]
  -sub {True,False}, --subsample {True,False}
                        if the crosslinks file is very large, they can be
                        subsampled to reduce runtime, can be True/False
                        [DEFAULT True]
  -seed {True,False}, --set_seeds {True,False}
                        If you want to ensure reproducibility of results the
                        option set_seeds must be set to True. Can be True or
                        False [DEFAULT True]. Note that setting seeds reduces
                        the randomness of background sampling.

Common issues

The script needs writing permission in the staging directory to save results and make an environment variable TMPDIR for temporary files. If you get KeyError: 'TMPDIR' a solution would be to type export TMPDIR=<path_to_folder> in terminal where you want to run the script.

Outputs

The default outputs produced by PEKA for each specified genomic region are:

If the option ALLOUTPUTS is set to True, the following files are also outputted for each specified genomic region:

Other outputs:

Plotting k-mer relative occurrences in heatmap format (preprint)

To produce k-mer relative occurrence heatmaps as shown in Figure 1g of our paper https://doi.org/10.1186/s13059-022-02755-2, use the script plotRelativeOccurrenceHeatmap.py in peka environment as follows:

# Example of run command
python3 ./src/plotRelativeOccurrenceHeatmap.py \
PathToKmerOccurrenceTable \
PathToKmerRtxnTable \
PathToOutputDirectory \
WindowAroundCrosslinkSite\
NumberOfTopKmers

Arguments

PathToKmerOccurrenceTable - file with k-mer PEKA-scores 5mer_distribution_whole_gene.tsv
PathToKmerRtxnTable - file with k-mer relative occurrences 5mer_rtxn_whole_gene.tsv
PathToOutputDirectory - location to which files will be saved
WindowAroundCrosslinkSite - n nucleotides up and downstream from crosslink to plot
NumberOfTopKmers - n top k-mers to plot

To run on test data head to TestData/RelativeOccurrenceHeatmap