Puzzle about the Figure 1g about Heatmaps showing relative occurrences (RtXn) and PEKA-scores for top 40 k-mers

Linhua-Sun commented 2 years ago

@kkuret Hi, I am trying to use peka to identify motifs in our in-house eCLIP-seq data from plants. Based on your bioRxiv preprint, peka seems better suited to peaks identified from CLIP-seq (unlike ChIP-seq peaks, for which people usually use MEME-ChIP or HOMER to identify motifs). Thanks for your great tool! I have tested peka on my data and generated a series of output files. A file with the suffix '*5mer_distribution_whole_gene.tsv' seems to contain information like that shown in your Figure 1g. I want to convert the table into a heatmap to better understand your paper, but I am confused about how the k-mer sequences on the left side of the heatmap relate to the first column of the tsv file. How do I convert the V1 column into the left part of the heatmap? I also chose the top 20 ranked rows based on PEKA-score.

Another question: how should motif enrichment results from CLIP-seq be presented in a typical experiment-centered paper, the way ChIP-seq results usually are? For example, like panel A in https://iiif.elifesciences.org/lax/53278%2Felife-53278-fig6-v2.tif/full/1500,/0/default.jpg. Do you have any suggestions? I am also not sure what value to show for significance (PEKA-score?). Thanks a lot.

kkuret commented 2 years ago

Dear @Linhua-Sun! Thank you for using peka.

1 - Heatmap

To recreate the plot in Fig. 1g, you should use the values in the files with the extension "5merrtxn.tsv". These values are derived from the raw occurrences reported in '5mer_distribution*.tsv' by dividing them by the average occurrence in the distal window around the crosslink. This normalization is good for plotting heatmaps because it makes the enriched positions of different k-mers easier to compare; otherwise, regional genomic differences in k-mer abundance would reduce the visibility of the less abundant k-mers.

To get the left end of the heatmap the way it is, I cluster motifs by their sequence, order the clusters by decreasing PEKA-score, and edit the labels to add padding based on the distance of the peak maximum from the crosslink site. I will share the code for re-creating these heatmaps on GitHub so you can simply re-run it. I will let you know when it's uploaded.
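To make the normalization described above concrete, here is a minimal sketch of dividing raw per-position occurrences by the mean occurrence in a distal window. It is not PEKA's own code: the function name, the `distal_min_distance` cutoff, and the toy numbers are all assumptions made for illustration, and PEKA's actual definition of the distal window may differ.

```python
from statistics import mean


def relative_occurrences(positions, occurrences, distal_min_distance):
    """Divide each raw occurrence by the mean occurrence at distal positions.

    positions: offsets relative to the crosslink site (0 = crosslink).
    occurrences: raw occurrence of one k-mer at each offset.
    distal_min_distance: hypothetical cutoff; offsets at least this far
        from the crosslink are treated as the distal window.
    """
    if len(positions) != len(occurrences):
        raise ValueError("positions and occurrences must have equal length")
    distal = [occ for pos, occ in zip(positions, occurrences)
              if abs(pos) >= distal_min_distance]
    if not distal:
        raise ValueError("no positions fall in the distal window")
    baseline = mean(distal)
    if baseline == 0:
        raise ValueError("distal baseline is zero; cannot normalize")
    return [occ / baseline for occ in occurrences]


# Toy data (made up): one k-mer, offsets -3..3 around the crosslink.
positions = [-3, -2, -1, 0, 1, 2, 3]
raw = [2.0, 2.0, 6.0, 8.0, 4.0, 2.0, 2.0]
rtxn = relative_occurrences(positions, raw, distal_min_distance=2)

# Offset of the peak maximum relative to the crosslink site.
peak_offset = positions[rtxn.index(max(rtxn))]
```

After this step a value of 1 means "same as the distal baseline", so k-mers with very different overall abundance end up on a comparable scale in the heatmap.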

2 - Weblogos

There isn't a direct way to convert PEKA output to a weblogo format, as the main objective of the method is to identify enriched k-mers. The indicator of enrichment strength is indeed the PEKA-score. A PEKA-score of 0 means there is no enrichment of a particular k-mer in the foreground relative to the background. The PEKA-score can't be directly linked to a p-value in the same way as a z-score, because the score distribution is not Gaussian; it usually has a heavy tail on the right (several k-mers will have much higher PEKA-scores than the rest). If you wish to add significance (a p-value), plot the distribution of PEKA-scores for your data and decide which statistical test is appropriate for you. See examples of PEKA-score kernel density estimates (PEKA-score on the x-axis) for 2 different eCLIPs:

To get weblogo-style output with other methods (such as those included in the MEME Suite), you could use the thresholded crosslinks to extract the foreground sequences and the background crosslinks (oxn) to extract the background sequences, then use those as inputs to MEME/STREME. I haven't attempted that yet, so I don't know how the results would turn out, but conceptually this approach would mimic what PEKA is doing.
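As a rough illustration of the sequence-extraction step suggested above, the sketch below cuts a fixed window around each crosslink and writes FASTA text that could then be given to a motif finder. Everything here is an assumption for illustration: the in-memory `genome` dict, the `(chrom, 0-based position, strand)` tuples, the `flank` size, and the function names are hypothetical and not part of PEKA or the MEME Suite.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def window_sequences(genome, crosslinks, flank):
    """Yield (name, sequence) for +/- flank nt around each crosslink.

    genome: dict mapping chromosome name to its sequence (assumed in memory).
    crosslinks: iterable of (chrom, 0-based position, strand).
    Windows that would run past a chromosome end are skipped.
    """
    for chrom, pos, strand in crosslinks:
        start, end = pos - flank, pos + flank + 1
        if start < 0 or end > len(genome[chrom]):
            continue
        seq = genome[chrom][start:end]
        if strand == "-":
            seq = reverse_complement(seq)
        yield f"{chrom}:{start}-{end}({strand})", seq


def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


# Toy data (made up): a tiny "genome" and three crosslink sites.
genome = {"chr1": "ACGTTTGCATGGA"}
foreground = [("chr1", 5, "+"), ("chr1", 5, "-"), ("chr1", 0, "+")]
fasta = to_fasta(window_sequences(genome, foreground, flank=2))
```

The same function would be called once with the thresholded crosslinks and once with the background crosslinks, giving the two FASTA inputs. Like the approach itself, this is untested on real data.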

For our paper, we generated sequence logos by clustering the top k-mers by sequence to produce k-mer alignments, then used the PWMs derived from those alignments as input to the seqlogo module, which plots them. However, these logos only serve to visually summarize a k-mer cluster and don't have any significance attached to them.
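For readers unfamiliar with the alignment-to-PWM step, here is a minimal sketch of turning already-aligned k-mers into a per-position frequency matrix. It is not the code used for the paper: the function name, the use of "-" as a gap character, and the example alignment are assumptions, and the clustering and alignment themselves are not shown.

```python
def position_frequency_matrix(aligned_kmers, alphabet="ACGT"):
    """Return one {base: frequency} dict per alignment column.

    aligned_kmers: equal-length strings; "-" marks a gap (assumed
        convention) and is ignored when computing frequencies.
    """
    if not aligned_kmers:
        raise ValueError("need at least one aligned k-mer")
    length = len(aligned_kmers[0])
    if any(len(kmer) != length for kmer in aligned_kmers):
        raise ValueError("all aligned k-mers must have the same length")
    matrix = []
    for column in zip(*aligned_kmers):
        counts = {base: column.count(base) for base in alphabet}
        total = sum(counts.values())
        matrix.append({base: (counts[base] / total if total else 0.0)
                       for base in alphabet})
    return matrix


# Toy alignment (made up): three 5-mers offset against each other.
aligned = ["TGCAT-", "-GCATG", "AGCATG"]
pfm = position_frequency_matrix(aligned)
```

A matrix like this could be passed to a logo-plotting tool. As noted above, the resulting logo summarizes a cluster visually and carries no significance of its own.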

Hope this was helpful!

Linhua-Sun commented 2 years ago

Thank you very much for your prompt reply. I'll take a closer look at the relevant details.

kkuret commented 2 years ago

@Linhua-Sun I added the script that produces a heatmap of relative k-mer occurrences. You can find instructions on how to run it in our README file. Let me know if you encounter any issues.

Linhua-Sun commented 2 years ago

Thanks again!

ulelab / peka

Puzzle about the Figure 1g about Heatmaps showing relative occurrences (RtXn) and PEKA-scores for top 40 k-mers #12