tleonardi / nanocompore

RNA modifications detection from Nanopore dRNA-Seq data
https://nanocompore.rna.rocks
GNU General Public License v3.0
78 stars 12 forks source link

Use a peak calling algorithm to call significant positions from the pValue graph #95

Open a-slide opened 5 years ago

a-slide commented 5 years ago

At the moment Nanocompore reports all the positions bellow the pvalue threshold, but when a position is significant the adjacent kmers are often significant as well. Using a peak calling algorithm would allow to refine the position of the modified residue

tleonardi commented 4 years ago

This is something to consider for a future update. I've added it to the Version 2 milestone.

a-slide commented 4 years ago

I have got a prototype that seems to be working alright. See example after: NC_peakcalling

a-slide commented 4 years ago

Ideally, it could be done directly in TxComp, but would require TxComp to be able to correct pValues without knowing all the pvalues. Random permutations and empirical pValues?

mbhall88 commented 2 years ago

Hey guys. In the MetaCompore pipeline, there is a nanocompore postprocessing script that calls peaks using scipy's find_peaks.

I'm just trying to wrap my head around what the presence/absence of a peak means?

For example,

ref_id  pos     pvalue  peak
ATCC11775_23S_rRNA      1132    0.0085561746316739      False
ATCC11775_23S_rRNA      1144    0.0023775255077 False
ATCC11775_23S_rRNA      1612    0.0086023977807072      False
ATCC11775_23S_rRNA      1613    2.803727380359832e-06   True
ATCC11775_23S_rRNA      1614    0.0005099579444844      False
ATCC11775_23S_rRNA      1615    0.000140505982829       False
ATCC11775_23S_rRNA      1616    0.0029064514858123      False
ATCC11775_23S_rRNA      1625    0.0049815292659649      False
ATCC11775_23S_rRNA      1626    0.0003991030640439      False
ATCC11775_23S_rRNA      1627    0.000140505982829       True
ATCC11775_23S_rRNA      1628    0.0052356375331268      False

is the idea that I should really ignore the False peaks because they are surrounded by other significant sites? So in the above example, position 1613 should be used and 1614-16 be ignored? If so, what about 1132 where it is significant, has a False peak, but has no surrounding positions that are significant?

tleonardi commented 2 years ago

Hi, yes that's the fundamental ideal. When you have neighboring significant kmers, the peak is the most significant one. It's just a handy way of collapsing the effect of a modification (which can impact multiple neighboring kmers) to a single kmer. Just keep in mind that this doesn't mean that the modification is the central base of the peak kmer. Not sure why 1132 is not marked as a peak.. can you share the input file?

mbhall88 commented 2 years ago

outnanocompore_results.tsv.gz