tleonardi / nanocompore

RNA modifications detection from Nanopore dRNA-Seq data
https://nanocompore.rna.rocks
GNU General Public License v3.0

nanocompore sampcomp output question. #219

Closed fgfrost closed 1 year ago

fgfrost commented 1 year ago

Hi, I ran sampcomp on some example data I got from GEO and I have a few questions about the output that I can't find answers for in the documentation.

Here's some sample output for reference:

  pos chr genomicPos               ref_id strand ref_kmer GMM_logit_pvalue KS_dwell_pvalue KS_intensity_pvalue GMM_cov_type GMM_n_clust                             cluster_counts           Logit_LOR
1  21  NA         NA ENSMUST00000082392.1     NA    ACACT     3.717418e-05    0.9878572042        6.149483e-06         full           2  Mettl3_WT_1:162/829__Mettl3_KO_1:391/3439  0.5442894105931009
2  25  NA         NA ENSMUST00000082392.1     NA    TCCTC     4.255243e-03    0.0003578062        1.695496e-01         full           2 Mettl3_WT_1:436/587__Mettl3_KO_1:1389/2536 0.30488478067150665
3  33  NA         NA ENSMUST00000082392.1     NA    CCCAT     4.200304e-03    0.0328759602        1.416333e-01         full           2  Mettl3_WT_1:176/879__Mettl3_KO_1:469/3519  0.4097113989896163
4  38  NA         NA ENSMUST00000082392.1     NA    TCTAA     1.335772e-14    0.1819068983        1.559103e-12         full           2  Mettl3_WT_1:320/768__Mettl3_KO_1:714/3388   0.682357782537413
5  41  NA         NA ENSMUST00000082392.1     NA    AATCG     7.497882e-03    0.0011759232        8.640439e-01         full           2 Mettl3_WT_1:556/517__Mettl3_KO_1:1822/2247 0.28214743400271897

My questions are about the cluster_counts and Logit_LOR fields: what exactly do they describe? In particular:

  1. What are the numerator and denominator in the cluster_counts field? I want to get an idea of the number of modified and unmodified bases at a given position, and I think this is the field that conveys that, but sometimes the numerator is greater than the denominator. Does that mean the numerator is modified bases and the denominator is unmodified bases?
  2. Assuming that Logit_LOR stands for log odds ratio, what is this odds ratio specifically? Does it convey the ratio of modification, or is it simply another confidence statistic similar to the p-value?

lmulroney commented 1 year ago

Hi @fgfrost,

Your question about the cluster counts is very close to the already answered issue #210. Here is an excerpt from that answer to help with reading the cluster_counts field: "The basic way to read those lines is: [sample name]:[number of reads assigned to cluster 1]/[number of reads assigned to cluster 2]__repeated for each further sample." - #210
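
To make that format concrete, here is a minimal Python sketch that splits a cluster_counts string into per-sample read counts. The function name `parse_cluster_counts` is made up for illustration and is not part of nanocompore. It assumes two clusters per sample and that `__` only appears as the separator between samples, as in the output posted above.

```python
def parse_cluster_counts(field):
    """Split 'sample:n1/n2__sample:n1/n2' into {sample: (n1, n2)}."""
    counts = {}
    for entry in field.strip().split("__"):
        # rpartition splits on the last ':' so the sample name stays whole
        sample, _, pair = entry.rpartition(":")
        cluster1, cluster2 = pair.split("/")
        counts[sample] = (int(cluster1), int(cluster2))
    return counts


# cluster_counts value from position 21 in the output above
row1 = "Mettl3_WT_1:162/829__Mettl3_KO_1:391/3439"
print(parse_cluster_counts(row1))
```

Read this way, Mettl3_WT_1 has 162 reads in cluster 1 and 829 reads in cluster 2 at position 21. The two numbers are counts per cluster, not a fraction, which is why the first can be larger than the second.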

The LOR is indeed the log odds ratio. Here is a quote from our recent protocol paper that explains it: "The absolute value of LOR 0.5 means that the probability of a read from one condition is ∼3 times more likely in one cluster than a read from the other condition. A larger absolute value of LOR means that reads from one condition are more heavily enriched in one cluster, and an absolute value of LOR closer to 0 means that the two sample labels are more evenly distributed between the two clusters." Mulroney et al., Current Protocols (2023), https://doi.org/10.1002/cpz1.683
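
As a rough illustration of what an odds ratio between clusters looks like, the sketch below computes one by hand from the cluster counts of two samples. This is plain arithmetic, not nanocompore's own code: the helper name `log_odds_ratio` is hypothetical, and the use of the natural log is an assumption. On the five rows posted above it lands within about 0.01 of the reported Logit_LOR, which is only an observation about this sample output.

```python
import math


def log_odds_ratio(counts_a, counts_b):
    """Natural-log odds ratio of cluster 1 membership, sample A vs sample B.

    Each argument is (reads in cluster 1, reads in cluster 2).
    Assumes all counts are non-zero.
    """
    a1, a2 = counts_a
    b1, b2 = counts_b
    odds_a = a1 / a2  # odds that a read from A falls in cluster 1
    odds_b = b1 / b2  # odds that a read from B falls in cluster 1
    return math.log(odds_a / odds_b)


# Position 21 in the output above: WT 162/829, KO 391/3439
lor = log_odds_ratio((162, 829), (391, 3439))
print(round(lor, 2))
```

A value near 0 here means both samples split between the two clusters in similar proportions; a larger absolute value means one sample is more concentrated in one of the clusters.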

fgfrost commented 1 year ago

Thank you! That's very helpful. So my understanding is that there is no inherent assignment between clusters and modification status, i.e. cluster 1 does not inherently mean modified or unmodified. Is that correct? And if that's the case, does cluster 1 (whether unmodified or modified) correspond to the same modification state in each sample?

lmulroney commented 1 year ago

Hi @fgfrost,

The cluster number is completely random from position to position and does not correspond to a modified or unmodified state.
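
One consequence of the arbitrary numbering can be checked with simple arithmetic: if the two cluster labels are swapped, a hand-computed log odds ratio flips sign but keeps its magnitude, which fits with the protocol paper describing the absolute value of the LOR. The sketch below is an illustration only; `log_odds_ratio` is a hypothetical helper that assumes the natural log, not a nanocompore function.

```python
import math


def log_odds_ratio(counts_a, counts_b):
    """Log odds ratio of cluster 1 membership, sample A vs sample B."""
    (a1, a2), (b1, b2) = counts_a, counts_b
    return math.log((a1 / a2) / (b1 / b2))


# Position 38 in the output above: WT 320/768, KO 714/3388
wt, ko = (320, 768), (714, 3388)

original = log_odds_ratio(wt, ko)
# Swap which cluster is called "1" in both samples
swapped = log_odds_ratio(wt[::-1], ko[::-1])

# Same magnitude, opposite sign
print(math.isclose(original, -swapped))
```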

Cheers, Logan

fgfrost commented 1 year ago

Got it, thank you so much for answering all of my questions!!

lmulroney commented 1 year ago

Hi @fgfrost,

Glad I could help, Logan