tleonardi / nanocompore

RNA modifications detection from Nanopore dRNA-Seq data
https://nanocompore.rna.rocks
GNU General Public License v3.0

Interpreting results from ‘nanocompore sampcomp’ #123

Closed Huanle closed 4 years ago

Huanle commented 4 years ago

Hi @tleonardi ,

Thanks heaps for developing nanocompore, and thanks a lot in advance for your help. My question is about interpreting the results from nanocompore sampcomp. Can you explain the meaning of all the columns? Which parameter can I rely on to choose the most likely modified candidates? Thanks again.

tleonardi commented 4 years ago

Hi @Huanle, here's a description of each column (sorry, this should be in the docs)

The columns that follow report the p-values for the tests that you specified on the command line. These can be:

For more information on the statistics, see section "5.3.2 Statistical analysis" of our bioRxiv preprint. After each _pvalue column there's also a _context_X column (where X is the number specified with the --sequence_context option) that reports the p-value obtained after aggregating the p-values of the +/-X neighboring kmers with Hou's method (see "5.3.2 Statistical analysis" for details).
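To make the idea of a _context_X column concrete, here is a rough sketch of combining the p-values in a +/-X window around each position. Note that it uses plain Fisher's method, which assumes independent p-values, and not Hou's method from the preprint, so the numbers it produces will not match Nanocompore's output. The function names and the example p-values are made up for illustration.

```python
import math

def fisher_combine(pvalues):
    """Combine p-values with Fisher's method (a stand-in, NOT Hou's method)."""
    k = len(pvalues)
    # Half of the Fisher statistic: -sum(ln p). Clamp to avoid log(0).
    half_stat = -sum(math.log(max(p, 1e-300)) for p in pvalues)
    # Chi-square survival function with 2k degrees of freedom has a
    # closed form: exp(-x/2) * sum_{i<k} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)

def context_pvalues(pvalue_by_pos, x):
    """For each position, combine the p-values found at pos-x .. pos+x."""
    combined = {}
    for pos in pvalue_by_pos:
        window = [pvalue_by_pos[q]
                  for q in range(pos - x, pos + x + 1)
                  if q in pvalue_by_pos]
        combined[pos] = fisher_combine(window)
    return combined

# Hypothetical per-kmer p-values keyed by transcript position.
pvalues = {1138: 0.04, 1139: 0.01, 1140: 0.20, 1141: 0.60}
print(context_pvalues(pvalues, x=1))
```

Positions missing from the input (for example kmers that were not tested) are simply left out of the window here; how Nanocompore treats them is not covered in this thread.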

If the GMM option was specified on the command line, you will also have these columns:

Now, for your last question about which parameters to use: I don't think there's an absolute answer, and this likely depends on the modification under investigation and the specific experimental setting. This is essentially why we decided to implement multiple tests in nanocompore. Having said this, in our experience with small METTL3 knock-down experiments with 2 or 3 replicates, we obtained the best results with the GMM_logit method. The non-parametric tests on current or dwell time gave too many false positive hits, whereas with the GMM_anova method we didn't have enough statistical power with so few replicates. I hope this answers your question, and if not I'm happy to provide more information!
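As a starting point for shortlisting candidates along these lines, the sketch below reads a results TSV and keeps the rows whose GMM_logit p-value is below a cut-off, sorted from most to least significant. The column name GMM_logit_pvalue is an assumption based on the _pvalue naming described above, so check it against the header of your own file. The 0.01 threshold and the example p-values are arbitrary and only there for illustration.

```python
import csv
import io

def top_candidates(tsv_handle, pvalue_column="GMM_logit_pvalue", threshold=0.01):
    """Return rows with p-value <= threshold, sorted by ascending p-value."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    hits = []
    for row in reader:
        try:
            p = float(row.get(pvalue_column, ""))
        except (TypeError, ValueError):
            continue  # skip missing or non-numeric values such as "NA"
        if p <= threshold:  # NaN compares False, so it is skipped too
            hits.append((p, row))
    hits.sort(key=lambda hit: hit[0])
    return [row for _, row in hits]

# Made-up example rows; the p-values are NOT real results.
sample = (
    "pos\tref_id\tref_kmer\tGMM_logit_pvalue\n"
    "1138\tContig1\tGGCGA\t0.5\n"
    "1139\tContig1\tGCGAT\t0.0004\n"
    "1141\tContig1\tGATGC\tNA\n"
)
for row in top_candidates(io.StringIO(sample)):
    print(row["ref_id"], row["pos"], row["ref_kmer"], row["GMM_logit_pvalue"])
```

For a real file, open it with open(path, newline="") and pass the handle in. With many tested positions you would probably also want a multiple-testing correction before applying a cut-off.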

Huanle commented 4 years ago

Hi @tleonardi ,

Thanks very much for your explanation. One more question concerning the output: below is a snapshot of 'out_nanocompore_results.tsv'. As you can see, the columns chr, genomicPos and strand contain no relevant information. I had a quick look and found that the values in the pos column are actually the start index of the ref_kmer. Do you know why? Thanks a lot.

pos     chr     genomicPos      ref_id  strand  ref_kmer
1138    NA      NA      Contig1     NA      GGCGA
1139    NA      NA      Contig1     NA      GCGAT
1141    NA      NA      Contig1     NA      GATGC

a-slide commented 4 years ago

Hi @Huanle, these values are only populated if you run sampcomp with the --bed option, which allows Nanocompore to transform the transcript coordinates into genomic space.

--bed = BED file with annotation of transcriptome used for mapping (optional)

Cheers

tleonardi commented 4 years ago

This has probably happened because you didn't provide a BED file with the --bed argument when running Nanocompore. This argument refers to a BED file with the annotation of the transcriptome (i.e. the transcriptome used for mapping) in genomic coordinates. Although this file is not required to run Nanocompore, it is needed to convert coordinates from transcriptome-space to genome-space.
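To illustrate what the transcriptome-to-genome conversion involves, here is a minimal sketch that maps a transcript position onto the genome using a single BED12 record. It is not Nanocompore's own code: the function name is hypothetical, the BED line is invented, and it assumes 0-based transcript positions and a standard 12-column BED with exon blocks.

```python
def transcript_to_genome(bed_line, tx_pos):
    """Map a 0-based transcript position to a 0-based genomic position."""
    fields = bed_line.rstrip("\n").split("\t")
    chrom, chrom_start, strand = fields[0], int(fields[1]), fields[5]
    # Exon sizes and starts (relative to chrom_start); a trailing comma is allowed.
    sizes = [int(s) for s in fields[10].split(",") if s]
    starts = [int(s) for s in fields[11].split(",") if s]
    total = sum(sizes)
    if not 0 <= tx_pos < total:
        raise ValueError("position is outside the transcript")
    # On the minus strand the transcript runs right-to-left along the genome.
    offset = tx_pos if strand == "+" else total - 1 - tx_pos
    # Walk the exons left to right until the offset falls inside one.
    for size, start in zip(sizes, starts):
        if offset < size:
            return chrom, chrom_start + start + offset, strand
        offset -= size

# Invented two-exon transcript: exons of 100 and 200 bases, 300 bases apart.
bed = "chr1\t1000\t1600\ttx1\t0\t+\t1000\t1600\t0\t2\t100,200\t0,400"
print(transcript_to_genome(bed, 99))   # last base of the first exon
print(transcript_to_genome(bed, 100))  # first base of the second exon
```

Without an annotation like this there is no way to know where the exons of a transcript sit on the genome, which is why chr, genomicPos and strand stay as NA when no BED file is given.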

Huanle commented 4 years ago

Hi @tleonardi @a-slide ,

Thanks heaps for your prompt answers, which definitely cleared up my confusion.