Interpreting results from ‘nanocompore sampcomp’

Huanle commented 4 years ago

Hi @tleonardi ,

Thanks heaps for developing nanocompore, and thanks a lot in advance for your help. My question is about how to interpret the results from nanocompore sampcomp. Can you detail the meaning of each column? Which parameter can I rely on to choose the most likely modified candidates? Thanks again.

tleonardi commented 4 years ago

Hi @Huanle, here's a description of each column (sorry, this should be in the docs)

pos: kmer position in transcript coordinates
chr: reference chromosome
genomicPos: kmer position in genome coordinates (relative to the BED file provided on the command line)
ref_id: transcript id
strand: genomic strand of the transcript
ref_kmer: kmer sequence

The columns that follow report the p-values for the tests that you specified on the command line. These can be:

GMM+logit
GMM+anova
Kolmogorov-Smirnov (KS) on intensity
Kolmogorov-Smirnov (KS) on dwell time
Mann-Whitney (MW) on intensity
Mann-Whitney (MW) on dwell time
t-test on intensity
t-test on dwell time

For more information on the statistics, see section "5.3.2 Statistical analysis" of our bioRxiv preprint. After each _pvalue column there's also a _context_X column (where X is the number specified with the --sequence_context option) that reports the p-value obtained after aggregating the p-values of the +/-X neighboring kmers with Hou's method (see "5.3.2 Statistical analysis" for details).
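To make the windowing idea concrete, here is a minimal sketch of combining each position's p-value with those of its +/-X neighbours. Note that it swaps in Fisher's method, which assumes independent tests, purely to keep the example short; it is not Hou's method and not Nanocompore's code. Neighbouring kmers overlap and are therefore correlated, so Fisher's method would be too optimistic on real data. The function names are hypothetical, and the sketch assumes one strictly positive p-value per consecutive position.

```python
import math

def fisher_combine(pvalues):
    """Combine p-values with Fisher's method (assumes independent tests)."""
    k = len(pvalues)
    # Half of Fisher's statistic; the full statistic -2*sum(ln p) follows a
    # chi-square distribution with 2k degrees of freedom under the null.
    half_stat = -sum(math.log(p) for p in pvalues)
    # Closed-form survival function of a chi-square with an even number
    # of degrees of freedom, evaluated at 2*half_stat.
    return math.exp(-half_stat) * sum(
        half_stat ** i / math.factorial(i) for i in range(k)
    )

def context_pvalues(pvalues, x):
    """For each position, combine its p-value with its +/- x neighbours."""
    combined = []
    for i in range(len(pvalues)):
        # The window is truncated at the ends of the sequence.
        window = pvalues[max(0, i - x): i + x + 1]
        combined.append(fisher_combine(window))
    return combined

# Toy input, not real Nanocompore output.
print(context_pvalues([0.5, 0.01, 0.02, 0.6], 1))
```

With a single p-value in the window the combined value equals the input, so a window size of 0 leaves the p-values unchanged.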

If the GMM option was specified on the command line, you will also have these columns:

GMM_cov_type: indicates the type of covariance used for the GMM. In the current version of nanocompore we only support type full (see here for more information)
GMM_n_clust: number of components of the optimal (i.e. lowest BIC) GMM. This can be interpreted as the number of clusters found at this position (in the current version of nanocompore this can only be 1 or 2).
cluster_counts: string reporting the number of reads counted in each cluster for each sample
Anova_delta_logit: delta log odds ratio, i.e. the difference between the two conditions in the mean log odds of data points belonging to cluster one. This can be interpreted as the magnitude of the shift of reads from one cluster to the other between the two conditions (not to be confused with the distance between the two clusters on the current/dwell time plane)
Logit_LOR: same as above, but for the logit method, i.e. merging replicates rather than averaging them.
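As a rough illustration of what a log odds ratio measures when replicates are merged, the sketch below computes it directly from pooled cluster counts. This is an assumption-laden toy, not how Nanocompore derives Logit_LOR: the function name, the pseudocount and the sign convention (condition 2 relative to condition 1) are all hypothetical choices made for the example.

```python
import math

def log_odds_ratio(cond1_counts, cond2_counts, pseudocount=0.5):
    """Log odds ratio of a read falling in cluster 1, condition 2 vs condition 1.

    Each argument is (reads_in_cluster_1, reads_in_cluster_2), pooled over
    the replicates of that condition. The pseudocount avoids dividing by
    zero when a cluster is empty (an illustrative choice, not Nanocompore's).
    """
    c1_a, c1_b = (n + pseudocount for n in cond1_counts)
    c2_a, c2_b = (n + pseudocount for n in cond2_counts)
    # Difference of the log odds of cluster-1 membership between conditions.
    return math.log(c2_a / c2_b) - math.log(c1_a / c1_b)

# Toy counts: identical cluster proportions give no shift at all.
print(log_odds_ratio((50, 50), (50, 50)))
# Toy counts: reads move from cluster 2 to cluster 1, giving a positive value.
print(log_odds_ratio((10, 90), (90, 10)))
```

A value near zero means the two conditions split their reads between the clusters in the same proportions; the further from zero, the larger the shift, regardless of how far apart the clusters are on the current/dwell time plane.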

Now, for your last question about which parameters to use: I don't think there's an absolute answer, and this likely depends on the modification under investigation and the specific experimental setting. This is essentially why we decided to implement multiple tests in nanocompore. Having said this, in our experience with small METTL3 knock-down experiments with 2 or 3 replicates we obtained the best results with the GMM_logit method. The non-parametric tests on current or dwell time gave too many false positive hits, whereas with the GMM_anova method we didn't have enough statistical power with so few replicates. I hope this answers your question, and if not I'm happy to provide more information!

Huanle commented 4 years ago

Hi @tleonardi ,

Thanks very much for your explanation. One more question concerning the output: below is a snapshot of 'out_nanocompore_results.tsv'. As you can see, the chr, genomicPos and strand columns contain no information (only NA). I also had a quick look and found that the values in the pos column are actually the start index of the ref_kmer. Do you know why? Thanks a lot.

pos     chr     genomicPos      ref_id  strand  ref_kmer
1138    NA      NA      Contig1     NA      GGCGA
1139    NA      NA      Contig1     NA      GCGAT
1141    NA      NA      Contig1     NA      GATGC

a-slide commented 4 years ago

Hi @Huanle, these values are only populated if you run sampcomp with the --bed option, which allows Nanocompore to convert the transcript coordinates into genomic space.

--bed = BED file with annotation of transcriptome used for mapping (optional)

Cheers

tleonardi commented 4 years ago

This has probably happened because you didn't provide a bed file with the --bed argument when running Nanocompore. This argument refers to a bed file with the annotation of the transcriptome (i.e. the transcriptome used for mapping) in genomic coordinates. Although this file is not required to run Nanocompore, it is needed to convert coordinates from transcriptome-space to genome-space.
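To show why the annotation is needed for this conversion, here is a minimal sketch of mapping a transcript position to a genomic position using the exon blocks of a BED12 record. It is an illustration only, not Nanocompore's implementation: the function name and the toy record are hypothetical, and it assumes a 12-column BED line and 0-based transcript positions.

```python
def transcript_to_genome(bed12_line, tx_pos):
    """Map a 0-based transcript position to (chrom, 0-based genomic pos, strand)."""
    fields = bed12_line.rstrip("\n").split("\t")
    chrom, chrom_start, strand = fields[0], int(fields[1]), fields[5]
    # blockSizes and blockStarts are comma-separated, often with a trailing comma.
    sizes = [int(s) for s in fields[10].split(",") if s]
    starts = [int(s) for s in fields[11].split(",") if s]
    total = sum(sizes)
    if not 0 <= tx_pos < total:
        raise ValueError(f"position {tx_pos} is outside a transcript of length {total}")
    # On the minus strand the transcript runs right to left on the genome,
    # so count the offset from the other end of the exon chain.
    offset = tx_pos if strand == "+" else total - 1 - tx_pos
    for size, start in zip(sizes, starts):
        if offset < size:
            # Block starts are relative to chromStart.
            return chrom, chrom_start + start + offset, strand
        offset -= size

# Toy record: two exons, chr1:1000-1100 and chr1:1300-1500 (made-up coordinates).
toy_bed = "chr1\t1000\t1500\ttx1\t0\t+\t1000\t1500\t0\t2\t100,200,\t0,300,"
print(transcript_to_genome(toy_bed, 0))
print(transcript_to_genome(toy_bed, 100))
```

Without the block sizes and starts there is no way to know where the introns fall, which is why chr, genomicPos and strand stay empty when no bed file is given.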

Huanle commented 4 years ago

Hi @tleonardi @a-slide ,

Thanks heaps for your prompt answers, which definitely cleared up my confusion.

tleonardi / nanocompore

Interpreting results from ‘nanocompore sampcomp’ #123