Closed Huanle closed 4 years ago
Hi @Huanle, here's a description of each column (sorry, this should be in the docs)
pos
: kmer position in transcript coordinates
chr
: reference chromosome
genomicPos
: kmer position in genome coordinates (relative to the BED file provided on the command line)
ref_id
: transcript id
strand
: genomic strand of the transcript
ref_kmer
: kmer sequence
The columns that follow report the p-values for the tests that you specified on the command line. These can be:
For more information on the statistics, see section "5.3.2 Statistical analysis" of our bioRxiv preprint. After each _pvalue column there's also a _context_X column (where X is the number specified with the --sequence_context option) that reports the p-value obtained after aggregating the p-values of the +/-X neighboring kmers with Hou's method (see "5.3.2 Statistical analysis" for details).
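As a rough illustration of working with these columns, here is a minimal Python sketch that picks out the p-value columns by their suffix and lists the kmers below a threshold. The column name `GMM_logit_pvalue`, the exact `_context_2` suffix, the threshold of 0.01 and all p-values in the toy table are assumptions made for the example, not values taken from a real run; check the header of your own results file.

```python
import csv
import io
import math


def find_pvalue_columns(header, context=None):
    """Return columns ending in _pvalue, plus _context_X ones if context is given."""
    cols = [c for c in header if c.endswith("_pvalue")]
    if context is not None:
        cols += [c for c in header if c.endswith(f"_context_{context}")]
    return cols


def to_float(value):
    """Convert a cell to float, returning None for NA/nan/unparsable values."""
    try:
        x = float(value)
    except ValueError:
        return None
    return None if math.isnan(x) else x


def top_hits(handle, column, threshold=0.01):
    """Return (ref_id, pos, ref_kmer, p) tuples with p <= threshold, sorted by p."""
    reader = csv.DictReader(handle, delimiter="\t")
    hits = []
    for row in reader:
        p = to_float(row[column])
        if p is not None and p <= threshold:
            hits.append((row["ref_id"], int(row["pos"]), row["ref_kmer"], p))
    return sorted(hits, key=lambda h: h[3])


# Toy table: the first six columns follow the thread, the p-values are made up.
example = (
    "pos\tchr\tgenomicPos\tref_id\tstrand\tref_kmer\t"
    "GMM_logit_pvalue\tGMM_logit_pvalue_context_2\n"
    "1138\tNA\tNA\tContig1\tNA\tGGCGA\t0.5\t0.4\n"
    "1139\tNA\tNA\tContig1\tNA\tGCGAT\t0.001\t0.003\n"
    "1141\tNA\tNA\tContig1\tNA\tGATGC\tnan\tnan\n"
)
header = example.splitlines()[0].split("\t")
print(find_pvalue_columns(header, context=2))
print(top_hits(io.StringIO(example), "GMM_logit_pvalue"))
```

For a real file, replace `io.StringIO(example)` with an open handle on the results TSV.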
If the GMM option was specified on the command line, you will also have these columns:
GMM_cov_type
: indicates the type of covariance used for the GMM. In the current version of nanocompore we only support type full (see here for more information)
GMM_n_clust
: number of components of the optimal (i.e. lowest BIC) GMM. This can be interpreted as the number of clusters found at this position (in the current version of nanocompore this can only be 1 or 2).
cluster_counts
: string reporting the number of reads counted in each cluster for each sample
Anova_delta_logit
: delta log odds ratio, i.e. the difference of the means of the log odds of data points belonging to cluster one between the two conditions. This can be interpreted as the magnitude of the shift of reads from one cluster to the other between the two conditions (not to be confused with the distance between the two clusters on the current/dwell time plane)
Logit_LOR
: same as above, but for the logit method, i.e. merging replicates rather than averaging them.

Now, for your last question about which parameters to use: I don't think there's an absolute answer, and this likely depends on the modification under investigation and the specific experimental setting. This is essentially why we decided to implement multiple tests in nanocompore. Having said this, in our experience with small METTL3 knock-down experiments with 2 or 3 replicates we obtained the best results with the GMM_logit method. The non-parametric tests on current or dwell time gave too many false positive hits, whereas with the GMM_anova method we didn't have enough statistical power with so few replicates. I hope this answers your question, and if not I'm happy to provide more information!
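To make the difference between averaging and merging replicates concrete, here is a small Python sketch of the two log odds calculations described for Anova_delta_logit and Logit_LOR. It is only an illustration of the idea, not Nanocompore's implementation: the function names, the 0.5 pseudocount, the sign convention (condition B minus condition A) and the read counts are all assumptions made for the example.

```python
import math


def log_odds(n_cluster1, n_cluster2, pseudocount=0.5):
    """Log odds of a read belonging to cluster one (pseudocount is an assumption)."""
    return math.log((n_cluster1 + pseudocount) / (n_cluster2 + pseudocount))


def delta_logit_averaged(cond_a, cond_b):
    """Average the per-replicate log odds in each condition, then take the difference."""
    mean_a = sum(log_odds(c1, c2) for c1, c2 in cond_a) / len(cond_a)
    mean_b = sum(log_odds(c1, c2) for c1, c2 in cond_b) / len(cond_b)
    return mean_b - mean_a


def log_odds_ratio_merged(cond_a, cond_b):
    """Merge the replicates of each condition first, then compare the log odds."""
    def merge(replicates):
        return (sum(c1 for c1, _ in replicates), sum(c2 for _, c2 in replicates))

    return log_odds(*merge(cond_b)) - log_odds(*merge(cond_a))


# Hypothetical (cluster one, cluster two) read counts, one tuple per replicate.
condition_a = [(10, 90), (20, 80)]
condition_b = [(50, 50), (60, 40)]

print(delta_logit_averaged(condition_a, condition_b))
print(log_odds_ratio_merged(condition_a, condition_b))
```

In both cases a value far from zero means that a large fraction of reads moved from one cluster to the other between conditions, whatever the distance between the clusters themselves.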
Hi @tleonardi ,
Thanks very much for your explanation. One more question concerning the output: below is a snapshot of 'out_nanocompore_results.tsv'. As you can see, the columns chr, genomicPos and strand contain no relevant information. I had a quick look and found that the values in the pos column are actually the start index of the ref_kmer. Do you know why? Thanks a lot.
pos chr genomicPos ref_id strand ref_kmer
1138 NA NA Contig1 NA GGCGA
1139 NA NA Contig1 NA GCGAT
1141 NA NA Contig1 NA GATGC
Hi @Huanle,
These values are only populated if you run sampcomp with the --bed option, which allows Nanocompore to transform the transcript coordinates into genomic space.
--bed = BED file with annotation of transcriptome used for mapping (optional)
Cheers
This has probably happened because you didn't provide a BED file with the --bed argument when running Nanocompore. This argument refers to a BED file with the annotation of the transcriptome (i.e. the transcriptome used for mapping) in genomic coordinates. Although this file is not required to run Nanocompore, it is needed to convert coordinates from transcriptome-space to genome-space.
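A quick way to confirm that the genomic columns are simply unpopulated (rather than partly filled) is to scan them for anything other than NA. The sketch below uses the three rows from the snapshot in this thread and assumes the file is tab-separated and that missing values are written as `NA`; the function name is hypothetical.

```python
import csv
import io

GENOMIC_COLUMNS = ("chr", "genomicPos", "strand")
MISSING = {"NA", "", "nan"}  # assumed spellings of a missing value


def genomic_columns_populated(handle):
    """Map each genomic column to True if any row holds a non-missing value."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    return {
        col: any(row[col] not in MISSING for row in rows)
        for col in GENOMIC_COLUMNS
    }


# The snapshot from the thread, rewritten as tab-separated text.
snapshot = (
    "pos\tchr\tgenomicPos\tref_id\tstrand\tref_kmer\n"
    "1138\tNA\tNA\tContig1\tNA\tGGCGA\n"
    "1139\tNA\tNA\tContig1\tNA\tGCGAT\n"
    "1141\tNA\tNA\tContig1\tNA\tGATGC\n"
)
status = genomic_columns_populated(io.StringIO(snapshot))
print(status)
```

If every value is False, the run was most likely done without --bed, as described above.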
Hi @tleonardi @a-slide ,
Thanks heaps for your prompt answers, which definitely cleared up my confusion.
Hi @tleonardi ,
Thanks heaps for developing nanocompore, and thanks a lot in advance for your help. My question is about interpreting the results from nanocompore sampcomp. Can you detail the meaning of all the columns? Which parameter can I rely on to choose the most likely modified candidates? Thanks again.