Closed malcook closed 2 years ago
Additionally, can you please advise how Genrich handles experimental pileup values of zero, given that log normal distribution is not defined at zero?
Thanks for the questions. As you noted, the current calculation of σ is a function of μ, and therefore it varies from position to position in the genome as the control/background pileup value changes. This was derived by analyzing pileup value distributions from multiple control samples. It seems that you are suggesting using a single fixed value for σ, but this is inconsistent with what I have observed.
Experimental pileup values of 0 lead to p-values of 1. By definition, if there are no reads, there cannot be any enrichment.
However, since my example is of ATAC-seq, there is no control and therefore μ does not vary by position (being the global background) and so neither does σ vary by position.
Right?
Yes, if there is no control, the "control/background pileup value" is just the background pileup value.
p-value calculation documents that Genrich uses a heuristic to estimate σ.
Genrich could potentially compute an empirical sample σ from the data at hand.
Could this potentially be made a new Genrich option? (e.g.
-S compute p-values using sample σ rather than a function of μ (takes longer)
)FWIW: I have an example where the heuristic evaluates to 1.665195 which overestimates the empirical sample σ of 1.185541.
On a similar note, might it make sense, after deriving μ and σ (by either current or proposed method) to further exclude from consideration loci where all sample's pileup value fail to exceed some user-supplied threshold? This might dramatically alter the adjusted p-values. Or might such be considered 'p-hacking'?