jsh58 / Genrich

Detecting sites of genomic enrichment
MIT License
182 stars 27 forks source link

on determining σ for p-value calculation #88

Closed malcook closed 2 years ago

malcook commented 2 years ago

p-value calculation documents that Genrich uses a heuristic to estimate σ.

Genrich could potentially compute an empirical sample σ from the data at hand.

Could this potentially be made a new Genrich option? (e.g. -S compute p-values using sample σ rather than a function of μ (takes longer))

FWIW: I have an example where the heuristic evaluates to 1.665195 which overestimates the empirical sample σ of 1.185541.

On a similar note, might it make sense, after deriving μ and σ (by either current or proposed method) to further exclude from consideration loci where all sample's pileup value fail to exceed some user-supplied threshold? This might dramatically alter the adjusted p-values. Or might such be considered 'p-hacking'?

malcook commented 2 years ago

Additionally, can you please advise how Genrich handles experimental pileup values of zero, given that log normal distribution is not defined at zero?

jsh58 commented 2 years ago

Thanks for the questions. As you noted, the current calculation of σ is a function of μ, and therefore it varies from position to position in the genome as the control/background pileup value changes. This was derived by analyzing pileup value distributions from multiple control samples. It seems that you are suggesting using a single fixed value for σ, but this is inconsistent with what I have observed.

Experimental pileup values of 0 lead to p-values of 1. By definition, if there are no reads, there cannot be any enrichment.

malcook commented 2 years ago

However, since my example is of ATAC-seq, there is no control and therefore μ does not vary by position (being the global background) and so neither does σ vary by position.

Right?

jsh58 commented 2 years ago

Yes, if there is no control, the "control/background pileup value" is just the background pileup value.