jsh58 / Genrich

Detecting sites of genomic enrichment
MIT License
182 stars 27 forks source link

Calculation of the p-value for a comprehensive calibration of minimum AUC #75

Closed lilian18100 closed 1 year ago

lilian18100 commented 3 years ago

Hello @jsh58 ,

Thank you for the great tool,

I have two samples treated with active molecules and two samples non-treated, being run separately and without control. Average fragment size = 121 bp (similar in all the samples) Average background pile-up = 4.51 (similar in all the samples)

In order to calibrate the minimum AUC to my sample, I am trying to figure out which value of experimental pile-up would be significant, with the default threshold (p-value < 0.01).

I was trying to reproduce its calculation from the source code available here

However, I did not succeed to reproduce your estimation from #11 :

( " As a simple example, the -log(p) value for an experimental pileup value of 15 and a control pileup value of 5 is 1.29, but it increases to 1.38 with pileups of 30 and 10 (twice the depth). "

My thoughts to calibrate the minimum AUC for my sample would be to make sure that half the nucleotides (or a third) of a selected site for peak calling would be significantly enriched. As that means for a nucleotide that -log(p) > 2, I thought then that setting -a 150 would make sense. Would it be a correct interpretation? Are only above threshold values added in the AUC calculation of a region?

Another approach would be to calibrate the p-value threshold, in order to select the significativity of a region to be significatively enriched from 2 or 3 times the background pileup value. (In homology to the microarray noise reduction consisting of taking three times the basal concentration and subtract it everywhere else as security toward false discovery). What do you think about it? Would it make sense to make an approach like thisthis?

My R code intent to calculate the p-value from a log-normal distribution :


MLN_10 =2.30258509299404568402

mu = 5
sd = 1.2*mu

meanlog = log(mu) - log(sqrt(sd))
sdlog = sqrt(log(sd))

-pnorm(((log(15)-meanlog)/sdlog), log.p = TRUE, lower.tail=FALSE) / MLN_10

mu = 10
sd = 12
meanlog = log(mu / sqrt(sd + mu))
sdlog = sqrt(log1p(sd / mu))

-pnorm(((log(30)-meanlog)/sdlog), log.p = TRUE, lower.tail=FALSE) / MLN_10

command for Genrich : 

../../Genrich-master/Genrich  -t sample.bam  \
-o sample.narrowPeak -e chloroplast,mitochondrie,Chr00 -a 150 -r -j

Thank you for your great help.

jsh58 commented 3 years ago

The AUC calculation is described in the -a <float> section here; see also Figure 1 for a graphical representation.

For example, with -p0.01 (meaning -log(p)>2) and -a150, a peak would be called under any of these circumstances:

To calculate the p-value from a log-normal distribution in R, use plnorm().