How do you determine the value of parameters (p and s) of NB model?

fatyang799 commented 2 years ago

Excellent idea of S3norm!

But when I read the methods and code of s3norm, I do not understand the biological meaning of the parameters (p and s) of the NB model.

The p denotes the probability of success. In the sequencing data, what is success? Why do you assign (σ2-M)/σ2 to p? How to understand (σ2-M)/σ2 in sequencing data?

The s_local denotes a shape parameter. In NB model, the s should be the number of successful events. Again, in the sequencing data, what is a successful event? Why do you assign M2/(σ2-M) to p? How to understand M2/(σ2-M) in sequencing data?

guanjue commented 2 years ago

This is just two set of the NB notation. The intuitive interpretation in sequencing read is not very clear (at least to me).

For your question about the formula: For NB model, the M (mean) and σ2 (variance) is defined as (wiki page of NB model is really good) M = ps / (1-p) σ2 = ps / (1-p)^2 So, you can easily get (1-p) = M/σ2 -> p = (σ2-M)/σ2 s = M * (1-p) / p -> s = M2/(σ2-M)

Because greater s can give you less significant p-value, we use it adjust the -log10pvalue by local input signal: s_local = s * ri_input / mean_r_input

Best wishes. Guanjue

fatyang799 commented 2 years ago

Thanks for the reply!

I have known that the p and s are just parameters for NB model. And these are just mathematic formula to calculate parameters:

M = ps / (1-p) σ2 = ps / (1-p)^2

In addition, I have thought about this problem for 2 days. I'd like to share my thoughts with you, and I hope you can provide some comment about my idea. many thx.

In my opinion, parameters in r_i ∼ NB (s, p) are:

I regarded p as the probability of a read not falling into a bin. And the p is same for all bins.
I regarded s as the number of reads not falling into a bin.
Then the NB model describe the distribution of the number of reads falling into a bin. According to the NB model, we can calculate the probability of every signal value in a bedgraph file (s3norm normalized count).

But here, I still don't understand why the p0=p^s [In your paper, this corresponds to p0=(1-p)^s]

Btw, I notice that the definition of p in the paper is different from the p in the script (negative_binomial_neglog10p.R). And I adopt your idea in the script where the definition of p is the probability of fail.

Best,

Liu

fatyang799 commented 2 years ago

Hi Guanjue,

I found another problem. In the paper, the p-value seems to be calculated from the NB background model. I thought you would calculate like this:

f(X=N)=pnbinom(N-1, sig_0_size, sig_0_prob, lower.tail=FALSE)

But in the script file (negative_binomial_neglog10p.R), the p-value derives from the ratio of 2 probability:

pval_new = pnbinom(N-1, sig_0_size, sig_0_prob, lower.tail=FALSE) / pnbinom(0, sig_0_size, sig_0_prob, lower.tail=FALSE) * (l-num_0)/l

According to my understanding:

pnbinom(N-1, sig_0_size, sig_0_prob, lower.tail=FALSE) / pnbinom(0, sig_0_size, sig_0_prob, lower.tail=FALSE): this represents the ratio of the probability that the count equals N to the probability that the count equals 1.
(l-num_0)/l: It denotes the proportion of non-zero bins.

Are you calculating the pval_new to make all P value comparable? Because the P value in each bin was corrected with various ri_input / mean_r_input.

All of the above are my guesses. And I really want to know why you choose pval_new instead of f(X=N) as output value.

Best,

Liu

guanjue commented 2 years ago

For "pval_new = pnbinom(N-1, sig_0_size, sig_0_prob, lower.tail=FALSE) / pnbinom(0, sig_0_size, sig_0_prob, lower.tail=FALSE) * (l-num_0)/l ":

Here, since our NB model parameter is estimated only by using the non0-bins, we adjusted model output pvalue by ( observed_non0_bin_pvalue / expected_non0_bin_pvalue ). (It can be considered as a prior probability for all non-0 bins) The "expected_non0_bin_pvalue" = pnbinom(0, sig_0_size, sig_0_prob, lower.tail=FALSE) The "observed_non0_bin_pvalue" = (l-num_0)/l; l is the number of all bins, num_0 is the number of 0 bins.

fatyang799 commented 2 years ago

Really great talking with you and appreciate all your help! This makes me have a deeper understanding of s3norm. If I need anything else, I'll start a new issue.

fatyang799 commented 2 years ago

Hi Guanjue,

I got confused about the output of s3norm. I would greatly appreciate if you can answer these questions!

Why do you use p-values as signals for downstream analysis? What is your rationale/assumption for this? (Intuitively, I might use normalized_RC data to do downstream analysis, such as genome segmentation.)
Why can you re-run s3norm for the p-values. I guess you treat the p-values as signals. Again, this goes back to the first question, What is your rationale/assumption for this? Best,

Liu

guanjue commented 2 years ago

Our goal of transforming normalized_rc to -log10pvalue is to adjust the local background signal along the genome in chip-seq data. However, after this transformation, the signal to noise ratio is not the same across different samples, we decided to treat -log10-p as signals and normalized again.

Best wishes. Guanjue

guanjue / S3norm

How do you determine the value of parameters (p and s) of NB model? #3