guanjue / S3norm

An normalization method that can normalize high signals without inflate the background noise.
MIT License
7 stars 3 forks source link

How do you determine the value of parameters (p and s) of NB model? #3

Closed fatyang799 closed 1 year ago

fatyang799 commented 2 years ago

Excellent idea of S3norm!

But when I read the methods and code of s3norm, I do not understand the biological meaning of the parameters (p and s) of the NB model.

The p denotes the probability of success. In the sequencing data, what is success? Why do you assign (σ2-M)/σ2 to p? How to understand (σ2-M)/σ2 in sequencing data?

The slocal denotes a shape parameter. In NB model, the s should be the number of successful events. Again, in the sequencing data, what is a successful event? Why do you assign M2/(σ2-M) to p? How to understand M2/(σ2-M) in sequencing data?

guanjue commented 2 years ago

This is just two set of the NB notation. The intuitive interpretation in sequencing read is not very clear (at least to me).

For your question about the formula: For NB model, the M (mean) and σ2 (variance) is defined as (wiki page of NB model is really good) M = ps / (1-p) σ2 = ps / (1-p)^2 So, you can easily get (1-p) = M/σ2 -> p = (σ2-M)/σ2 s = M * (1-p) / p -> s = M2/(σ2-M)

Because greater s can give you less significant p-value, we use it adjust the -log10pvalue by local input signal: s_local = s * ri_input / mean_r_input

Best wishes. Guanjue

fatyang799 commented 2 years ago

Thanks for the reply!

I have known that the p and s are just parameters for NB model. And these are just mathematic formula to calculate parameters:

M = ps / (1-p) σ2 = ps / (1-p)^2

In addition, I have thought about this problem for 2 days. I'd like to share my thoughts with you, and I hope you can provide some comment about my idea. many thx.

In my opinion, parameters in r_i ∼ NB (s, p) are:

But here, I still don't understand why the p0=ps [In your paper, this corresponds to p0=(1-p)s]

Btw, I notice that the definition of p in the paper is different from the p in the script (negative_binomial_neglog10p.R). And I adopt your idea in the script where the definition of p is the probability of fail.

Best,

Liu

fatyang799 commented 2 years ago

Hi Guanjue,

I found another problem. In the paper, the p-value seems to be calculated from the NB background model. I thought you would calculate like this:

f(X=N)=pnbinom(N-1, sig_0_size, sig_0_prob, lower.tail=FALSE)

But in the script file (negative_binomial_neglog10p.R), the p-value derives from the ratio of 2 probability:

pval_new = pnbinom(N-1, sig_0_size, sig_0_prob, lower.tail=FALSE) / pnbinom(0, sig_0_size, sig_0_prob, lower.tail=FALSE) * (l-num_0)/l

According to my understanding:

Are you calculating the pval_new to make all P value comparable? Because the P value in each bin was corrected with various ri_input / mean_r_input.

All of the above are my guesses. And I really want to know why you choose pval_new instead of f(X=N) as output value.

Best,

Liu

guanjue commented 2 years ago

For "pval_new = pnbinom(N-1, sig_0_size, sig_0_prob, lower.tail=FALSE) / pnbinom(0, sig_0_size, sig_0_prob, lower.tail=FALSE) * (l-num_0)/l ":

Here, since our NB model parameter is estimated only by using the non0-bins, we adjusted model output pvalue by ( observed_non0_bin_pvalue / expected_non0_bin_pvalue ). (It can be considered as a prior probability for all non-0 bins) The "expected_non0_bin_pvalue" = pnbinom(0, sig_0_size, sig_0_prob, lower.tail=FALSE) The "observed_non0_bin_pvalue" = (l-num_0)/l; l is the number of all bins, num_0 is the number of 0 bins.

fatyang799 commented 2 years ago

Really great talking with you and appreciate all your help! This makes me have a deeper understanding of s3norm. If I need anything else, I'll start a new issue.

fatyang799 commented 2 years ago

Hi Guanjue,

I got confused about the output of s3norm. I would greatly appreciate if you can answer these questions!

Liu

guanjue commented 2 years ago

Our goal of transforming normalized_rc to -log10pvalue is to adjust the local background signal along the genome in chip-seq data. However, after this transformation, the signal to noise ratio is not the same across different samples, we decided to treat -log10-p as signals and normalized again.

Best wishes. Guanjue