fatyang799 commented 9 months ago

Hi Guanjue,

I hope this message finds you well. I recently had the opportunity to read your papers again, including:

I also took the time to work with the source code. And I encountered some difficulties and observed a few behaviors that were not entirely clear based on the papers. I have spent considerable time reviewing the source code to ensure that my understanding aligns with the intended functionality. However, I am left with a few uncertainties that I am hoping you might help clarify.

I don't understand why you fit the data after performing the S3norm to a negative binomial distribution and use the -log10Pvalue as the signal value. Also, as I read your source code, there are some things I don't quite understand.

global_nbp_NB_cm.ok.R:

line1 | get_true_NB_prob_size = function(x, siglim){
line2 |     x_pass_lim = x[x>=siglim]
line3 |     m=mean(x_pass_lim)
line4 |     v=var(x_pass_lim)
line5 |     print('old m & v')
line6 |     print(c(m, v))
line7 |     passlim_num = sum(x>=siglim)
line8 |     for (i in 1:1){
line9 |         if (v<m){v=m+0.1}
line10|         m_pre = m
line11|         v_pre = v
line12|         p = m/v
line13|         s = m^2/(v-m)
line14|         exp_siglim_p = dnbinom(0:siglim, s, p)
line15|         exp_total_num = passlim_num/(1-sum(exp_siglim_p))
line16|         ### get new mean
line17|         siglim_n_sum_for_m = 0
line18|         for (j in 0:siglim){
line19|             siglim_n_sum_for_m = siglim_n_sum_for_m + j* exp_siglim_p[j+1]*exp_total_num
line20|         }
line21|         m = (siglim_n_sum_for_m + m * passlim_num) / exp_total_num
line22| 
line23|         ### get new var
line24|         siglim_n_sum_for_v = 0
line25|         for (j in 0:siglim){
line26|             siglim_n_sum_for_v = siglim_n_sum_for_v + (j-m)^2 * exp_siglim_p[j+1]*exp_total_num
line27|         }
line28|         v = (siglim_n_sum_for_v + sum((x_pass_lim-m)^2)) / (exp_total_num-1)
line29|         if (v<m){v=m+0.1}
line30|         print(c(i,m,v))
line31|         if(abs(m-m_pre)<0.001 & abs(v-v_pre)<0.00001) {break}
line32|     }
line33|     p = m/v
line34|     s = m^2/(v-m)
line35|     print('new m & v')
line36|     print(c(m, v))
line37|     return(c(p,s))
line38| }
line39| }

line8: for (i in 1:1), should this be for (i in 1:100)?
line12 and line33: p = m/v, according to the formula in the articles, calculating the p-value should be: p = (v-m)/v? (I am not sure about this)
line14-line28: I don't understand how to calculate the parameters of NB distribution. (..@_@|||||..)

I am aware that these issues may arise from my own misinterpretation, and I am eager to learn from your expertise. Could you please provide some guidance on these matters? I would greatly appreciate your insights.

Thank you for considering my inquiry. I look forward to your response and am happy to provide any further details should you require them.

guanjue commented 9 months ago

Here, given some epigenomic datasets have inflated zeros, we decided to use non-zero signals to estimate the NB distribution.

Q1, I don't understand why you fit the data after performing the S3norm to a negative binomial distribution and use the -log10Pvalue as the signal value. This is mainly for the downstream IDEAS genome segmentation analysis, in practice, we found converting reads count to -log10-p-value can generate the epigenetic states profile more similar to what people expect to see.

Q2, [line12 and line33: p = m/v, according to the formula in the articles, calculating the p-value should be: p = (v-m)/v? (I am not sure about this)](line8: for (i in 1:1), should this be for (i in 1:100)?) If observed non-zero signals indeed follow the NB distribution, theoretically, this function should converge to a more reliable NB distribution. However, in practice, the estimation in practice is not very stable. In some individual datasets, after a few rounds, the NB mean and sd may become too large or too small for converting observed read counts to -log10-p-value. Therefore, we decided to only estimate NB 1 time for each individual -log10pvalue calculation.

Q3, line12 and line33: p = m/v, according to the formula in the articles, calculating the p-value should be: p = (v-m)/v? (I am not sure about this) The line12 and line33: p = m/v, according to the formula in the articles, calculating the p-value should be: p = (v-m)/v? (I am not sure about this) Since we are using the dnbinom function to get the density function, we used their notation (https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/NegBinomial) in this script. In this link, the dnbinom defines the m = n(1-p) and v = n(1-p) / p^2, thus n = m/v, p = m^2/(v-m)

Q4, line14-line28: I don't understand how to calculate the parameters of NB distribution. (..@_@|||||..) 1, exp_siglim_p is the expected probability of non-zero bins based on initial NB distribution with initial m and v: p = m/v and s = m^2/(v-m) 2, passlim_num is the osberved number of non-zero bins. 3, Since NB distribution is a discrete distribution, then you can estimate the expected total number of bins (zeron-bins + non-zero bins) based on this formula: exp_total_num = passlim_num/(1-sum(exp_siglim_p)) 4, Then, to calculate the new mean based on the initial NB distribution, at each count data point (include zero), we use the for loop to first get the sum: siglim_n_sum_for_m = 0 for (j in 0:siglim){

Based on the initial NB distribution, for all data points equal to count j, their sum should be j exp_siglim_p[j+1]exp_total_num

siglim_n_sum_for_m = siglim_n_sum_for_m + j* exp_siglim_p[j+1]*exp_total_num

} 5, Then we get the new mean: based on (siglim_n_sum_for_m + m passlim_num) / exp_total_num. If I remember correctly, here, we added "m passlim_num" because in practice this method's estimated NB distribution is not sufficient to reduce the background noise in some datasets which will create epigenetic states that are not biologically meaningful in the downstream analysis, so we decided to increase the estimate mean for estimated NB model. Adding this term can double the estimated mean. 6, A similar calculation is also done for variance.

fatyang799 commented 9 months ago

Hi @guanjue ,

I hope this message finds you well. First and foremost, I would like to express my gratitude for your prompt and thorough response to my questions. I truly appreciate the time and effort you took to provide such a detailed explanation.

I have gained a deeper understanding of the computational principles behind the negative binomial distribution after studying the materials you provided and I am now fully able to comprehend your calculation process.

While I have a better understanding of the calculation process, I still have a lingering question and concern outside of the calculation (in terms of biological significance). I would be grateful if you could provide further clarification. I understand that you may be busy, but your additional insights would be of great help to me and possibly other users as well.

Regarding the [Q1. I don't understand why you fit the data after performing the S3norm to a negative binomial distribution and use the -log10Pvalue as the signal value]:

Why doesn't the normalized data from S3V2 match the expectations of users? What data distribution should people expect to see?

Besides this, I have some doubts about the output results of the software. I know that there are 2 main output directories:

${outprefix}_RC/: the S3V2 normalized data (${cell}_${id}.${mk}.S3V2.bedgraph) in bigwig format.
${outprefix}_NBP/: the -log10Pvalue data obtained by fitting a negative binomial distribution to S3V2 normalized data (${cell}_${id}.${mk}.S3V2.bedgraph.NBP.bedgraph)

In practical applications, what are the uses of the S3V2 normalized data (${cell}_${id}.${mk}.S3V2.bedgraph) and the -log10Pvalue data obtained by fitting a negative binomial distribution to these data? Specifically, I know that -log10P data is applicable to IDEAS analysis. However, if you perform differential signal analysis or calculate log2FC, can you still use it? Is the S3V2 normalized data (${cell}_${id}.${mk}.S3V2.bedgraph) more suitable?

guanjue commented 9 months ago

(1) For sequencing data people usually expect to fit a negative binomial distribution.

(2) .S3V2.bedgraph files are just saved in the output fold for the record. If it eats too much space, feel free to delete them We didn't systematically evaluate the performance of S3V2 normalized data in differential analysis. It may create some unexpected signals due to the non-linear transformation which may not fit well with the background model of differential analysis methods.

Best wishes. Guanjue

fatyang799 commented 9 months ago

Thank you for your patience and quick response!

guanjue / S3V2_IDEAS_ESMP

Inquiry Regarding NB distribution and Associated Code #11

Based on the initial NB distribution, for all data points equal to count j, their sum should be j exp_siglim_p[j+1]exp_total_num