kordk / torch-ecpg

(GPU accelerated) eCpG mapper
BSD 3-Clause "New" or "Revised" License

MLR: Not able to reproduce estimates & p-values in R #26

Closed. rituroy closed this issue 1 year ago.

rituroy commented 1 year ago

Used "dummy" data consisting of 300 methylation loci, 300 expression genes and 100 samples. Ran linear regression for each methylation - gene expression pair with the following model: expression ~ methlylation + age +sex Seems like "estimate" being returned is from the "intercept" term, "standard error" & "t_statistics" are from the "methylation" term. P-value does not match that of "methylation" term.

Output from tecpg:

```
  gt_site  mt_site  mt_est    mt_err     mt_t       mt_p
1 ILMN_001 cg001    25.55615  10.433939  1.8765793  0.06933532
```

Output from R:

```
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  25.5561    11.5837   2.206   0.0298 *
meth         19.5802    10.4339   1.877   0.0636 .
age           0.2658     0.2071   1.284   0.2024
sex           5.4035     5.7596   0.938   0.3505

Residual standard error: 28.67 on 96 degrees of freedom
Multiple R-squared: 0.05588, Adjusted R-squared: 0.02638
F-statistic: 1.894 on 3 and 96 DF, p-value: 0.1357
```

[Image: scatterPlot_stats_dummy_sameScale]

liamgd commented 1 year ago

Estimate mixup fixed in 009a1ad.

As for the p-values: assuming the regression in R is accurate, there are a few places in regression_single.py where the p-values could deviate. The t statistics seem to match, so it is a problem with the p-value calculation. The p-value is calculated using torch.distributions.studentT.StudentT(df).log_prob. The degrees of freedom also seem to match; with dummy data of the same size, I got 96 degrees of freedom according to:

```
[INFO] Using CPU with 1 threads
[INFO] Running full regression in expression only mode with "all" region filtration.
[INFOTIMER] Converted C to tensor in 0.0005 seconds
[INFOTIMER] Created root oneX tensor in 0.0002 seconds
[INFOTIMER] Set up output dataframe
[INFO] Running with 96 degrees of freedom
[INFO] Calculating chunks...
```

The log_prob function is applied to the t statistics on line 217 of regression_single.py:

p_value = torch.exp(log_prob(t_stats))

Assuming the base of log_prob is e (which is confirmed by the source code), I do not see where the deviation could come from, aside from cumulative numerical inaccuracy in either R or torch.exp(log_prob(t_stats)).
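For reference, here is a minimal standalone reproduction of the reported tecpg value from the t statistic in the first comment (df and t hard-coded from that comment; not the exact tecpg code):

```python
import torch

# t statistic and degrees of freedom taken from the first comment
dist = torch.distributions.StudentT(df=96.0)
t_stats = torch.tensor(1.8765793)
p_value = torch.exp(dist.log_prob(t_stats))
print(p_value)  # ~0.0693, the mt_p value tecpg reports
```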

Is the tecpg p-value consistently and substantially higher than the R p-value for other gene expression and methylation loci pairs? Or, is it sometimes greater than and sometimes less than the R p-value?

rituroy commented 1 year ago

Is it a one-sided test?

liamgd commented 1 year ago

I could not find that information in the torch documentation but it appears to be two-sided.

Plugging the t value from https://github.com/kordk/torch-ecpg/issues/26#issue-1503895974, 1.8765793, into this calculator with 96 degrees of freedom yields 0.031812 for a one-sided test and 0.063624 for a two-sided test, the latter of which is very close to the p-values of both analyses.

kordk commented 1 year ago

It appears the p-value is for a one-sided test (see the 0.5 in the code below for the StudentT function). Can we convert the one-sided value to a two-sided one?

 p_two_sided = -2 * (p_one_sided - 1)

From https://pytorch.org/docs/stable/_modules/torch/distributions/studentT.html#StudentT

```python
def log_prob(self, value):
    if self._validate_args:
        self._validate_sample(value)
    y = (value - self.loc) / self.scale
    Z = (self.scale.log() +
         0.5 * self.df.log() +
         0.5 * math.log(math.pi) +
         torch.lgamma(0.5 * self.df) -
         torch.lgamma(0.5 * (self.df + 1.)))
    return -0.5 * (self.df + 1.) * torch.log1p(y**2. / self.df) - Z
```

liamgd commented 1 year ago

The current p-values from tecpg and R are quite close. From what I understand, applying the transformation from https://github.com/kordk/torch-ecpg/issues/26#issuecomment-1363357195:

 p_two_sided = -2 * (p_one_sided - 1)

to the allegedly one-sided p-value from the first comment computed by tecpg, 0.06933532, yields 1.8613294. This does not seem like a valid two-sided p-value.

Are you sure tecpg is outputting a one-sided value? The p-values from tecpg are already close to those output by R. With the provided formula, these p-values would be scaled beyond the valid p-value range, [0, 1].

According to the values in https://github.com/kordk/torch-ecpg/issues/26#issue-1503895974, the percent error between the tecpg t value and the R t value is 0.022413425%. The percent error between the p-values is less than 10%, which is significant but far too small to be a mix-up in the "sidedness" of the test.

kordk commented 1 year ago

I'm not confident the conversion formula I provided was correct. I do consider the R results as truth, so we will have to make adjustments to tecpg until we are comfortable that the p-values are reasonable. Adam or Ritu will need to provide us guidance.

kordk commented 1 year ago

We are comfortable with the small difference in the test statistic between R and tecpg (StudentT).

However, the p-values are not concordant with Ritu's implementation in R to an acceptable degree. Given what we've looked at so far, the simplest explanation for the discrepancy is a two-sided versus one-sided (left or right) test. The test statistic would remain the same; only the determination of p would change.

We need to settle on what p-value is being supplied by the StudentT so we can decide if we need to change how it is calculated or convert it ourselves.

It's up to Ritu and Adam to make this determination.

For example, here is a nice description of converting between a right- or left-tailed one-sided test and a two-tailed test: https://stats.stackexchange.com/questions/267192/doubling-or-halving-p-values-for-one-vs-two-tailed-tests
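In short, for a symmetric null distribution such as Student's t, the conversion is $p_{\text{two-sided}} = 2\,P(T > |t|) = 2 \min(p_{\text{left}}, p_{\text{right}})$.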

rituroy commented 1 year ago

These are the p-values in R:

```
For the two-sided hypotheses:
(Intercept)   methyl      age      sex
    0.02975  0.06361  0.20240  0.35050

For H1: beta < 0:
(Intercept)   methyl      age      sex
     0.9851   0.9682   0.8988   0.8247

For H1: beta > 0:
(Intercept)   methyl      age      sex
    0.01488  0.03181  0.10120  0.17530
```

kordk commented 1 year ago

From Ritu:

log_prob() returns the log of the probability density function. We need the cumulative distribution function (CDF) to calculate the p-value: https://pytorch.org/docs/stable/distributions.html

p-value for 2-sided test would be 2*cdf of the value

From Kord:

Based on Ritu's comments, the reference site above, and https://www.tensorflow.org/probability/api_docs/python/tfp/distributions/StudentT, the pseudocode may be this:

 p_value = torch.exp(2 * cdf(t_stats))

liamgd commented 1 year ago

Ok. I am glad you are confident about the issue. Can you clarify what the cdf function is in p_value = torch.exp(2 * cdf(t_stats))? I am assuming this is the Student's T CDF function, which torch does not have. Tensorflow does have this function, and it is defined in https://github.com/tensorflow/probability/blob/0759c57eb0306a5bb0fcc3d105bbcab9943092a5/tensorflow_probability/python/distributions/student_t.py#L236 as follows:

```python
def _stdtr_computation(df, t):
  """Compute cumulative distribution function of Student T distribution."""
  dtype = dtype_util.common_dtype([df, t], tf.float32)
  numpy_dtype = dtype_util.as_numpy_dtype(dtype)
  half = numpy_dtype(0.5)
  one = numpy_dtype(1.)
  df, t = [tf.convert_to_tensor(param, dtype=dtype) for param in (df, t)]
  broadcast_shape = functools.reduce(
      ps.broadcast_shape, [ps.shape(df), ps.shape(t)])
  df, t = [tf.broadcast_to(param, broadcast_shape) for param in (df, t)]

  # For moderate df and relatively small t**2, or in case of large df, we use
  # asymptotic expansion [1, 2] to compute stdtr(df, t). The condition to use
  # it was specified by experimentation for np.float32 and was taken from [2,
  # page 249] for np.float64.
  if numpy_dtype == np.float32:
    use_asymptotic_expansion = (
        (df >= 10.) & (tf.math.square(t) < (16. * df - 5.))) | (df > 30.)
  else:
    use_asymptotic_expansion = (
        (df >= 100.) & (tf.math.square(t) < (0.1 * df - 5.))) | (df > 1000.)

  result = _stdtr_asymptotic_expansion(df, t, numpy_dtype)

  # Otherwise, we evaluate stdtr(df, t) using the regularized incomplete beta
  # function [3, page 323, equation 6.14.10]:
  #   stdtr(df, t) =
  #       0.5 * betainc(0.5 * df, 0.5, df / (df + t**2)), when t < 0
  #       1. - 0.5 * betainc(0.5 * df, 0.5, df / (df + t**2)), when t >= 0
  # To avoid rounding error when t**2 is small compared to df, we compute the
  # ratio df / (df + t**2) in the logarithmic space. If ratio > 0.99, we then
  # use the following symmetry relation:
  #   betainc(a, b, x) = 1 - betainc(b, a, 1 - x) .
  raw_ratio = t * tf.math.rsqrt(df)
  ratio = tf.math.exp(-log1psquare(raw_ratio))
  one_minus_ratio = tf.math.exp(-log1psquare(tf.math.reciprocal(raw_ratio)))

  # The maximum value for the ratio was set by experimentation.
  use_symmetry_relation = (ratio > 0.99)
  half_df = half * df
  a = tf.where(use_symmetry_relation, half, half_df)
  b = tf.where(use_symmetry_relation, half_df, half)
  x = tf.where(use_symmetry_relation, one_minus_ratio, ratio)

  y = special.betainc(a, b, x)
  result_betainc = half * tf.where(use_symmetry_relation, one - y, y)

  # Handle the case (t >= 0).
  result_betainc = tf.where(t >= 0., one - result_betainc, result_betainc)

  result = tf.where(use_asymptotic_expansion, result, result_betainc)

  # Determine if df is out of range (should return NaN output).
  result = tf.where(df <= 0., numpy_dtype(np.nan), result)
  return result
```

Is this the right CDF function? If so, I will implement it in torch within p_value = torch.exp(2 * cdf(t_stats)). If you know of a different definition of the StudentT CDF, it would likely be easier to implement than the tensorflow variation.

kordk commented 1 year ago

Given we have still not found a torch library that provides a CDF for the StudentT distribution, Ritu and I discussed looking at other distributions to potentially estimate one (e.g., Chi2).

We found a discussion about adding distributions to Torch here: https://github.com/pytorch/pytorch/issues/52973

With one developer releasing extensions for univariate distributions: https://github.com/shchur/survival_distributions

Here is a description of other distributions currently available that we may (hopefully) find useful: https://deepmind.github.io/torch-distributions/#toc_32

Adam is away on holiday and we will discuss more with him next week.

liamgd commented 1 year ago

Ok. Let me know what your final decision is.

Also note that, in the end, the probability function will likely be a custom implementation in tecpg rather than a torch implementation. This is because the function can be optimized by computing certain constants before the main calculation (constant folding). This is what was done in #24, with massive performance improvements of around 120,739% faster total execution time. Therefore, a torch implementation of the function is not necessary; an implementation in tensorflow, or even just the matrix algebra, is enough to form a tecpg implementation. A sketch of the idea follows.
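For illustration, here is a minimal sketch of that constant-folding idea applied to the Student-t log density from the torch source quoted above (a hypothetical standalone example, not the actual #24 code; assumes loc=0 and scale=1):

```python
import math

import torch

def make_log_prob(df: float):
    # The normalization term Z depends only on df, so fold it into a
    # constant once instead of recomputing it for every t statistic.
    log_z = (
        0.5 * math.log(df)
        + 0.5 * math.log(math.pi)
        + math.lgamma(0.5 * df)
        - math.lgamma(0.5 * (df + 1.0))
    )

    def log_prob(t: torch.Tensor) -> torch.Tensor:
        return -0.5 * (df + 1.0) * torch.log1p(t ** 2 / df) - log_z

    return log_prob
```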

kordk commented 1 year ago

That is good to know. Can the cdf() function from scipy work?

On a related note, the cdf() function is available for a distribution in torch: https://pytorch.org/docs/stable/distributions.html

Given we are using the StudentT distribution in torch, will this cdf() work? https://pytorch.org/docs/stable/distributions.html#studentt

liamgd commented 1 year ago

The scipy CDF works with the form $2\times (1-\text{cdf}(t, \text{df}))$. With the t value and expected p-value from R for methylation in the first comment, this calculation results in a percent error of less than 0.1%. However, the source of the scipy function is not easily accessible from what I understand, so I do not know how the function is computed. It is probably done at a low level in C.
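For example, a quick standalone check with the values from the first comment:

```python
from scipy.stats import t as student_t

# t statistic and degrees of freedom from the first comment
t_stat = 1.8765793
df = 96
p_two_sided = 2 * (1 - student_t.cdf(t_stat, df))
print(p_two_sided)  # ~0.0636, matching R's Pr(>|t|) for the methylation term
```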

Many of the distributions available in torch's distributions module have CDFs. Unfortunately, the StudentT distribution does not yet have this function. There was discussion about this in 2019: https://github.com/pytorch/pytorch/issues/26334.

The Wikipedia page on the StudentT distribution (https://en.wikipedia.org/wiki/Student%27s_t-distribution) defines the CDF as a function of $x$: $\frac{1}{2} + x \Gamma \left( \frac{\nu+1}{2} \right)\times\frac{_2F_1 \left ( \frac{1}{2},\frac{\nu+1}{2};\frac{3}{2};-\frac{x^2}{\nu} \right)}{\sqrt{\pi\nu}\,\Gamma \left(\frac{\nu}{2}\right)}$

This would be easy to implement in torch aside from one issue: $_2F_1()$ is the "hypergeometric function", which is not yet a torch function.

Scipy has the hypergeometric function: https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.hyp2f1.html#scipy.special.hyp2f1. As with the scipy Student's T CDF, the C implementation is hard to access. However, the documentation helpfully includes the definition of the hypergeometric function: $\mathrm{hyp2f1}(a, b, c, z) = \sum_{n=0}^\infty \frac{(a)_n (b)_n}{(c)_n}\frac{z^n}{n!}$
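A naive torch sketch of that series as a partial sum (hypothetical; note the series only converges for $|z| < 1$, i.e. $t^2 < \nu$ for the CDF argument above, so a real implementation would need transformation formulas outside that region):

```python
import torch

def hyp2f1_series(a: float, b: float, c: float, z: torch.Tensor,
                  terms: int = 200) -> torch.Tensor:
    """Partial sum of the Gauss hypergeometric series; valid for |z| < 1."""
    result = torch.ones_like(z)
    term = torch.ones_like(z)
    for n in range(terms):
        # Multiply in the next Pochhammer factors and the next power of z.
        term = term * (a + n) * (b + n) / (c + n) * z / (n + 1)
        result = result + term
    return result
```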

I do not know if this definition would yield a reasonable speed when implemented in torch. One alternative would be to convert the t tensor to a numpy array and use scipy to compute the p-values. This would require moving from CUDA memory to CPU memory, which is probably exceedingly slow.

liamgd commented 1 year ago

After further research:

https://github.com/pytorch/pytorch/issues/50345 tracks the addition of various special functions, including the hypergeometric function and the incomplete beta integral, which would also aid in writing a torch Student's T CDF via the relation $1 - \text{stdtr}(k,t) = 0.5 \cdot \text{incbet}(k/2, 1/2, z)$ for $t > 0$, where $z = k/(k+t^2)$ (https://github.com/scipy/scipy/blob/main/scipy/special/cephes/stdtr.c).
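A quick scipy check of that relation, using the t statistic and degrees of freedom from the first comment:

```python
from scipy.special import betainc
from scipy.stats import t as student_t

k, t_val = 96, 1.8765793            # degrees of freedom and t statistic
z = k / (k + t_val ** 2)
lhs = 1 - student_t.cdf(t_val, k)   # right-tail probability, 1 - stdtr(k, t)
rhs = 0.5 * betainc(k / 2, 0.5, z)  # regularized incomplete beta
print(lhs, rhs)                     # the two agree for t > 0
```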

Scipy, among other packages, uses cephes, a mathematical C library, for many functions, including the Student's T CDF, the hypergeometric function, and the incomplete beta integral (the links go to cephes functions inside scipy.special: https://github.com/scipy/scipy/tree/main/scipy/special/cephes).

Developers are working on a CDF for the Student's T distribution (https://github.com/pytorch/pytorch/issues/26334) and the incomplete beta function (https://github.com/pytorch/pytorch/pull/58700) in torch. Neither of these is complete, from my understanding. This comment, https://github.com/pytorch/pytorch/pull/58700#issuecomment-1363018222, even mentions how the incomplete beta function would be useful in calculating the CDF.

Until either the Student's T CDF, hypergeometric function, or the incomplete beta function is implemented in torch and usable with CUDA, I do not see how we can compute the CDF. Let me know if you have any ideas.

kordk commented 1 year ago

I was playing around with TensorFlow Probability (following this guide: https://towardsdatascience.com/introduction-to-tensorflow-probability-6d5871586c0e) and was able to generate a CDF from a T distribution using only tensorflow library functions (see below). Do I misunderstand the cdf() function being available here versus it not being available (per Liam's comments above)?

That said, I have no idea how this will work with tensors (i.e., not individual scores) or why the p-value calculated from the cdf value does not appear to be valid (i.e., >1.0).

```python
>>> import tensorflow_probability as tfp
>>> tfd = tfp.distributions
>>> n = 300
>>> t_dist = tfd.StudentT(df=n-2, loc=0, scale=1)
>>> cdf = t_dist.cdf(.1)
>>> print('CDF:', cdf.numpy())
CDF: 0.53982925
>>> print('2-sided p-value:', 2*cdf.numpy())
2-sided p-value: 1.0796585083007812
```

rituroy commented 1 year ago

Given the large sample size, the t-test and the z-test should be essentially the same, so we should be able to use a Gaussian distribution to estimate the p-value.

liamgd commented 1 year ago

https://github.com/kordk/torch-ecpg/issues/26#issuecomment-1377825020: In tensorflow, the Student's T distribution has a CDF function. Overall, tensorflow has slightly more functions than pytorch, including the hypergeometric function among others; torch does not yet have these. Converting a CUDA torch tensor to tensorflow is a very slow operation. Alternatively, tecpg could move to tensorflow completely, but this migration could bring up unexpected issues due to the minor differences between the packages.

https://github.com/kordk/torch-ecpg/issues/26#issuecomment-1378216318: I was wondering about this as well, as the Student's T distribution approximates the normal distribution for many degrees of freedom.

The GTP data, for reference, has 340 samples, and thus, 336 degrees of freedom. Using R's t statistic for methylation in the first comment, 1.877, and 336 degrees of freedom, we obtain the following values (calculations done in scipy):

normal_p = 0.9697409530685338
student_p = 0.9693077874772256
percent_error = 0.04468813692662252%

For 50 degrees of freedom (54 samples) and the same t statistic, we have:

normal_p = 0.9697409530685338
student_p = 0.9668205118453522
percent_error = 0.3020665353497092%

Are these errors reasonable? The fewer the degrees of freedom, the higher the error this approximation will yield.

Test code:

```python
import scipy.stats

normal_cdf = scipy.stats.distributions.norm.cdf
student_cdf = scipy.stats.distributions.t.cdf

t_stat = 1.877
df = 336
normal_p = normal_cdf(t_stat)
student_p = student_cdf(t_stat, df)
percent_error = abs((normal_p - student_p) / student_p * 100)
print(f'{normal_p = }')
print(f'{student_p = }')
print(f'{percent_error = }%')
```

rituroy commented 1 year ago

p-value = 2 × ( - abs(cdf(x)) )

kordk commented 1 year ago

Nice work!

Let's move forward with using the Gaussian distribution with Ritu's p-value calculation and continue to use torch. We can revisit migrating to tensorflow in a future major release.

We will make a note of the difference in the p-value estimate in the documentation and include Liam's differences for n=50 and n=300 so the user can make their own decisions.

liamgd commented 1 year ago

I will work on implementing the p-value from the normal/gaussian distribution CDF. Hopefully, torch will implement the Student's T CDF or one of its constituent functions before such a future major release to avoid the migration.

Regarding https://github.com/kordk/torch-ecpg/issues/26#issuecomment-1379289472, p-value = 2 × ( - abs(cdf(x)) ): wouldn't this guarantee that the p-value is negative, and thus out of the [0, 1] range?

liamgd commented 1 year ago

I will start working once the above comment is addressed. Is the two-sided p-value $2\times (-\text{abs}(\text{cdf}(x)))$ (https://github.com/kordk/torch-ecpg/issues/26#issuecomment-1379289472), $e^{2\times \text{cdf}(x)}$ (https://github.com/kordk/torch-ecpg/issues/26#issuecomment-1370423022), or $-2\times (x-1)$ (https://github.com/kordk/torch-ecpg/issues/26#issuecomment-1363358931)?

rituroy commented 1 year ago

Sorry for the typo in the previous message, the p-value would be 2 × cdf( - abs(x) )

liamgd commented 1 year ago

Implemented in a413de4. 0.01896% error rate with 296 degrees of freedom, tested against https://www.socscistatistics.com/pvalues/tdistribution.aspx.
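For reference, a minimal sketch of the approach (the actual code in a413de4 may differ):

```python
import torch

def normal_p_value(t_stats: torch.Tensor) -> torch.Tensor:
    """Two-sided p-value via the Gaussian approximation: 2 * cdf(-|t|)."""
    normal = torch.distributions.Normal(0.0, 1.0)
    return 2.0 * normal.cdf(-t_stats.abs())
```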