kordk / torch-ecpg

(GPU accelerated) eCpG mapper
BSD 3-Clause "New" or "Revised" License
2 stars 0 forks source link

Linear regression approach #10

Closed kordk closed 1 year ago

kordk commented 1 year ago

We would like to implement a linear regression to test for an association between methylation levels (dependent variable) and gene expression and other co-variates (independent variables).

The goal is to regress log expression signals for one transcript on methylation β-values for a single CpG, while controlling for fixed effects (e.g., age and sex for GTP). Following other examples, this is the model for the eQTM analysis:

   Methylation_Beta_Score ~ GeneExpression_Level + age + sex

Here is the first and only instructions I could find about obtaining model parameters: https://python-bloggers.com/2022/03/multiple-linear-regression-using-tensorflow/

We can us this code to obtain a t-value for each co-variate using tensors (see example below). Given a t-value for a co-variate, we can obtain a p-value from a 2-sided t-test.

Per Ritu: We need the cdf of the t distribution for d degrees of freedom (df) Then for a 2-sided test,

p-value = 2 * cdf(−|tscore|)

For the model

Methylation_Beta_Score ~ GeneExpression_Level + age + sex

df = N-4, where N = sample size, 2 dfs lost from the 2 continuous variables Methylation_Beta_Score & age, 2 dfs lost due to 2 levels of sex (male & female). So, the dfs will depend on the number of co-variates the user provides.

The cdf should be able to be calculated from here (as with the correlation code): https://www.tensorflow.org/probability/api_docs/python/tfp/distributions/StudentT

Here is example code using tensors to calculate the regression:

#========================================================#
# Quantitative ALM, Financial Econometrics & Derivatives 
# ML/DL using R, Python, Tensorflow by Sang-Heon Lee 
#
# Adapted from:
# https://kiandlee.blogspot.com
# https://python-bloggers.com/2022/03/multiple-linear-regression-using-tensorflow/
#========================================================#

#========================================================#
# Linear Regression model using Tensorflow 
#========================================================#

import pandas as pd
import numpy as np
from sklearn import datasets, linear_model

#========================================================#
# Load the diabetes dataset
#
# https://scikit-learn.org/stable/datasets/toy_dataset.html
#
# Number of Instances:
# 506
# Number of Attributes:
# 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
#========================================================#

X, y = datasets.load_diabetes(return_X_y=True)
nrow, ncol = X.shape; print (nrow, ncol) 
nparam = ncol+1 # number of parameters

v_row_name = np.hstack([["const"], ["X"+str(i) for i in range(1,ncol+1)]])

#========================================================#
# Using matrix formula (Tensorflow)
#========================================================#

print("\n========== using Tensorflow ==========")
import tensorflow as tf

# Current CUDNN and CUDA compatability
# https://docs.nvidia.com/deeplearning/cudnn/archives/cudnn-850/support-matrix/index.html
# nvcc -v
# Cuda compilation tools, release 11.7, V11.7.99
# /usr/include/x86_64-linux-gnu/cudnn_v*.h
# #define CUDNN_MAJOR 8
# #define CUDNN_MINOR 5
# #define CUDNN_PATCHLEVEL 0

# https://stackoverflow.com/questions/38303974/tensorflow-running-error-with-cublas
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.compat.v1.Session(config=config)

# from np.array
y = tf.constant(y, shape=[nrow, 1]) 
X = tf.constant(X, shape=[nrow, ncol])

# need double tensor
one  = tf.cast(tf.ones([nrow, 1]), tf.float64)
oneX = tf.concat([one, X], 1); # 1, X

XtX  = tf.matmul(oneX, oneX, transpose_a=True)
Xty  = tf.matmul(oneX, y,    transpose_a=True)
beta = tf.matmul(tf.linalg.inv(XtX), Xty)
err  = y - tf.matmul(oneX, beta)
s2   = tf.matmul(err, err, transpose_a=True)/(nrow - nparam)
cov_beta = s2*tf.linalg.inv(XtX)
std_err  = tf.sqrt(tf.linalg.diag_part(cov_beta))
beta = tf.reshape(beta,[nparam])

est_out   = tf.stack([beta, std_err, beta/std_err], 1)
df_out_tf = pd.DataFrame(np.asarray(est_out))
df_out_tf.columns = ["estimate", "std.err", "t-stats"]
df_out_tf.index   = v_row_name

print("\n========== Tensorflow Report ==========")
print(df_out_tf)

Here is the output:

# kord@pnldev [10:17:58] ~/proj/gpu-py/mlr $
   python test.tf.py
   442 10

   ========== using Tensorflow ==========
   2022-10-24 10:20:58.783919: I tensorflow/core/platform/[cpu_feature_guard.cc:193](http://cpu_feature_guard.cc:193/)] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
   To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
   2022-10-24 10:20:59.175524: E tensorflow/stream_executor/cuda/[cuda_blas.cc:2981](http://cuda_blas.cc:2981/)] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
   2022-10-24 10:21:03.836747: W tensorflow/stream_executor/platform/default/[dso_loader.cc:64](http://dso_loader.cc:64/)] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.7/lib64:/usr/local/cuda-11.7/lib64::/lib:/lib64:/usr/lib:/usr/lib64:/usr/local/lib:/lib:/lib64:/usr/lib:/usr/lib64:/usr/local/lib
   2022-10-24 10:21:03.837031: W tensorflow/stream_executor/platform/default/[dso_loader.cc:64](http://dso_loader.cc:64/)] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.7/lib64:/usr/local/cuda-11.7/lib64::/lib:/lib64:/usr/lib:/usr/lib64:/usr/local/lib:/lib:/lib64:/usr/lib:/usr/lib64:/usr/local/lib
   2022-10-24 10:21:03.837068: W tensorflow/compiler/tf2tensorrt/utils/[py_utils.cc:38](http://py_utils.cc:38/)] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
   2022-10-24 10:21:07.134165: I tensorflow/core/platform/[cpu_feature_guard.cc:193](http://cpu_feature_guard.cc:193/)] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
   To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
   2022-10-24 10:21:07.967043: I tensorflow/core/common_runtime/gpu/[gpu_device.cc:1616](http://gpu_device.cc:1616/)] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 719 MB memory:  -> device: 0, name: NVIDIA A2, pci bus id: 0000:81:00.0, compute capability: 8.6
   2022-10-24 10:21:07.977327: I tensorflow/core/common_runtime/gpu/[gpu_device.cc:1616](http://gpu_device.cc:1616/)] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 719 MB memory:  -> device: 0, name: NVIDIA A2, pci bus id: 0000:81:00.0, compute capability: 8.6
   2022-10-24 10:21:09.220392: I tensorflow/core/util/[cuda_solvers.cc:179](http://cuda_solvers.cc:179/)] Creating GpuSolver handles for stream 0x5649bf3f9890

   ========== Tensorflow Report ==========
            estimate     std.err    t-stats
   const  152.133484    2.575854  59.061366
   X1     -10.009866   59.749247  -0.167531
   X2    -239.815644   61.222344  -3.917126
   X3     519.845920   66.533445   7.813302
   X4     324.384646   65.421992   4.958343
   X5    -792.175639  416.679870  -1.901161
   X6     476.739021  339.030495   1.406183
   X7     101.043268  212.531457   0.475427
   X8     177.063238  161.475795   1.096531
   X9     751.273700  171.899982   4.370412
   X10     67.626692   65.984282   1.024891
liamgd commented 1 year ago

Implemented in torch with minor optimizations. There are many optimizations that could be made in the future.