t-neumann / slamdunk

Streamlining SLAM-seq analysis with ultra-high sensitivity
GNU Affero General Public License v3.0

mRNA half-life calculation #138

Closed ZoeyYang912 closed 8 months ago

ZoeyYang912 commented 10 months ago

Hello,

I have two cell types (nodox and addox) that I did pulse/chase SLAM-seq on. I pulsed for 24 hours and chased at 0 h, 1 h, 5 h and 12 h. I ran slamdunk all via Docker and obtained tsv files, and I am using the TcReadCount and ReadCount columns for the analysis. I used DESeq2 to normalize TcReadCount, using ReadCount for global normalization as suggested in issue #70:
ddsRC <- DESeqDataSetFromMatrix(countData = ReadCount, colData = Pheno, design = ~1)
ddsRC <- estimateSizeFactors(ddsRC)
ddsTcRC <- DESeqDataSetFromMatrix(countData = TcReadCount, colData = Pheno, design = ~1)
sizeFactors(ddsTcRC) <- sizeFactors(ddsRC)
TcReadCounts_normalized <- counts(ddsTcRC, normalized = TRUE)

Once I obtained the normalized TcReadCount, I subtracted the no-s4U TcReadCount from each chased time point to remove "background", and normalized 1 h, 5 h and 12 h to 0 h. So 0 h is 1, and the 1 h, 5 h and 12 h values are fold changes relative to 0 h. My data looked like this:

[image: table of normalized fold changes relative to 0 h]
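For reference, a minimal sketch of the background subtraction and 0 h normalization described above (TcReadCounts_normalized is the matrix from the DESeq2 step; the column labels are assumptions about the sample names):

# Background subtraction and 0 h normalization (column labels are assumptions)
chase_cols   <- c("0h", "1h", "5h", "12h")
bg_corrected <- TcReadCounts_normalized[, chase_cols] - TcReadCounts_normalized[, "no_s4U"]
fold_change  <- sweep(bg_corrected, 1, bg_corrected[, "0h"], "/")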

I ran a simple exponential decay model using the Levenberg-Marquardt algorithm; however, the half-life for every gene came out as either 0 or 1.386294, where 1.386294 comes from k = 0.5.

I wonder if there is anything wrong with my method or my code, and how I should calculate the mRNA half-life for each transcript. I am attaching my code and output at the end. As I am new to this analysis, any suggestions and comments will be greatly appreciated!

library(minpack.lm)  # for nlsLM() and nls.lm.control()

time_points <- as.numeric(colnames(nodox))

analyze_genes <- function(nodox) {
  # Define the exponential decay model with a plateau term
  exp_decay_plat <- function(t, y0, Plat, k) {
    Plat + (y0 - Plat) * exp(-k * t)
  }

  # Initialize a vector for half-lives
  half_lives <- numeric(nrow(nodox))

  # Extract time points and gene names
  time_points <- as.numeric(colnames(nodox))
  gene_names <- rownames(nodox)

  # Loop over each gene
  for (i in 1:nrow(nodox)) {
    gene_data <- data.frame(time = time_points, y = as.numeric(nodox[i, ]))

    # Fit the model
    tryCatch({
      model <- nlsLM(y ~ exp_decay_plat(time, y0, Plat, k),
                     data = gene_data,
                     start = list(y0 = 1, Plat = 0, k = 0.5),
                     upper = c(1, 0, Inf), lower = c(1, 0, 0),
                     control = nls.lm.control(maxiter = 1000),
                     na.action = na.omit)

      # Calculate half-life
      parameters <- coef(model)
      half_lives[i] <- log(2) / parameters["k"]
    }, error = function(e) {
      cat("Error in gene:", gene_names[i], "\n")
      half_lives[i] <- NA
    })
  }

  # Combine gene names with their half-lives
  half_life_df <- data.frame(gene = rownames(nodox), half_life = half_lives)
}

output:

[image: table of half-life estimates, mostly 0 or 1.386294]
t-neumann commented 10 months ago

Hi - what I would try is to simply use the conversionRates or fractions (tcreads/totalreads) as input to estimate the half-lives, instead of the elaborate DESeq2-based normalization and error-correction step.
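For example (a sketch; tsv here is assumed to be a data frame read from one of the slamdunk tcount tsv files, with the TcReadCount and ReadCount columns mentioned above):

# Sketch: fraction of T>C reads (tcreads / totalreads) as input for the half-life fit
fraction <- tsv$TcReadCount / tsv$ReadCount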

isaacvock commented 10 months ago

While I agree with Tobias that your normalization strategy is a bit overengineered, just plotting the data for one gene (e.g., the one at the top of the table) reveals a nice exponential-ish decay curve that should be easily fit, suggesting a larger problem with the fitting strategy.

I think the solution is simply to adjust your initial guess. Your initial guess of 0.5 min^-1 for the rate constant is pretty high. RNA half-lives in mice are usually on the order of several hours, with the average being around 4 hours in mESCs. A 4 hour half-life corresponds to a degradation rate constant of about 0.003 min^-1 (log(2)/240 min), more than two orders of magnitude lower than your current initial guess. This could be causing overflow problems when calculating the gradient, leading to the weird fitting behavior. Below is a reproducible example so that you can see what I am talking about:

# Load dependencies ------------------------------------------------------------

library(dplyr)
library(ggplot2)
library(tibble)
library(minpack.lm)

# Issue #138 (mRNA half-life calculation) --------------------------------------

# function to define exponential decay
exp_decay_plat <- function(t, Plat, y0, k) {
  Plat + (y0 - Plat) * exp(-k * t)
}

# Data for one gene
data <- tibble(time = c(0, 60, 300, 1200),
               y = c(1, 0.7022481, 0.37065, 0.0211))

# Initial k of 0.5
model_original <- nlsLM(y ~ exp_decay_plat(time, Plat, y0, k),
               data = data,
               start = list(y0 = 1, Plat = 0, k = 0.5), 
               upper = c(1, 0, Inf), lower = c(1, 0, 0), 
               control = nls.lm.control(maxiter = 1000), 
               na.action = na.omit)

# Initial k of 0.001
model_better_init <- nlsLM(y ~ exp_decay_plat(time, Plat, y0, k),
                        data = data,
                        start = list(y0 = 1, Plat = 0, k = 0.001), 
                        upper = c(1, 0, Inf), lower = c(1, 0, 0), 
                        control = nls.lm.control(maxiter = 1000), 
                        na.action = na.omit)

# Assess model fits
summary(model_original)
  # k = 0.5

summary(model_better_init)
  # k = 0.00375

### Try out several ks to reveal when things go awry

# Initial k values to test
ks <- seq(from = 0.001, to = 0.5, length.out = 300)

# Vector of k estimates
result <- rep(0, times = length(ks))

# Try each initial value
for(i in seq_along(ks)){

  init <- ks[i]

  model_diff_init <- nlsLM(y ~ exp_decay_plat(time, Plat, y0, k),
                             data = data,
                             start = list(y0 = 1, Plat = 0, k = init), 
                             upper = c(1, 0, Inf), lower = c(1, 0, 0), 
                             control = nls.lm.control(maxiter = 1000), 
                             na.action = na.omit)

  estimate <- coef(model_diff_init)
  result[i] <- estimate['k']

}

# Plot result
plot_df <- tibble(estimate = result,
                  initial_k = ks)

ggplot(plot_df, aes(x = initial_k,
                    y = estimate)) + 
  geom_point() +
  theme_classic() + 
  xlab("Initial k estimate") + 
  ylab("Final k estimate")

Here is the plot made at the end of this code block, showing what the model's estimate is as a function of initial value. You can see the point where the model breaks down and begins spitting out the initial value as its estimate.

[plot: final k estimate as a function of the initial k value]

The 0s in your table of estimates, on the other hand, seem to be the result of NAs in your data.

Also, is there any reason the time noted in the table for what I assume to be the 12h chase is 1200 rather than 720 (12*60)?

Finally, since the half-lives of individual transcripts can vary by several orders of magnitude, it might be best to determine a good initial guess from the data itself. For example, you can calculate a rough degradation rate constant from a single time point as -log(y)/time (since y = exp(-k * time) here) and use that as your initial guess, or perform this calculation for all of the time points and use the average as your initial guess.
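A minimal sketch of that data-driven initialization (assuming nodox is the matrix of normalized fractions from your code above, with chase times as its column names; init_k is just a hypothetical helper name):

# Sketch: per-gene initial guess for k, derived from the data itself
init_k <- function(y, t) {
  k <- -log(y) / t               # per-time-point rate estimates, from y = exp(-k * t)
  k <- k[is.finite(k) & t > 0]   # drop t = 0 and any non-finite values
  if (length(k) == 0) NA_real_ else mean(k)
}

time_points <- as.numeric(colnames(nodox))
k_starts    <- apply(nodox, 1, init_k, t = time_points)

Each gene's k_starts value could then replace the fixed k = 0.5 in the start list of the nlsLM call.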

Best, Isaac

ZoeyYang912 commented 10 months ago

Thank you @t-neumann @isaacvock for the input!

I tried using both the DESeq2-normalized TcReadCount and the ConversionRate as input; I don't think the choice of input shifts the trends in half-lives too much, but I will keep an eye on it when looking at specific genes.

@isaacvock Thanks a lot for looking into this issue! You are right, 1200 should be 720; that's a typo on my end. I played around with k values and found that k = 0.5 does not fit the data well when I use minutes as the time unit (0, 60, 300, 720). But if I switch to hours (0, 1, 5, 12), k = 0.5 works as a starting point for the model, and I generated a half-life table like this. I guess these values now look reasonable:

[image: table of half-life estimates in hours]
isaacvock commented 10 months ago

No problem! Your idea to convert from minutes to hours makes a lot of sense, and I agree that the new output looks reasonable.

ZoeyYang912 commented 9 months ago

An additional question... So if I have triplicates for each time point, should I average the conversionRates or fractions (tcreads/totalreads) across the triplicates and then fit the averaged values to calculate the half-life? In that case I would get one half-life value per transcript, and I guess there is no way to do statistics with a single half-life value?

t-neumann commented 9 months ago

Well, I guess you could also fit a curve through the three points (triplicates) per time point and compute some kind of curve-fit R-squared measure that would tell you how well the curve fits?
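If you go that route, one simple option would be a pseudo-R-squared computed from the fit residuals (a sketch; model and gene_data refer to the objects in the fitting code earlier in the thread):

# Sketch: pseudo-R-squared for a nonlinear fit
rss <- sum(residuals(model)^2)                    # residual sum of squares
tss <- sum((gene_data$y - mean(gene_data$y))^2)   # total sum of squares
pseudo_r2 <- 1 - rss / tss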

isaacvock commented 9 months ago

I suspect the goal is to do some sort of comparative analysis of half-life values +/- dox. In that case, the easiest rigorous thing I could imagine you doing is a linear fit of log(tcreads/totalreads) vs. chase time using all data points for all chase times, and using the slope estimate along with its standard error to generate a test-statistic for a comparative analysis of the estimated slopes between the two conditions. An example of what that could look like is:

# Number of replicates in each dataset
n_nodox <- 9
n_dox <- 9

# Generate simulated data for two datasets

# Simulate data for the no dox condition (3 replicates at each chase time)
tchase_nodox <- rep(c(1, 5, 12), each = 3)
log_fn_nodox <- -0.1*tchase_nodox + rnorm(n_nodox, mean = 0, sd = 0.05)  # Assuming a kdeg of 0.1

# Simulate data for the +dox condition
tchase_dox <- rep(c(1, 5, 12), each = 3)
log_fn_dox <- -0.4*tchase_dox + rnorm(n_dox, mean = 0, sd = 0.05)  # Assuming a kdeg of 0.4

nodox_data <- data.frame(x = tchase_nodox, y = log_fn_nodox)
dox_data <- data.frame(x = tchase_dox, y = log_fn_dox)

# Fit linear models with intercept fixed at 0
model_nodox <- lm(y ~ x + 0, data = nodox_data)
model_dox <- lm(y ~ x + 0, data = dox_data)

# Extract the slope coefficients
slope_nodox <- coef(model_nodox)["x"]
slope_dox <- coef(model_dox)["x"]

# Calculate the standard errors of the slopes
se_nodox <- summary(model_nodox)$coefficients["x", "Std. Error"]
se_dox <- summary(model_dox)$coefficients["x", "Std. Error"]

# Perform a t-test for comparing the slopes
t_value <- (slope_dox - slope_nodox) / sqrt(se_nodox^2 + se_dox^2)
df <- (se_nodox^2 / n_nodox + se_dox^2 / n_dox)^2 / ((se_nodox^2 / n_nodox)^2 / (n_nodox - 1) + (se_dox^2 / n_dox)^2 / (n_dox - 1))  # degrees of freedom
p_value <- 2 * pt(-abs(t_value), df)

With this strategy, you could get p-values for comparisons of all transcripts for which you have data in both conditions, and then multiple test-adjust those p-values with something like Benjamini-Hochberg (i.e., padj <- p.adjust(p_values, method = "BH")).

To be more explicit, the assumption is that the fraction of RNA that is s4U labeled = exp(-kdeg*tchase), so log(fraction s4U) = -kdeg*tchase. Thus, the regression slope of log(fraction s4U) vs. tchase will give you -1 * (degradation rate constant estimates), and R's lm() function will provide the standard error of the estimated slope. (tcreads/totalreads) makes sense as an estimate for fraction of reads that are from s4U labeled RNA, though you can run into problems of this being 0 and thus the log of this being infinite. To get around this, it makes sense to add some pseudocounts so that your final estimate of fraction s4U is ((tcreads + 1)/(totalreads + 2)). The exact pseudocounts used here can be understood as a relatively uninformative prior on the fraction s4U estimate.
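A minimal sketch of that pseudocount-stabilized estimate (tcreads and totalreads are assumed to be numeric vectors with one entry per feature, taken from the tcount output):

# Sketch: pseudocount-stabilized fraction s4U and its log
fraction_s4U <- (tcreads + 1) / (totalreads + 2)
log_fn <- log(fraction_s4U)   # response variable for the linear fits above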

There is a similar strategy that could be employed with the nonlinear fitting approach you are currently using to get half-life estimates, but I think it's a bit more computationally intensive to get the standard errors of the fit coefficients in that case, and I don't personally have experience doing that.

There isn't really one "right" answer though, and @t-neumann's idea of separate fits could be another way to go, in which case you could use the estimates from the individual fits as "replicates" of half-life estimates and perform a t-test with those estimates from the +/- dox samples. I think the statistical power of the strategy detailed above will be higher than this strategy, since the individual data points are treated as replicates (as they should be) rather than sets of three replicates being converted into one replicate estimate, and it might not be obvious which chase replicates to group together for the separate fits (unless there are very clear batches). You could also just calculate a kdeg estimate from each data point as -log(fraction s4U)/tchase, and then do t-tests with these replicate estimates, though that tosses out information about the expected relationship between time points.
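That last per-data-point alternative could look something like this (a sketch; the log_fn and tchase vectors are the ones from the simulated example above):

# Sketch: one kdeg estimate per data point, then a t-test between conditions
kdeg_nodox <- -log_fn_nodox / tchase_nodox
kdeg_dox   <- -log_fn_dox / tchase_dox
t.test(kdeg_dox, kdeg_nodox)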

In the end, if we're being honest, the exact strategy that you go with probably won't radically alter your conclusions, some strategies may just be easier to get by future reviewers than others ;).

ZoeyYang912 commented 9 months ago

Thank you both for such detailed explanations! I will give it a try and let you know. Thanks!

ZoeyYang912 commented 9 months ago

So I tried @isaacvock's method: I did a linear fit of ln(tcreads/totalreads) versus chase time, compared the slopes, and ran the t-test between the +/- dox groups.

Before feeding the data into the linear model, I summed up the raw TcReadCount and ReadCount values mapped to different UTR annotations of the same gene, following this paper: https://www.science.org/doi/10.1126/science.aao2793. I was not able to identify many genes with a significantly different slope (padj < 0.05) between +/- dox, only 16 out of 11895 genes.

[image: results of the +/- dox slope comparison]

These 16 genes also seemed odd: 1) for some of them the slope is positive, and 2) some genes seemed to be expressed in only the +dox or only the -dox samples. I suspect that I should do some filtering on ReadCount to remove genes that are not expressed, or maybe I should not sum up the read counts across different UTR annotations? But I agree that the analysis strategy should not radically alter the conclusion.

isaacvock commented 9 months ago

Yeah, restricting the analysis to genes with at least 50 or so reads in all samples might be a good idea for several reasons. The standard error of the regression may underestimate the uncertainty for very low-coverage genes and thus cause them to come up significant in the +/- dox comparison, when the truth is that there isn't enough data to accurately pin down the half-life at such low expression levels. This filtering will also increase your statistical power by lowering the multiple-testing burden.
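A sketch of that filter (assuming ReadCount is the genes-by-samples count matrix from earlier in the thread, and using the 50-read cutoff mentioned above):

# Sketch: keep genes with at least 50 reads in every sample
keep <- rowSums(ReadCount >= 50) == ncol(ReadCount)
ReadCount_filtered <- ReadCount[keep, , drop = FALSE]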

Even if you continue to not get a lot of "significant" genes, you could also do some gene-set analyses to see if the largest effect sizes are enriched for particular biochemical pathways, such as:

  1. Gene-set enrichment analysis ordering genes by t_value (examples of this kind of analysis with SLAM-seq and similar datasets can be found in figure 6D and supplemental figure S33 of this paper)
  2. GO analysis with the top x genes (x could be something like 100; see the sketch after this list), where "top" could be defined as:
    • Lowest t_values
    • Highest t_values
    • Highest abs(t_values)
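
A minimal sketch of that "top by abs(t_value)" selection (results is a hypothetical data frame holding the gene names and t_values from the slope comparison):

# Sketch: pick the 100 genes with the largest absolute t_value for GO analysis
top_genes <- head(results[order(-abs(results$t_value)), "gene"], 100)
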
ZoeyYang912 commented 8 months ago

Thank you! @isaacvock That's super helpful!