satijalab / sctransform

R package for modeling single cell UMI expression data using regularized negative binomial regression
GNU General Public License v3.0
208 stars 33 forks source link

Question about a code section in vst(): why are genes with actual_theta/predicted_theta < 0.001 set to theta = Inf #160

Closed ScreachingFire closed 1 year ago

ScreachingFire commented 1 year ago

Hello!

After estimating the NB model parameters within vst(), this section of code appears afterwards:

if (exclude_poisson) {
    genes_amean <- rowMeans(umi)
    genes_var <- row_var(umi)
    genes_amean_step1 <- genes_amean[genes_step1]
    genes_var_step1 <- genes_var[genes_step1]
    predicted_theta <- genes_amean_step1^2/(genes_var_step1 - genes_amean_step1)
    actual_theta <- model_pars[genes_step1, "theta"]
    diff_theta <- predicted_theta/actual_theta
    model_pars <- cbind(model_pars, diff_theta)
    diff_theta_index <- rownames(model_pars[model_pars[genes_step1, "diff_theta"] < 0.001, ])

    if (verbosity > 0) {
      message(paste("Setting estimate of ", length(diff_theta_index), 
        "genes to inf as theta_mm/theta_mle < 1e-3"))
    }

    model_pars[diff_theta_index, 1] <- Inf
    model_pars <- model_pars[, -dim(model_pars)[2]]
  }

I'm just curious about the reasoning behind this section. Prior to learning the model parameters on a subset of genes, there is already a section that filters out genes that do not seem to exhibit overdispersion:

if (do_regularize && exclude_poisson) {
    genes_amean <- rowMeans(umi)
    genes_var <- row_var(umi)
    overdispersion_factor <- genes_var - genes_amean
    overdispersion_factor_step1 <- overdispersion_factor[genes_step1]
    is_overdispersed <- (overdispersion_factor_step1 > 0)

    if (verbosity > 0) {
      message(paste("Total Step 1 genes:", length(genes_step1)))
      message(paste("Total overdispersed genes:", sum(is_overdispersed)))
      message(paste("Excluding", length(genes_step1) - 
        sum(is_overdispersed), "genes from Step 1 because they are not overdispersed."))
    }

    genes_step1 <- genes_step1[is_overdispersed]
    genes_log_gmean_step1 <- genes_log_gmean[genes_step1]
  }

So I'm confused as to why you need to set the theta of some genes to Infinity when the predicted theta is much larger than the model-learned theta. What does this large difference between the thetas imply such that this is needed?

Any insight into this would be helpful! Thank you!

ScreachingFire commented 1 year ago

Correction: The first code section appears in get_model_pars() which is called within vst()

saketkc commented 1 year ago

theta_mm/theta_mle<0.001, gets around the problem of defining an explicit cutoff for "theta_mle" to call something poisson. For some poisson like genes (often with mean < 0.01), it is possible to get an overdispersion factor (MLE-based) like 100 due to convergence issues, but they are effectively poisson. This is what the check ensures.