theislab / diffxpy

Differential expression analysis for single-cell RNA-seq data.
https://diffxpy.rtfd.io
BSD 3-Clause "New" or "Revised" License

Raw counts for testing of continuous covariates #187

Open dsg03 opened 3 years ago

dsg03 commented 3 years ago

Hi,

Thank you for developing this great package! I am testing for continuous covariates in my single-cell data. Please see the size factor formula and the function call:

size_factors = np.mean(X_pp.toarray(), axis=1) / np.mean(np.mean(X_pp.toarray(), axis=1))

con_1d = de.test.continuous_1d(data=X_pp.toarray(), formula_loc='~ 1 + pseudotime + repeat + n_genes + total_counts', formula_scale='~ 1', factor_loc_totest='pseudotime', continuous='pseudotime', size_factors=size_factors, df=4, sample_description=adata.obs, gene_names=adata.raw.var_names, as_numeric=['n_genes', 'total_counts'], quick_scale=False, dtype='float64')

X_pp is a cells × genes matrix of raw counts.
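For reference, the size factor formula above can be sketched on a toy matrix. This is only an illustration of the arithmetic: the helper name `mean_normalized_size_factors` and the toy data are made up here and are not part of diffxpy. It assumes the input is cells × genes and that calling `.mean(axis=1)` directly is acceptable, which avoids the two `toarray()` calls when the input is a SciPy sparse matrix.

```python
import numpy as np


def mean_normalized_size_factors(counts):
    """Per-cell mean count divided by the average of those per-cell means.

    Hypothetical helper for illustration only; `counts` is cells x genes.
    """
    # .mean(axis=1) is available on NumPy arrays and on SciPy sparse matrices;
    # np.asarray(...).ravel() flattens the (n_cells, 1) result a sparse input gives.
    per_cell_mean = np.asarray(counts.mean(axis=1), dtype=float).ravel()
    return per_cell_mean / per_cell_mean.mean()


# Toy data: 3 cells x 3 genes of raw counts.
toy_counts = np.array([[0, 2, 4],
                       [1, 1, 1],
                       [0, 0, 3]])
size_factors = mean_normalized_size_factors(toy_counts)
print(size_factors)
```

By construction the resulting size factors average to 1. A cell with no counts at all would get a size factor of 0, so such cells would need to be filtered out beforehand.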

I have 2 questions:

  1. For more basic differential expression testing, such as with the Wald test, I have read that it is best to use raw counts (no normalization, log transformation, or scaling). As such, I have used the raw counts for testing continuous covariates. However, many genes have counts of 0, likely due to the high dropout rate of single-cell data. The model fit seems to be affected by the many 0 counts. Please see the two pictures below for the fitted curves for 2 genes. Is there a way to account for 0 counts in this type of test?
  2. I have tried using gene expression after imputation with the MAGIC package. The genes with low p-values are much more biologically relevant than when I use raw counts. However, the fitted curve then lies below where most of the y-axis (gene expression) values are located. What is the reason for this? Also, the expression matrix is no longer raw counts. Is it still appropriate for this test?
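To make both questions concrete, here is a minimal sketch of how the zero fraction per gene and the "is this still raw counts" property could be checked. The helper names and toy data are hypothetical and not part of diffxpy or MAGIC; the `+ 0.25` shift is only a stand-in for an imputed matrix, not actual MAGIC output.

```python
import numpy as np


def zero_fraction_per_gene(counts):
    """Fraction of cells with a count of exactly 0, per gene (column)."""
    counts = np.asarray(counts)
    return (counts == 0).mean(axis=0)


def looks_like_raw_counts(values):
    """True if every entry is a non-negative whole number."""
    values = np.asarray(values, dtype=float)
    return bool(np.all(values >= 0) and np.all(values == np.round(values)))


# Toy data: 4 cells x 3 genes of raw counts.
raw = np.array([[0, 3, 0],
                [0, 1, 2],
                [5, 0, 0],
                [0, 0, 1]])
# Stand-in for an imputed/smoothed matrix (assumption, not real MAGIC output).
smoothed = raw + 0.25

zero_frac = zero_fraction_per_gene(raw)
print(zero_frac)
print(looks_like_raw_counts(raw), looks_like_raw_counts(smoothed))
```

On my real data, a check like the second one fails after imputation, which is why I am unsure whether a count-based model is still the right choice there.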

@davidsebfischer