Raw counts for testing of continuous covariates

Hi,

Thank you for developing this great package! I am testing continuous covariates in my single-cell data. Please see the size-factor formula and the function call:

import numpy as np
import diffxpy.api as de

# Per-cell size factors: mean count per cell, scaled so the factors average to 1.
cell_means = np.mean(X_pp.toarray(), axis=1)
size_factors = cell_means / np.mean(cell_means)

# Test for expression changes along pseudotime, adjusting for the other covariates.
con_1d = de.test.continuous_1d(
    data=X_pp.toarray(),
    formula_loc='~ 1 + pseudotime + repeat + n_genes + total_counts',
    formula_scale='~ 1',
    factor_loc_totest='pseudotime',
    continuous='pseudotime',
    size_factors=size_factors,
    df=4,
    sample_description=adata.obs,
    gene_names=adata.raw.var_names,
    as_numeric=['n_genes', 'total_counts'],
    quick_scale=False,
    dtype='float64',
)

X_pp is a cell-by-gene matrix of raw counts (cells as rows, genes as columns).
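
To make the size-factor formula above concrete, here is a minimal sketch on a made-up 3 x 3 count matrix. It uses plain Python lists instead of NumPy and a sparse matrix so that it runs without dependencies; `toy_counts` and `size_factors_from_counts` are hypothetical names for illustration and are not part of diffxpy.

```python
from statistics import fmean


def size_factors_from_counts(counts):
    """Return one size factor per cell (row): the cell's mean count divided
    by the average of all per-cell means, so the factors average to 1."""
    cell_means = [fmean(row) for row in counts]
    overall_mean = fmean(cell_means)
    return [m / overall_mean for m in cell_means]


# Made-up counts: 3 cells (rows) x 3 genes (columns).
toy_counts = [
    [0, 2, 4],
    [1, 1, 1],
    [0, 0, 3],
]

size_factors = size_factors_from_counts(toy_counts)
print(size_factors)
```

The first cell has twice the mean count of the other two, so its size factor is twice theirs, and the factors average to 1 by construction.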

I have 2 questions:

1. For more basic differential expression testing, such as with the Wald test, I have read that it is best to use raw counts (no normalization, log transformation, or scaling). I have therefore used raw counts for testing continuous covariates as well. However, many genes have counts of 0, likely due to the high dropout rate of single-cell data, and the model fit seems to be affected by these zeros. Please see the two pictures below for the fitted curves of 2 genes. Is there a way to account for 0 counts in this type of test?

2. I have also tried using gene expression values after imputation with the MAGIC package. The genes with low p-values are then much more biologically relevant than when I use raw counts. However, the fitted curve then lies below where most of the gene expression values (y-axis) are located. What is the reason for this? Also, the expression matrix no longer contains raw counts. Is it still appropriate for this test?
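
As a quick way to quantify the two concerns above, the sketch below reports the fraction of zero entries per gene and checks whether a matrix still contains only non-negative whole numbers, which typically no longer holds after imputation. It is written in plain Python on made-up data; `zero_fraction_per_gene`, `looks_like_raw_counts`, and both toy matrices are hypothetical illustrations, not diffxpy or MAGIC functions.

```python
def zero_fraction_per_gene(matrix):
    """Fraction of cells with a zero entry, for each gene (column)."""
    n_cells = len(matrix)
    return [sum(1 for v in column if v == 0) / n_cells for column in zip(*matrix)]


def looks_like_raw_counts(matrix):
    """True if every entry is a non-negative whole number."""
    return all(v >= 0 and float(v).is_integer() for row in matrix for v in row)


# Made-up data: 4 cells (rows) x 3 genes (columns).
raw_toy = [
    [0, 3, 0],
    [0, 1, 0],
    [2, 0, 0],
    [0, 4, 1],
]
# Made-up smoothed values standing in for an imputed matrix.
imputed_toy = [
    [0.4, 2.7, 0.1],
    [0.3, 1.8, 0.2],
    [1.1, 1.2, 0.2],
    [0.5, 3.1, 0.6],
]

print(zero_fraction_per_gene(raw_toy))
print(looks_like_raw_counts(raw_toy))
print(looks_like_raw_counts(imputed_toy))
```

On the real data, the same two checks could be run on X_pp and on the imputed matrix to see how sparse each gene is and whether the input still consists of counts.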

@davidsebfischer

theislab / diffxpy

Raw counts for testing of continuous covariates #187