shenorrLabTRDF / bseqsc

Bulk-Sequence Single-Cell Gene Expression Deconvolution Pipeline

the number of coefficients (proportions of cell types) in extended model #3

Open gurkanbal opened 7 years ago

gurkanbal commented 7 years ago

Hi,

Bseq_Sc is a great tool. I like it!

However, there is a problem with fitEdgeR: if I try to run an analysis with five or more cell types, fitEdgeR returns the following error:

"Error in glmFit.default(sely, design, offset = seloffset, dispersion = 0.05, : Design matrix not of full rank. The following coefficients not estimable: Microglia # (the last of coefficients )"

Nevertheless, it works well with four or fewer cell types. Is this a bug? Do you have any advice for analyses with five or more cell types?

Best, Gürkan

renozao commented 7 years ago

The limit in the number of coefficients you can estimate is driven by your sample size. You typically need at least n+2 samples to estimate n coefficients (counting all coefficients in the model: intercept, covariates, cell types, group of interest).

How many samples do you have (number of columns in eset) and how many coefficients in the model?

gurkanbal commented 7 years ago

eset contains 156 samples (number of columns in eset), and the model contains 2 covariates, 5 cell types and the group of interest.

It looks as follows:

"fit_edger_ext <- fitEdgeR(eset, ~ Gender + ApoE + oligodendrocytes + astrocytes + microglia + neurons + endothelial + diagnosis_class, coef = 'diagnosis_classDisease_AD')"

This command returned the following error message:

Error in glmFit.default(sely, design, offset = seloffset, dispersion = 0.05, : Design matrix not of full rank. The following coefficients not estimable: endothelial

But if any one of the cell types is removed and the model is run with 4 cell types, it works.

renozao commented 7 years ago

The cell type proportions probably sum up to one within each sample, which makes them together collinear with the intercept in the model. This is why removing any one of them makes the model completely estimable.
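To illustrate the collinearity, here is a minimal base R sketch. It does not use bseqsc or edgeR; the proportions are simulated, and the sample size, seed and object names are assumptions made only for this example. It builds a design matrix with an intercept plus five proportions that sum to one per sample, and compares its rank with the number of columns.

```r
set.seed(1)
n <- 20  # hypothetical number of samples

# Simulated proportions for five cell types; each row is rescaled to sum to one
raw <- matrix(runif(n * 5), nrow = n)
props <- as.data.frame(raw / rowSums(raw))
colnames(props) <- c("oligodendrocytes", "astrocytes", "microglia",
                     "neurons", "endothelial")

# Intercept + all five proportions: the five columns add up to the intercept
# column, so the design has one more column than its rank
design_full <- model.matrix(~ ., data = props)
rank_full <- qr(design_full)$rank

# Dropping any one cell type (here endothelial) removes the dependency
props_drop <- props[, colnames(props) != "endothelial"]
design_drop <- model.matrix(~ ., data = props_drop)
rank_drop <- qr(design_drop)$rank

cat("all five cell types:", ncol(design_full), "columns, rank", rank_full, "\n")
cat("one cell type dropped:", ncol(design_drop), "columns, rank", rank_drop, "\n")
```

With real data, the same check can be run on the design matrix built from the sample annotation before calling the fitting function.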

I would try correcting only for cell types that either show significant differences between the diagnosis_class groups or are very dominant relative to the other cell types.
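One possible way to screen cell types along these lines is sketched below in base R. The choice of a Wilcoxon rank-sum test with Benjamini-Hochberg adjustment is an assumption for illustration, not something prescribed in this thread, and the proportions and group labels are simulated placeholders.

```r
set.seed(2)
n <- 40  # hypothetical number of samples

# Hypothetical group labels, named after the variable used in the model above
diagnosis_class <- factor(rep(c("Control", "Disease_AD"), each = n / 2))

# Simulated proportions for five cell types; each row sums to one
raw <- matrix(runif(n * 5), nrow = n)
props <- raw / rowSums(raw)
colnames(props) <- c("oligodendrocytes", "astrocytes", "microglia",
                     "neurons", "endothelial")

# Per cell type: test for a difference in proportion between the two groups
p_values <- apply(props, 2, function(p) wilcox.test(p ~ diagnosis_class)$p.value)
p_adjusted <- p.adjust(p_values, method = "BH")

# Mean proportion per cell type, to spot dominant cell types
mean_props <- colMeans(props)

summary_table <- data.frame(mean_proportion = mean_props,
                            p_value = p_values,
                            p_adjusted = p_adjusted)
print(summary_table[order(summary_table$p_adjusted), ])
```

Cell types with a small adjusted p-value or a large mean proportion would then be the candidates to keep as covariates, leaving at least one cell type out so the design stays full rank.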

charlesgwellem commented 5 years ago

If you followed this tutorial through to the end, could you please share how you prepared the expression set?