theislab / diffxpy

Differential expression analysis for single-cell RNA-seq data.
https://diffxpy.rtfd.io
BSD 3-Clause "New" or "Revised" License

Error message ValueError: constrained design matrix is not full rank: 7 8 #180

Open faniafeby opened 3 years ago

faniafeby commented 3 years ago

Hello,

I am currently using diffxpy for my differential expression analysis and tried to use two factors, "time_point" and "sample", in formula_loc. My data consist of 2 time points (juvenile and adult) and 7 samples for those two time points. When I run the code, I get the following error:

test = de.test.wald(
    data=adata_lcpm_1,
    formula_loc="~ 1 + time_point + sample",
    coef_to_test="time_point",
    factor_loc_totest=["time_point", "sample"]
)
---------------------------------------------------------------------------
 ValueError Traceback (most recent call last)
<ipython-input-52-76f667c65eff> in <module>
----> 1 test = de.test.wald(
2 data=adata_lcpm_1,
3 formula_loc="~ 1 + time_point + sample",
4 coef_to_test="time_point",
5 factor_loc_totest=["time_point", "sample"]

~/anaconda3/envs/diffxpy/bin/diffxpy/diffxpy/testing/tests.py in wald(data, factor_loc_totest, coef_to_test, formula_loc, formula_scale, as_numeric, init_a, init_b, gene_names, sample_description, dmat_loc, dmat_scale, constraints_loc, constraints_scale, noise_model, size_factors, batch_size, backend, train_args, training_strategy, quick_scale, dtype, **kwargs)
645
646 # Build design matrices and constraints.
--> 647 design_loc, design_loc_names, constraints_loc, term_names_loc = constraint_system_from_star(
648 dmat=dmat_loc,
649 sample_description=sample_description,

~/anaconda3/envs/diffxpy/bin/diffxpy/diffxpy/testing/utils.py in constraint_system_from_star(dmat, sample_description, formula, as_numeric, constraints, return_type)
264 as_categorical = True
265
--> 266 return glm.data.constraint_system_from_star(
267 dmat=dmat,
268 sample_description=sample_description,

~/anaconda3/envs/diffxpy/bin/batchglm/batchglm/data.py in constraint_system_from_star(dmat, sample_description, formula, as_categorical, constraints, return_type)
248 if cmat is None:
249 if np.linalg.matrix_rank(dmat) != dmat.shape[1]:
--> 250 raise ValueError(
251 "constrained design matrix is not full rank: %i %i" %
252 (np.linalg.matrix_rank(dmat), dmat.shape[1])

ValueError: constrained design matrix is not full rank: 7 8

I found a similar issue here, but it was resolved by using the as_numeric parameter. In my case the 'sample' factor is categorical, so that method does not apply. Could you help me resolve this problem? Thank you!


I originally posted this in the tutorial GitHub repository, but it belongs here.

davidsebfischer commented 3 years ago

Hi @faniafeby, could you post the unique rows of your sample description, i.e. adata_lcpm_1.obs[["time_point", "sample"]].drop_duplicates()? Likely there is confounding between time and sample.
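One way to see such confounding numerically is to rebuild a small design matrix and compare its rank with its column count, which is the same comparison that raises the ValueError in the traceback above. The sketch below is only an illustration: the sample sheet is hypothetical (three juvenile and four adult samples, three cells each), and pd.get_dummies with drop_first=True is used as a stand-in for the treatment coding of ~ 1 + time_point + sample. It does not call diffxpy or batchglm.

```python
import numpy as np
import pandas as pd

# Hypothetical sample sheet: every sample belongs to exactly one time point.
samples = pd.DataFrame({
    "time_point": ["juvenile"] * 3 + ["adult"] * 4,
    "sample": ["S1", "S2", "S3", "S4", "S5", "S6", "S7"],
})
# Pretend each sample contributes three cells.
obs = samples.loc[samples.index.repeat(3)].reset_index(drop=True)

def treatment_design(df, columns):
    """Intercept plus drop-first dummy columns (assumed stand-in for a formula)."""
    dmat = pd.get_dummies(df[columns], drop_first=True).astype(float)
    dmat.insert(0, "Intercept", 1.0)
    return dmat

# Unique factor combinations, as requested in the comment above.
unique_rows = obs[["time_point", "sample"]].drop_duplicates()

dmat = treatment_design(obs, ["time_point", "sample"])
rank = int(np.linalg.matrix_rank(dmat.to_numpy()))
n_cols = dmat.shape[1]

print(unique_rows)
print("rank:", rank, "columns:", n_cols)
print("full rank:", rank == n_cols)
```

When each sample occurs in only one time point, the time_point column can be written as a combination of the intercept and the sample columns, so one column is redundant and the rank falls short of the column count.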

faniafeby commented 3 years ago

Hi @davidsebfischer, below are the unique rows of my dataset: [image] After removing the duplicates, this table shows only 4 rows out of my n_obs × n_vars = 19330 × 16709. I agree that there may be confounding between time and sample. Does that mean I can't use both factors together in one run? Thanks!

davidsebfischer commented 3 years ago

Yes, you have to think about what you want to model: the time effect, or the time effect while reducing the between-sample variance. If you want to do the latter, a trick for running GLMs is to change your setup to

time point, sample, rep
p16, S1, R1
p16, S2, R2
p16, S3, R3
adult, S4, R1

and fit ~1+time+rep+rep:time, which regresses out the variation between R1, R2, R3
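As a rough check of why this reparametrization helps, the sketch below compares the rank of the two designs on a hypothetical balanced layout (three samples per time point, which is an assumption and not the table above). The dummy and interaction columns are built by hand with pandas as an approximation of treatment coding; this is not how diffxpy constructs its design matrix internally.

```python
import numpy as np
import pandas as pd

# Hypothetical balanced layout: three samples per time point,
# with reps numbered within each time point.
obs = pd.DataFrame({
    "time": ["p16"] * 3 + ["adult"] * 3,
    "sample": ["S1", "S2", "S3", "S4", "S5", "S6"],
    "rep": ["R1", "R2", "R3"] * 2,
})

def dummies(series):
    """Drop-first (treatment-style) dummy columns for one categorical factor."""
    return pd.get_dummies(series, prefix=series.name, drop_first=True).astype(float)

def rank_and_columns(blocks):
    """Stack the blocks behind an intercept; return (rank, number of columns)."""
    dmat = pd.concat(blocks, axis=1)
    dmat.insert(0, "Intercept", 1.0)
    return int(np.linalg.matrix_rank(dmat.to_numpy())), dmat.shape[1]

time_d = dummies(obs["time"])
sample_d = dummies(obs["sample"])
rep_d = dummies(obs["rep"])

# Interaction columns: product of the time dummy with each rep dummy.
inter_d = rep_d.mul(time_d.iloc[:, 0], axis=0).add_suffix(":time")

confounded = rank_and_columns([time_d, sample_d])     # ~ 1 + time + sample
reparam = rank_and_columns([time_d, rep_d, inter_d])  # ~ 1 + time + rep + rep:time

print("time + sample         (rank, columns):", confounded)
print("time + rep + rep:time (rank, columns):", reparam)
```

In this balanced toy layout the second design has as many columns as its rank, while the first has one column too many. If a time point has fewer samples than the others, some rep:time columns would be empty, so the rank is worth re-checking on the real .obs table.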

faniafeby commented 3 years ago

My purpose is the latter, so should I make a new obs column to represent the sample and time point combination, and then run diffxpy as mentioned?

Thank you for the help!

davidsebfischer commented 3 years ago

You can just add the rep column to .obs; you don't have to recreate it!
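A minimal sketch of deriving such a rep column is shown below. The helper add_rep_column and the toy table are hypothetical; the column names time_point and sample are taken from the issue, and samples are simply numbered R1, R2, ... within each time point in order of first appearance.

```python
import pandas as pd

# Hypothetical per-cell table standing in for adata.obs.
obs = pd.DataFrame({
    "time_point": ["p16", "p16", "p16", "p16", "adult", "adult", "adult"],
    "sample": ["S1", "S1", "S2", "S3", "S4", "S5", "S5"],
})

def add_rep_column(df, group_col="time_point", sample_col="sample"):
    """Return a copy with a 'rep' column numbering samples within each group."""
    seen = {}   # group value -> {sample name -> replicate number}
    reps = []
    for group, sample in zip(df[group_col], df[sample_col]):
        numbering = seen.setdefault(group, {})
        if sample not in numbering:
            numbering[sample] = len(numbering) + 1
        reps.append("R%d" % numbering[sample])
    out = df.copy()
    out["rep"] = reps
    return out

obs_with_rep = add_rep_column(obs)
print(obs_with_rep.drop_duplicates())
```

On an AnnData object the resulting values could be assigned with adata.obs["rep"] = add_rep_column(adata.obs)["rep"], assuming .obs uses the same column names. The numbering works for any number of samples per time point.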

shappiron commented 1 year ago

Same issue. Is there a way to generalize this trick if I have 8 samples for the young group and 8 samples for the old group (16 unique groups in total)? I think it more closely resembles the case with embedded effects.