theislab / diffxpy

Differential expression analysis for single-cell RNA-seq data.
https://diffxpy.rtfd.io
BSD 3-Clause "New" or "Revised" License

Test between two clusters of a Scanpy object #92

Open Gpasquini opened 5 years ago

Gpasquini commented 5 years ago

I am having a hard time performing a simple Wald test between two specified clusters of a data set.

I annotate the sample description of my Scanpy AnnData. To test one group versus all the rest of the cells, this command works:

test = de.test.versus_rest(
    data=adata.raw,
    grouping="condition",
    test="wald",
    noise_model="nb",
    sample_description=adata.sample_description,
    batch_size=100,
    training_strategy="DEFAULT",
    dtype="float64"
)

*adata.sample_description is a DataFrame with one column 'condition'

Let's say that condition has six labels: 'endo', 'neuron', 'fibro', 'beta', 'alpha', 'stem'. How can I test for differentially expressed genes between 'neuron' and 'stem'?

This is what I am trying to run:

test = de.test.wald(
    data=adata.raw,
    formula_loc="~ neuron - stem",
    factor_loc_totest="condition",
    grouping="condition",
    noise_model="nb",
    sample_description=adata.sample_description,
    training_strategy="DEFAULT",
    dtype="float64"
)

returning: [...] ~/Library/Python/3.7/lib/python/site-packages/patsy/compat.py in

PatsyError: Error evaluating factor: NameError: name 'neuron' is not defined
    ~ neuron - stem
      ^^^^^^

davidsebfischer commented 5 years ago

Hi @Gpasquini! In differential expression analysis on categorical covariates, in R and in Python, we often refer to two different things: (1) factors/covariates/predictors, which correspond to sets of parameters, and (2) levels/groups, which correspond to individual parameters. 'endo', 'neuron', 'fibro', 'beta', 'alpha', 'stem' are all groups: they are the unique entries of a list of strings or of a pandas categorical series. "condition" is the covariate here. Patsy is complaining because your formula names groups ('neuron', 'stem') where it expects a covariate.
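To make the factor/level distinction concrete, here is a minimal pure-Python sketch of treatment (dummy) coding, the default scheme patsy applies to categorical covariates. It is an illustration only, not patsy's implementation; the function name `treatment_code` and the toy labels are made up for this example. The formula names the covariate, and every non-reference level becomes one column of the design matrix, that is, one parameter.

```python
def treatment_code(values, name="condition"):
    """Expand one categorical covariate into an intercept plus one
    0/1 indicator column per non-reference level (treatment coding)."""
    levels = sorted(set(values))   # the groups (levels) of the covariate
    others = levels[1:]            # the first level serves as the reference
    columns = ["Intercept"] + [f"{name}[T.{lvl}]" for lvl in others]
    rows = [[1] + [int(v == lvl) for lvl in others] for v in values]
    return columns, rows


# Hypothetical toy labels, one per cell.
condition = ["neuron", "stem", "neuron", "endo"]
columns, rows = treatment_code(condition)

print(columns)
for row in rows:
    print(row)
```

With only two levels left in the covariate, the design matrix has a single non-intercept column, and testing the covariate is the same as comparing the two groups.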

Differential expression analysis is usually interfaced so that the user requests to test covariates rather than groups. This is done because it gives the framework freedom to assemble parameters as required, which can be complex if there are confounders or interaction terms, for example. So to run diffxpy on the comparison of two groups, you should first subset the adata object and then test the covariate "condition", which now contains only two unique groups:

# Subset the data and the sample description with the same mask so that their rows match.
keep = adata.obs["condition"].isin(["neuron", "stem"]).values
test = de.test.wald(
    data=adata[keep, :].raw,
    formula_loc="~ 1 + condition",
    factor_loc_totest="condition",
    noise_model="nb",
    sample_description=adata.sample_description.loc[keep, :],
    training_strategy="DEFAULT",
    dtype="float64"
)
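The subsetting step on its own can be sketched with plain pandas. This is a minimal example under stated assumptions: the toy `obs` frame stands in for `adata.obs` (or the sample description) and its contents are invented. The point is that one boolean mask is built once and reused, and that levels which no longer occur are dropped so the covariate really has two groups.

```python
import pandas as pd

# Toy stand-in for adata.obs / the sample description (hypothetical data).
obs = pd.DataFrame(
    {"condition": pd.Categorical(["endo", "neuron", "stem", "neuron", "beta", "stem"])},
    index=[f"cell{i}" for i in range(6)],
)

# One boolean mask, to be reused for both the expression data and the metadata.
keep = obs["condition"].isin(["neuron", "stem"]).to_numpy()

obs_sub = obs.loc[keep].copy()
# Drop levels that no longer occur after subsetting.
obs_sub["condition"] = obs_sub["condition"].cat.remove_unused_categories()

print(obs_sub["condition"].cat.categories.tolist())
print(obs_sub.index.tolist())
```

The same `keep` array can index an AnnData object as `adata[keep, :]`, which keeps cells and metadata aligned.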

By the way, de.test.versus_rest is handling this in a very similar fashion! Accordingly, the subset of tests that correspond to this comparison should have the same results.

Hope that helped!

ivirshup commented 5 years ago

I have frequently wished it were easier to compute differential expression between two arbitrary groups. This particularly comes up when dealing with multiple clusterings or independent labels (e.g. you've only labeled one cell type). Right now it can be difficult, since the workflow becomes:

adata.obs["comp"] = ""
adata.obs.loc[bool_vec1, "comp"] = "a"
adata.obs.loc[bool_vec2, "comp"] = "b"
view = adata[adata.obs["comp"] != ""]
de.test.wald(
    ...
)

Just passing the boolean vectors would be more convenient. I think this is especially useful when you're labelling or deciding on clusters.
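One way to wrap the workflow above is a small helper that turns two boolean vectors into a row mask plus a two-level label. This is a sketch, not part of diffxpy; the name `label_two_groups` and its signature are hypothetical.

```python
import numpy as np
import pandas as pd


def label_two_groups(mask_a, mask_b, names=("a", "b")):
    """Return (keep, labels): a row mask selecting cells in either group,
    and a categorical group label for each selected cell."""
    mask_a = np.asarray(mask_a, dtype=bool)
    mask_b = np.asarray(mask_b, dtype=bool)
    if mask_a.shape != mask_b.shape:
        raise ValueError("masks must have the same length")
    if np.any(mask_a & mask_b):
        raise ValueError("masks overlap: a cell cannot be in both groups")
    keep = mask_a | mask_b
    labels = pd.Categorical(
        np.where(mask_a[keep], names[0], names[1]), categories=list(names)
    )
    return keep, labels


# Hypothetical masks over four cells.
keep, labels = label_two_groups(
    [True, False, False, True], [False, True, False, False]
)
print(keep.tolist())
print(list(labels))
```

With an AnnData object, `adata[keep]` would then select the cells, and `labels` could go into a one-column sample description that is tested as the covariate.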

> Differential expression analysis is usually interfaced so that the user requests to test covariates rather than groups

I think this was the most convenient approach when most of the work was fitting a statistical model to known groups. With single cell stuff I think it's become much more common to use DE as an exploratory tool for the labelling process.

davidsebfischer commented 5 years ago

Hi @ivirshup, I just found this comment, sorry for not reacting earlier. Am I correct in assuming that your remark is resolved with the pairwise and partition API that we have discussed lately (https://github.com/theislab/diffxpy/issues/108)?

ivirshup commented 5 years ago

I think the pairwise API partially fits my use case, especially when it can be done lazily. I'm not actually too sure what the partition function does. Could you point me towards some examples for that?

davidsebfischer commented 5 years ago

Yes, I think pairwise does what you need. Partition runs a test on multiple partitions of a data set, e.g. it tests the effect of condition in each cell type cluster. I'll provide a new example for that this week!
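The idea of testing within each partition can be sketched as a plain loop over groups. This does not reproduce diffxpy's partition API; `run_per_partition` is a hypothetical helper, the data are invented, and `mean_difference` is only a placeholder for a real test call (such as `de.test.wald` on the subset).

```python
import pandas as pd


def run_per_partition(obs, partition_key, test_fn):
    """Call test_fn on the rows of each partition and collect the
    results in a dict keyed by partition label."""
    return {label: test_fn(sub) for label, sub in obs.groupby(partition_key)}


# Hypothetical toy data: one gene, two cell types, two conditions.
obs = pd.DataFrame({
    "cell_type": ["neuron", "neuron", "neuron", "stem", "stem", "stem"],
    "condition": ["ctrl", "treated", "treated", "ctrl", "ctrl", "treated"],
    "gene_x": [1.0, 3.0, 5.0, 2.0, 4.0, 9.0],
})


def mean_difference(sub):
    # Placeholder for a real differential expression test on this subset.
    means = sub.groupby("condition")["gene_x"].mean()
    return float(means["treated"] - means["ctrl"])


results = run_per_partition(obs, "cell_type", mean_difference)
print(results)
```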

Benfeitas commented 2 years ago

hi @davidsebfischer and @ivirshup , thanks for the comments and discussion.

A few comments above you also mention that OP's question is addressable through versus_rest, but I am running into the same problem as the OP. So the only way I see to do it is to compare each cluster against all others in a loop.

I am also interested in comparing conditions per cluster (e.g. testing endo vs neuron per cluster, for all clusters). Did you get to post an example for partition? I couldn't find it, but perhaps I didn't look in the correct places. Again, I can test it in a loop, but I'm wondering if there's a more efficient way to do it.

Finally, I'm wondering if it is possible to get some verbose updates. In versus_rest it is mentioned that we can include kwargs for fit, but I didn't succeed in getting any updates on the progress of versus_rest, so I'm unable to understand what it is doing at any point.
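One thing that may be worth trying for progress output is Python's standard logging module. The sketch below assumes that diffxpy and its batchglm backend emit messages through loggers named "diffxpy" and "batchglm"; those names are an assumption here, and whether versus_rest reports progress at INFO level is not confirmed in this thread.

```python
import logging

# Send log records to the console with timestamps.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
)

# Assumed logger names; adjust if the installed versions use different ones.
for name in ("diffxpy", "batchglm"):
    logging.getLogger(name).setLevel(logging.INFO)
```

This needs to run before the test call so that any messages emitted during fitting are shown.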