scverse / pertpy

Perturbation Analysis in the scverse ecosystem.
https://pertpy.readthedocs.io/en/latest/
MIT License
92 stars 19 forks source link

Weird behavior of Sccoda #589

Closed ngvananh2508 closed 1 month ago

ngvananh2508 commented 1 month ago

Report

When I ran this snippet code: image It gave me the unexpected MuData: obs of coda was 'gender', not 'condition'. condition has 2 categorical variables: 'T2D' and 'non-diabetic. This dataset has four batches: 'batch 1', 'batch 2', 'batch 3', 'batch 4'.

Version information

pertpy 0.8.0

Zethson commented 1 month ago

It gave the unexpected MuData: obs of coda was 'gender', not 'condition'. condition has 2 categorical variables: 'T2D' and 'non-diabetic. This dataset has four batches: 'batch 1', 'batch 2', 'batch 3', 'batch 4'.

Sorry, I don't quite understand the problem. Did you get a traceback or what is the issue?

Thanks!

ngvananh2508 commented 1 month ago

image In your tutorial, when you assign ['condition'] to covariate_obs, coda object will have 'condition' obs. But when I followed the tutorial, I received a coda object with 'gender' obs instead of 'condition' obs. This is my adata.obs: image

Zethson commented 1 month ago

Okay, thanks. I'll look into it as soon as possible

ngvananh2508 commented 1 month ago

My Anndata object only has one value for gender obs: 'male'. I want to compare the cell composition between two groups: T2D and non-diebetic of 'condition' obs.

ngvananh2508 commented 1 month ago

Okay, thanks. I'll look into it as soon as possible

Thank you very much.

Zethson commented 1 month ago

Would it be possible for me to have your dataset or would you be able to create a minimal working example to reproduce the bug, please?

ngvananh2508 commented 1 month ago

This is the dataset. I ran these code lines: sccoda_model_draft = pt.tl.Sccoda() sccoda_data_draft = sccoda_model_draft.load( adata_1, type="cell_level", generate_sample_level=True, cell_type_identifier="celltype", sample_identifier="batch", covariate_obs=["condition"], ) sccoda_data_draft

MuData object with n_obs × n_vars = 3803 × 4013 2 modalities rna: 3799 x 4000 obs: 'gender', 'age', 'condition', 'batch', 'celltype' layers: 'counts' coda: 4 x 13 obs: 'gender' var: 'n_cells'

ngvananh2508 commented 1 month ago

I can not upload the data on here. Can I send the data to you via your email?

Zethson commented 1 month ago

Yes you can! You can find my email on my Github profile. You can also send me a link to a file hoster there.

ngvananh2508 commented 1 month ago

And by the way, this dataset composes of only four batches. Do you know the way to increase the statistical significance of each cell type in the dataset. Now I have to set the fdr equaling to 0.5, If I set it to a lower value, I will not get any information about the changes of the cell types. I mean that when I ran: sccoda_model_male.credible_effects(sccoda_data_male_final, modality_key="coda") It return all False values.

Zethson commented 1 month ago

I'll check the issue out tomorrow. On my TODO list.

And by the way, this dataset composes of only four batches. Do you know the way to increase the statistical significance of each cell type in the dataset. Now I have to set the fdr equaling to 0.5, If I set it to a lower value, I will not get any information about the changes of the cell types. I mean that when I ran: sccoda_model_male.credible_effects(sccoda_data_male_final, modality_key="coda") It return all False values.

You can consider testing healthy versus one condition separately instead of testing vs all non healthy conditions jointly. If this is not an option for your dataset, I am afraid that you may not have any significant shifts in composition. But maybe @johannesostner wants to chip in?

ngvananh2508 commented 1 month ago

No, the condition variable of the dataset only comprises two values: non-diabetic and T2D. I know that if the model uses the Gaussian distribution as null model, the dataset must contain many batches to use this assumption according to the central limit theorem.

johannesostner commented 1 month ago

Regarding significance:

Even though scCODA does not use the Gaussian as a null distribution (it's a Bayesian model, so there is no "Null model", but rather the posterior gets informed by the prior and the observations), the concept of more samples -> higher power still holds. Therefore, there is no way to increase statistical significance without violating some foundational statistical principles.

ngvananh2508 commented 1 month ago

Regarding significance:

Even though scCODA does not use the Gaussian as a null distribution (it's a Bayesian model, so there is no "Null model", but rather the posterior gets informed by the prior and the observations), the concept of more samples -> higher power still holds. Therefore, there is no way to increase statistical significance without violating some foundational statistical principles.

So scCODA can only be applied for the data which composes of many batches or can we create batches by ourselves (just randomly creating samples for the statistical test)?

johannesostner commented 1 month ago

This depends on the effect strength. For very strong effects, small sample sizes are sufficient to detect them also at lower FDR levels. For weaker effects, you will need more samples to find them with high credibility.

Randomly creating samples (e.g. through bootstrapping) is a known practice, but must be done correctly to not violate statistical principles. Nevertheless, this will not improve your statistical power if done correctly

ngvananh2508 commented 1 month ago

If I randomly assign each cell to one of a list from 'batch 1' to 'batch 10', will this technique enhance the statistical power?

Zethson commented 1 month ago

I identified the issue and pushed a fix to main. However,

adata.obs.groupby("batch").nunique() shows that your batches do not only have single conditions but several.

The expectation is:

image

But your dataset results in:

image

You identify that by a warning.

Could you please test with main?

ngvananh2508 commented 1 month ago

What do you mean "test with main"?

Zethson commented 1 month ago

@ngvananh2508 sorry for my brievity

pip install git+https://github.com/theislab/pertpy.git@main

ngvananh2508 commented 1 month ago

@ngvananh2508 sorry for my brievity

pip install git+https://github.com/theislab/pertpy.git@main

Thank you. I will test it and let you know the result ASAP.

ngvananh2508 commented 1 month ago

@ngvananh2508 sorry for my brievity

pip install git+https://github.com/theislab/pertpy.git@main

When I install this pertpy on a new env and import it, it raise this error: ModuleNotFoundError: No module named 'arviz' ModuleNotFoundError: No module named 'pydeseq2' ModuleNotFoundError: No module named 'formulaic' ModuleNotFoundError: No module named 'toytree' I can get around these issues by installing them, but I still report in case you want to modify your library

ngvananh2508 commented 1 month ago

I identified the issue and pushed a fix to main. However,

adata.obs.groupby("batch").nunique() shows that your batches do not only have single conditions but several.

The expectation is:

image

But your dataset results in:

image

You identify that by a warning.

Could you please test with main?

So you mean that each batch only consists of one condition? And is this the reason that sccoda appears weird behavior when running with my data?