theislab / scCODA

A Bayesian model for compositional single-cell data analysis
BSD 3-Clause "New" or "Revised" License

interpreting the output with the issue of compositionality. #93

Closed by Marwansha 1 month ago

Marwansha commented 4 months ago

I am trying to get my head around the compositionality issue and would like to ask two questions.

1) Using Poisson regression and the Wilcoxon rank-sum test, I see a strong effect of age on my cell-type proportions: CD8 naive cells decrease with age and CD8 effector cells increase.
scCODA only shows CD8 naive with a non-zero final parameter at 0.05 FDR; at 0.2 FDR the effect on CD8 effector cells starts to appear significant. How would you recommend approaching these results? I am trying to work out whether the change in the CD8 effector subset is a compositionality artifact caused by the decrease of the naive subset.

2) Do FACS data also have the issue of compositionality? I ask because other papers do not address the issue, but rather confirm their findings with FACS data.

Thanks a lot in advance. Best, Marwan

mbuttner commented 4 months ago

Hi Marwan, thank you for your questions. 1) We expect more "hits" with a Poisson regression and a Wilcoxon rank-sum test, because both approaches ignore the compositional nature of the data. As a consequence, both produce more false positives, as we have demonstrated in our manuscript (see Figure 2 of our publication). Whether you see a credible change in your data depends on a variety of factors, such as the number of replicates, the variability across replicates, and the abundance of the cell type of interest. In general, it is harder to detect credible changes in lowly abundant cell types. If that is the case for the CD8 effector T cells, I recommend increasing the FDR to 0.2 (as you already did). You can also check which cell type has been selected as the baseline for your comparison.
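To make the compositional effect concrete, here is a minimal sketch with made-up absolute counts (the numbers and the choice of "CD4" as baseline are assumptions for illustration, not values from this thread). Only the naive population shrinks, yet the effector proportion rises; the log-ratio against a baseline cell type that did not change stays flat.

```python
import math

# Hypothetical absolute cell counts per unit of tissue (illustrative only).
young = {"CD8 naive": 600, "CD8 effector": 200, "CD4": 200}
# Only the naive population shrinks; the other absolute counts are unchanged.
old = {"CD8 naive": 200, "CD8 effector": 200, "CD4": 200}


def proportions(counts):
    """Convert absolute counts to fractions of the sample total."""
    total = sum(counts.values())
    return {cell_type: n / total for cell_type, n in counts.items()}


def log_ratio_to_reference(counts, reference):
    """Log of each count relative to a chosen baseline cell type."""
    ref = counts[reference]
    return {cell_type: math.log(n / ref) for cell_type, n in counts.items()}


p_young, p_old = proportions(young), proportions(old)
for cell_type in young:
    # The effector proportion goes from 0.20 to 0.33 without any absolute change.
    print(f"{cell_type:13s} {p_young[cell_type]:.2f} -> {p_old[cell_type]:.2f}")

lr_young = log_ratio_to_reference(young, "CD4")
lr_old = log_ratio_to_reference(old, "CD4")
for cell_type in young:
    shift = lr_old[cell_type] - lr_young[cell_type]
    print(f"{cell_type:13s} log-ratio shift vs CD4: {shift:+.2f}")
```

A test that looks at each proportion separately would flag the effector cells here, whereas the shift relative to the baseline is zero for them and negative only for the naive cells.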

2) FACS data are slightly different from single-cell RNA-sequencing data, because we observe a strict limit in the sequencing capacity for systems like the 10X Genomics Chromium platform, which is most widely used. Therefore, we take the total number of sequenced cells per sample into account as a parameter in the scCODA model. In contrast, FACS allows for a more flexible number of measured cells. Because of this flexible limit per measurement, using scCODA on FACS data might be ill-advised (that is, we have neither tested it on nor designed the model for this data type). In general, when we examine cell proportions in FACS data, we are still looking at compositional data. This would also require specifically tailored compositional models instead of the widely used t-test. The modeling of proportions in FACS has been covered in the literature in a variety of approaches, so we did not address it in our work. Moreover, in the example of the supercentenarian data, the difference is so strong that it can be recovered by any test.
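As a small illustration of why FACS proportions remain compositional even when the number of measured events varies, the sketch below applies the centered log-ratio (CLR) transform, a standard tool from compositional data analysis. CLR is not part of scCODA and is not the approach recommended above; the event counts are invented for the example.

```python
import math


def to_proportions(counts):
    """Normalize raw event counts to fractions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def clr(proportions):
    """Centered log-ratio: log of each part minus the mean log of all parts."""
    if any(p <= 0 for p in proportions):
        raise ValueError("CLR needs strictly positive proportions")
    logs = [math.log(p) for p in proportions]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Two hypothetical FACS runs of the same sample with very different event totals.
small_run = [500, 300, 200]
large_run = [50_000, 30_000, 20_000]

clr_small = clr(to_proportions(small_run))
clr_large = clr(to_proportions(large_run))

# Both runs carry the same relative information, so the CLR values agree.
print([round(v, 3) for v in clr_small])
print([round(v, 3) for v in clr_large])
```

Once counts are turned into proportions, only the ratios between cell types are left, regardless of how many events were acquired, which is why a compositional model is still needed.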

I hope that helps! Best, Maren

FionaL720 commented 2 months ago

Hi Marwan, I'm also trying to apply scCODA to my scRNA-seq analyses and have some questions regarding interpreting the output. It turns out that the effect for my cell clusters starts to appear at 0.27 FDR. Given that the sample size is small (n=5), what should be a reasonable or acceptable FDR? Best, Fiona

mbuttner commented 2 months ago

Hi @FionaL720

thank you for sharing your issue. I have some questions: Do you have n=5 samples in total or per condition? What is the overall abundance of your clusters? How many clusters do you have in your data? These factors strongly influence the effect size of the abundance test.

Best, Maren

FionaL720 commented 2 months ago

Hi Maren @mbuttner

Thank you so much for your reply! I have n=5 per condition and 2 conditions in total. The average abundance of my cluster is about 0.04 and 0.14 in condition 1 and condition 2, respectively. I have 13 clusters in my data. Thanks again for your kind help!

Best, Fiona

mbuttner commented 2 months ago

Hi Fiona, OK, that sounds reasonably powered overall. Here's what I would adjust in scCODA to improve the FDR:

  1. Check the selected reference cell type. Every compositional abundance method has a reference cell type, which represents the baseline, i.e. what "no change in abundance" looks like in the model. You can manually set it to a specific cluster or, if you are unsure, scan over all clusters. We observed that the credible changes in abundance are similar when different "non-changing" cell types are selected. That means the model is robust to the choice of reference as long as a "non-changing" cell type is selected, but the reference cell type should itself be sufficiently abundant. The automatic reference selection should take this into account, but it doesn't hurt to play around here.
  2. Increase the chain length for parameter estimation. With 13 cell types, consider increasing the length of the MCMC/HMC chains used for the parameter inference. By default, one uses 50,000 iterations, but you could increase this to 100,000 in your setting. In theory, you can run even longer chains, but I don't expect any change in credibility after that.
  3. Inspect the chains for artifacts. Every scCODA run returns the acceptance rate of the parameter inference. As a rule of thumb, it should be above 50%. If it is somewhere around 50%, you can further inspect the MCMC/HMC chains and visualize the parameter distributions with the arviz trace plots, as shown in our analysis tutorial. If you observe that your MCMC/HMC chain has "flat-lined" during the inference (we refer to this phenomenon as "the chain got stuck"), it means that a large number of parameters that contributed to the inference remained at 0 when in fact they should be different from 0. If you observe such "flat lines", please rerun the model.
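Besides looking at the trace plots, a flat-lined chain can also be spotted numerically. The sketch below is a hypothetical helper, not part of scCODA: it assumes the draws of one parameter have already been extracted from the posterior as a plain list of floats and reports the longest stretch over which the value does not move. The simulated chains are made up for the example.

```python
import random


def longest_flat_run(chain, tol=0.0):
    """Length of the longest stretch of consecutive draws that do not change."""
    if not chain:
        return 0
    longest = current = 1
    for previous, value in zip(chain, chain[1:]):
        if abs(value - previous) <= tol:
            current += 1
            longest = max(longest, current)
        else:
            current = 1
    return longest


# Simulated draws for illustration: one chain that keeps moving,
# and one that gets stuck at 0 after the first 50 draws.
rng = random.Random(0)
mixing_chain = [rng.gauss(0.5, 0.1) for _ in range(200)]
stuck_chain = mixing_chain[:50] + [0.0] * 150

for name, chain in [("mixing", mixing_chain), ("stuck", stuck_chain)]:
    share = longest_flat_run(chain) / len(chain)
    print(f"{name}: longest flat run covers {share:.0%} of the chain")
```

What share of the chain counts as "stuck" is a judgment call; a flat run that spans most of the draws is the pattern that would warrant rerunning the model.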

I hope this helps! Best, Maren

FionaL720 commented 2 months ago

Hi Maren, Thank you for the invaluable suggestions!! It really helps a lot. I'll try these methods. Best, Fiona