theislab / scCODA

A Bayesian model for compositional single-cell data analysis
BSD 3-Clause "New" or "Revised" License

Optimal tuning of parameters for estimation #33

Closed (fbrundu closed this issue 2 years ago)

fbrundu commented 3 years ago

Hi,

I am running a small chain to infer cell type contributions in a dataset composed of 8 samples, with 14 cell types. I get the following summary:

Compositional Analysis summary (extended):

Data: 8 samples, 14 cell types
Reference index: 7
Formula: C(condition, Treatment("WT"))
Spike-and-slab threshold: 0.642

MCMC Sampling: Sampled 20000 chain states (5000 burnin samples) in 128.565 sec. Acceptance rate: 57.8%

where the acceptance rate is 57.8%. How can I make sure that I explored the posterior distribution correctly and that I don't need a longer chain? Is an acceptance rate of 57.8% in line with a correct estimation, or are there additional parameters that need to be tuned for this model?

Thanks! Francesco

johannesostner commented 3 years ago

Hello!

An acceptance rate of 57.8% is in line with what we usually see in a converged chain that explores the parameter space well. From our experience, a non-converged chain usually has an acceptance rate of less than 35%. Keep in mind that this is only a heuristic, but it is usually enough.

You can also visually assess the inference quality with traceplots, as shown in our advanced tutorial, section "Diagnostics and plotting". The result object you receive after running sample_hmc also supports all other functionality from arviz, which you may find useful.
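
The acceptance rate reported in the summary is the fraction of proposed states that the sampler accepted. As a rough illustration only, the sketch below runs a toy random-walk Metropolis sampler on a standard normal target and compares its acceptance rate against the 35% heuristic mentioned above. This is not scCODA's HMC sampler; the function `metropolis`, the step size, and the target density are assumptions made for this example, and a threshold drawn from experience with scCODA does not necessarily transfer to other samplers.

```python
import math
import random


def metropolis(log_density, n_samples, step, seed=0):
    """Toy random-walk Metropolis sampler (illustrative, not scCODA's HMC)."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    accepted = 0
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        # Accept with probability min(1, p(proposal) / p(x)).
        log_ratio = log_density(proposal) - log_density(x)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
            accepted += 1
        samples.append(x)
    return samples, accepted / n_samples


def standard_normal_logpdf(x):
    """Log density of N(0, 1), up to an additive constant."""
    return -0.5 * x * x


# Heuristic threshold quoted in this thread; an assumption outside scCODA.
LOW_ACCEPTANCE = 0.35

samples, rate = metropolis(standard_normal_logpdf, n_samples=20000, step=1.0)
print(f"Acceptance rate: {rate:.1%}")
if rate < LOW_ACCEPTANCE:
    print("Low acceptance rate: inspect traceplots before trusting the result.")
```

The acceptance rate alone does not prove convergence, which is why it is paired here with a reminder to look at the traceplots.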

fbrundu commented 3 years ago

Thanks for your reply. Do you think future versions of scCODA will provide additional diagnostics, like posterior predictive checks, or more elaborate tutorials on diagnostics, to make sure that the chains have converged and the estimates can be trusted? I think it might help, since some of the people who will use it might not fully understand the details and implications of this model. In particular, I noticed that a previous run (with the same parameters but a different chain length) gave slightly different results regarding a credible effect for a specific cell type (I haven't been able to reproduce it consistently, unfortunately). It's totally fine if you don't plan to, but in any case, thank you for this work!

johannesostner commented 3 years ago

Thanks for these suggestions!

We plan to continuously develop and improve scCODA even after publication, and model diagnostics are one point that we want to address in the future. Most diagnostic tools (like R-hat) are only really meaningful when running multiple MCMC chains, so we will look into diagnostics again once we get around to that.
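
To show why R-hat needs several chains, here is a minimal sketch of the classic (non-split) Gelman-Rubin statistic for one scalar parameter. It compares the variance between chain means with the average variance within chains, so it cannot be computed from a single chain. The function name `gelman_rubin_rhat` and the synthetic chains are assumptions for illustration; they are not part of scCODA. arviz also provides an `rhat` function, which may be the more practical choice once multiple chains are available.

```python
import math
import random
import statistics


def gelman_rubin_rhat(chains):
    """Classic Gelman-Rubin R-hat for equal-length chains of one parameter."""
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least 2 chains of equal length >= 2")
    chain_means = [statistics.fmean(c) for c in chains]
    # W: average within-chain sample variance.
    within = statistics.fmean([statistics.variance(c) for c in chains])
    # B: between-chain variance, scaled by the chain length.
    between = n * statistics.variance(chain_means)
    # Pooled estimate of the posterior variance.
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


rng = random.Random(0)

# Four synthetic chains that all sample the same N(0, 1) distribution.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]

# Same setup, but the last chain is stuck around a different mode.
stuck = [list(c) for c in mixed[:3]]
stuck.append([rng.gauss(3.0, 1.0) for _ in range(1000)])

rhat_mixed = gelman_rubin_rhat(mixed)
rhat_stuck = gelman_rubin_rhat(stuck)
print(f"R-hat, well-mixed chains: {rhat_mixed:.3f}")
print(f"R-hat, one stuck chain:   {rhat_stuck:.3f}")
```

Values close to 1 suggest the chains agree with each other; values clearly above 1 indicate that at least one chain is exploring a different region.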