sandhya212 / BISCUIT_SingleCell_IMM_ICML_2016

R Codebase for BISCUIT: Infinite Mixture Model to cluster and impute single cells.

Identifiability constraints and complexity #7

Closed pedrofale closed 6 years ago

pedrofale commented 6 years ago

I am trying to replicate the experiments done in section 6.1 of your paper.

  1. I have a question about the identifiability constraints you place on the alpha and beta scaling parameters: how do you decide the maximum and minimum values to impose on those parameters for synthetic data? I see that in your code, for real gene expression data, you use the library size to make this decision. But what about synthetic data generated according to your model?

  2. What was the run time for fitting BISCUIT and HDPMM to the generated data with 100 samples, 50 dimensions and 3 components? Did you enforce some kind of regularization on the covariance matrices to reduce complexity, like making them diagonal? Inverting high-dimensional matrices takes quite a while...

  3. In your code I don't see how you are using the Hierarchical Conditionally Conjugate DPMM model with cell-specific scalings. Where are the hyperparameter updates of psi, mu_prime, Sigma_prime, H_prime and sigma_prime, as in the plate model in Figure 4?

sandhya212 commented 6 years ago

Hi Pedro,

  1. Synthetic data: We generate multivariate Gaussian data using known means, covariances, alphas and betas. Then, for inferring the alphas and betas, we rely again on the library size (as in real-world data) and check whether we recover the values we started with (see the sketch after this list).

  2. In the synthetic experiments, where the covariances were 50x50, I did not enforce any constraints other than that they must be invertible. Once the covariance dimensions increase, then yes, we hit that problem. HDPMM was faster (~10 minutes) than BISCUIT, since BISCUIT also had to infer the alphas and betas, but BISCUIT gave better results.

  3. For the code we currently have on GitHub, I had to remove the hyperparameter layer to cope with high-dimensionality issues. We could add it back in; it would just take much longer to converge.
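
For concreteness, here is a minimal sketch of the simulation in point 1. This is illustrative only, not the repository code: it assumes the generative form x_j ~ N(alpha_j * mu_k, beta_j * Sigma_k) and uses the total signal per cell as a library-size proxy; the particular bound rule and parameter values are made up for the example.

```r
# Minimal sketch, not the BISCUIT code: simulate clustered Gaussian data with
# per-cell scalings and derive library-size-based bounds for alpha.
library(MASS)  # mvrnorm

set.seed(1)
D <- 50; N <- 100; K <- 3                         # genes, cells, clusters
z <- sample(1:K, N, replace = TRUE)               # cluster assignments
mu    <- lapply(1:K, function(k) runif(D, 2, 6))  # known cluster means
Sigma <- lapply(1:K, function(k) diag(runif(D, 0.5, 1)))  # known covariances

alpha <- rgamma(N, shape = 10, rate = 10)         # per-cell mean scalings
beta  <- rgamma(N, shape = 20, rate = 20)         # per-cell covariance scalings

X <- t(sapply(1:N, function(j)
  mvrnorm(1, alpha[j] * mu[[z[j]]], beta[j] * Sigma[[z[j]]])))

# Library-size proxy: total signal per cell, normalised by its median.
# The spread of this quantity can serve as the identifiability bounds on alpha.
lib_size  <- rowSums(X)
alpha_hat <- lib_size / median(lib_size)
c(min_alpha = min(alpha_hat), max_alpha = max(alpha_hat))
```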

pedrofale commented 6 years ago

Dear Sandhya,

Thank you for your response! Your work is crucial for my MSc thesis, so I am attempting to implement everything from scratch to make sure I understand all the characteristics of BISCUIT.

  1. Regarding the identifiability constraints: are the conditions you state in Theorem 2 of section 4.1 of the paper, concerning mu_k and beta_j, never used? I can't find them in your code -- why did you not use those conditions and instead resort to the implemented ones?

  2. Thanks!

  3. Removing the hyperparameter layer updates has implications for the posterior distributions of the parameters, right? Did you still use the Conditionally Conjugate model, or did you instead use the basic fully conjugate model with an NIW prior on the means and covariances? And is this the configuration you used to get the results presented in the manuscript?

Thank you for your time!

sandhya212 commented 6 years ago

No worries. The ICML paper detailed a working case of 3000 cells and ~500 genes. Once we increase the number of cells or genes, the mixture model struggles or breaks. The code you find on GitHub is tailored for such high-dimensional situations and was written after ICML. Currently the model does not have the hyperparameter layer (as I mentioned, you can always add it back, but the trade-off is a slower model) and uses a fully conjugate model with an NIW prior. We still get the results presented in the ICML paper.
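
For reference, this is the standard fully conjugate Normal-inverse-Wishart update for one mixture component. It is a minimal sketch only; the hyperparameter names (mu0, kappa0, nu0, Lambda0) are generic and do not correspond to variables in the repository.

```r
# Minimal sketch of a standard NIW conjugate update for one cluster's cells.
# Variable names are generic; this is not the code from the repository.
niw_posterior <- function(X, mu0, kappa0, nu0, Lambda0) {
  n    <- nrow(X)
  xbar <- colMeans(X)
  S    <- crossprod(sweep(X, 2, xbar))            # scatter about the mean
  kappa_n  <- kappa0 + n
  nu_n     <- nu0 + n
  mu_n     <- (kappa0 * mu0 + n * xbar) / kappa_n
  Lambda_n <- Lambda0 + S + (kappa0 * n / kappa_n) * tcrossprod(xbar - mu0)
  list(mu = mu_n, kappa = kappa_n, nu = nu_n, Lambda = Lambda_n)
}

# Example: posterior for the cells currently assigned to cluster 1,
# under a weak prior (values chosen only for illustration).
# post <- niw_posterior(X[z == 1, , drop = FALSE],
#                       mu0 = rep(0, ncol(X)), kappa0 = 0.01,
#                       nu0 = ncol(X) + 2, Lambda0 = diag(ncol(X)))
```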

pedrofale commented 6 years ago

Thank you! I'll get back to you if I have further questions. Your insight has been valuable.

pedrofale commented 6 years ago

Hello Sandhya,

I am trying to replicate the experiments with synthetic data. You said earlier that to estimate the alphas and betas you rely on the library sizes even for synthetic data. I am doing this and I don't get good estimates for either alpha or beta. Could you provide a code snippet for this simple task? Thanks.

PS: I have a working example of generating data with the HDPMM and estimating its parameters. The problem I am having with BISCUIT is definitely in estimating alphas and betas.

pedrofale commented 6 years ago

Hello Sandhya,

I still haven't been able to correctly estimate the parameters from data generated according to Appendix A in the supplementary material of the ICML paper. I am using the posterior expressions derived in Appendix C. I am generating alphas and betas according to pre-defined hyperparameters for their prior distributions, and I am initializing inference with those same hyperparameters. Is there any assumption I need to make when generating the data? It's really strange that I can't estimate the correct cell-specific parameters. In particular, I am fixing the betas to 1 and only trying to estimate the alphas (the data was also generated with betas = 1), and not once has the algorithm converged to the right mean, covariance and alpha values.

Without alphas and betas it works fine though.

Hints or code would be much appreciated!

Pedro

sandhya212 commented 6 years ago

Hi Pedro,

Sorry for the delay in getting back. In case you still need help, please find some hints below. How have you simulated the data? Simulate alphas and betas for each cell within a cluster, but with some variance among the alphas (and betas) within that cluster; this is the real-world setting we aim to capture. You may also run the sampler for a few iterations without the alphas and betas, just so that the cluster moments are approximated (i.e. not a cold start), before injecting the contribution of the alphas and betas.
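
A minimal sketch of the simulation and warm-start scheme hinted at above; the distributions, values and loop structure are illustrative assumptions, not the repository's choices.

```r
# Minimal sketch: per-cell alphas/betas centred on a cluster-level value with
# some within-cluster spread, plus a warm start before the scalings are used.
set.seed(2)
K <- 3; N <- 100
z <- sample(1:K, N, replace = TRUE)

alpha_centre <- runif(K, 0.8, 1.2)   # cluster-level alpha centres
beta_centre  <- runif(K, 0.8, 1.2)   # cluster-level beta centres

alpha <- rnorm(N, mean = alpha_centre[z], sd = 0.05)   # within-cluster spread
beta  <- abs(rnorm(N, mean = beta_centre[z], sd = 0.05))

# Warm start: run the first few Gibbs iterations with alpha = beta = 1 so the
# cluster moments settle, then switch the cell-specific scalings on.
# n_iter <- 500; n_warmup <- 50
# for (iter in 1:n_iter) {
#   a <- if (iter > n_warmup) alpha else rep(1, N)
#   b <- if (iter > n_warmup) beta  else rep(1, N)
#   # ... Gibbs updates for z, mu_k, Sigma_k (and alpha, beta once enabled) ...
# }
```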