probcomp / crosscat

A domain-general, Bayesian method for analyzing high-dimensional data tables
http://probcomp.csail.mit.edu/crosscat/
Apache License 2.0

How to sample from the posterior #112

Open kayhan-batmanghelich opened 7 years ago

kayhan-batmanghelich commented 7 years ago

Hi,

I was wondering how to sample the posterior partitioning of the rows/columns (cluster assignments). Is this feature supported? If not, is there any way to do that? Theoretically it should be possible, but I couldn't find anything in the documentation. It seems it was possible in the MATLAB version, for example here:

https://github.com/probcomp/crosscat/blob/2de0192bbbc7118a947d1307724336aab6465945/legacy/crosscat_matlab/runModel.m#L274

Thanks,

fsaad commented 7 years ago

Hi there,

There are a few ways to access the posterior row and column partitions. The easiest way is to use the X_L and X_D dictionaries that summarize a particular CrossCat latent state (which is one posterior sample). These two dictionaries are returned by Engine.initialize and Engine.analyze, shown below for the EngineTemplate base class:

https://github.com/probcomp/crosscat/blob/master/src/EngineTemplate.py#L31 https://github.com/probcomp/crosscat/blob/master/src/EngineTemplate.py#L39
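
For a single sample, the partitions can be read directly off those dictionaries. Here is a minimal sketch, assuming the usual X_L/X_D state layout (X_L['column_partition']['assignments'] mapping columns to views, and X_D[v] listing the row-cluster assignment of every row in view v); please double-check the key names against your version:

```python
# Sketch: inspect the row/column partitions of one CrossCat sample (X_L, X_D).

# Column partition: column_to_view[c] is the index of the view containing column c.
column_to_view = X_L['column_partition']['assignments']

# Row partitions: X_D[v][r] is the row-cluster index of row r within view v.
for view_idx, row_assignments in enumerate(X_D):
    columns_in_view = [c for c, v in enumerate(column_to_view) if v == view_idx]
    n_row_clusters = len(set(row_assignments))
    print "view %d: columns %s, %d row clusters" % (
        view_idx, columns_in_view, n_row_clusters)
```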

The MATLAB implementation is quite deprecated.

Please let us know if you have more questions.

kayhan-batmanghelich commented 7 years ago

@fsaad

Thank you for your reply. I have a few questions:

Thanks,

fsaad commented 7 years ago

The lengths of X_L and X_D equal the number of chains, but the chains are not drawn from the same model because the hyper-parameters are different, correct? How do you specify the hyper-parameters?

Each chain is initialized by forward sampling all the latent variables that comprise the CrossCat prior. There are many such latent variables: the partition of columns into views (together with its CRP concentration parameter), the partition of rows into clusters within each view (each with its own CRP concentration parameter), and the per-column component-model hyperparameters.

To keep sampling from the posterior, do we need to rerun the model from the prior?

Each chain begins as an independent realization of all the latent variables from the prior. That is what happens when you call X_L, X_D = engine.initialize(M_c, M_r, T, seed).

You can sample from the posterior by using engine.analyze, which takes in X_L and X_D, runs posterior inference, and returns X_L_new and X_D_new. You can then pick up inference from where it left off by calling engine.analyze again, providing it with X_L_new and X_D_new (which are not prior samples, so the analysis does not start from scratch). You can find an example of this workflow here:

https://github.com/probcomp/crosscat/blob/master/src/tests/crosscat_client_runner.py#L41-L44
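
Concretely, the pattern in that example looks roughly like this (a sketch only; follow the linked runner for the exact call signatures, since constructor and seed handling may differ slightly across versions):

```python
from crosscat.LocalEngine import LocalEngine

# M_c, M_r, T: column metadata, row metadata, and data table, prepared as in
# the linked example.
engine = LocalEngine()
seed = 0

# One independent prior realization of the latent state (one chain).
X_L, X_D = engine.initialize(M_c, M_r, T, seed)

# Posterior MCMC transitions starting from that prior sample.
X_L_new, X_D_new = engine.analyze(M_c, T, X_L, X_D, n_steps=100)

# Continue the same chain: feed the current state back in instead of
# re-initializing from the prior.
X_L_new, X_D_new = engine.analyze(M_c, T, X_L_new, X_D_new, n_steps=100)
```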

kayhan-batmanghelich commented 7 years ago

@fsaad

Thanks for the quick reply.

print "size of the chain : ",  len(X_L_list2)
print "X_L_list2[0]['column_hypers'] :" , len(X_L_list2[0]['column_hypers'])
print "len(X_L_list2[0]['column_hypers']) :" , len(X_L_list2[0]['column_hypers'])
print "X_L_list2[0]['column_hypers'][-1] :" , X_L_list2[0]['column_hypers'][-1]
print "X_L_list2[1]['column_hypers'][-1] :" , X_L_list2[1]['column_hypers'][-1]
size of the chain :  25
X_L_list2[0]['column_hypers'] : 64
len(X_L_list2[0]['column_hypers']) : 64
X_L_list2[0]['column_hypers'][-1] : {'mu': 78.60333333333332, 's': 401.091946721303, 'r': 1.0, 'fixed': 0.0, 'nu': 17.521415467935228}
X_L_list2[1]['column_hypers'][-1] : {'mu': 78.60333333333332, 's': 117.4655749680299, 'r': 0.6826384947222036, 'fixed': 0.0, 'nu': 8.164897511055578}

Do you see that s, r, and nu are different for chain 0 and chain 1? Are those samples from distribution on the top of the column_hypers? If so, where do you specify the hyper-parameters of the column_hypers? I don't see that in the API.

Thank you,

fsaad commented 7 years ago

Do you see that s, r, and nu are different for chain 0 and chain 1? Are those samples from distribution on the top of the column_hypers?

Yes, each chain will typically, but not necessarily, have different values for the column hyperparameters. I do not quite understand the second question, in particular the phrase "from distribution on the top of the column_hypers".

How do I choose one of those chains to draw more posterior samples from?

There are at least two mechanisms for obtaining, say, K samples from MCMC inference algorithms: (i) run K independent chains and take the final state of each, or (ii) run a single long chain and thin it, keeping every N-th state after burn-in.

Your question seems to suggest a hybrid of these two, that is, you wish to select one of the K independent chains and then thin it to get a collection of uncorrelated samples. For now, you can just take the hyperparameters from each chain and treat those as your bundle of posterior samples.
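
A minimal sketch of that bundle-of-chains approach (one prior initialization plus one run of inference per chain; seed handling is shown schematically and may differ in your engine):

```python
n_chains = 25   # number of independent chains
n_steps = 200   # MCMC transitions per chain

X_L_list, X_D_list = [], []
for chain in range(n_chains):
    # Independent prior realization for this chain.
    X_L, X_D = engine.initialize(M_c, M_r, T, chain)
    # Posterior inference for this chain.
    X_L, X_D = engine.analyze(M_c, T, X_L, X_D, n_steps=n_steps)
    X_L_list.append(X_L)
    X_D_list.append(X_D)

# Treat (X_L_list[k], X_D_list[k]), k = 0..n_chains-1, as the bundle of
# (approximate) posterior samples.
```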

If you really wish to select a single chain, in the current framework of independent MCMC inference I am not aware of any formally correct techniques to do so. One approach that would be theoretically sound is to implement an SMC scheme for CrossCat using, say, K particles, anneal in the data one observation at a time, and then re-sample (with replacement) the chains at each step after some intermediate amount of Gibbs sampling. However, this version of CrossCat does not have this inference algorithm, and it could be challenging to implement.

As for selecting a single chain using non-formal techniques, there are heuristics galore. Some approaches might be to consider the log-score; to keep a held-out dataset and study the predictive likelihood; to investigate the dependencies found by each chain and select the one which best matches your domain knowledge or expectations; etc.

If I understand it correctly, the example shows that to sample new X_L we don't need to make n_steps high, right? Unless there is some internal mixing process.

The example that I linked to is purely illustrative, to show how to use the software API. It makes no assumptions or guarantees about inference quality or the number of steps you need to run. Typically that will be determined by the data analysis application at hand.

kayhan-batmanghelich commented 7 years ago

@fsaad Thanks.

I am trying to sample 10K partitions from the posterior distribution. To give a rough idea of how long it takes for my problem: about 10 hours for 200 transitions. Therefore, running 10K chains is infeasible. This is the strategy I am thinking of:

I am still struggling to generate one sample after 1K transitions, because the job has been running for almost 3 days and has not finished yet. Two questions:

Thanks,

fsaad commented 7 years ago

it takes about 10hr for 200 transitions

My sense is that you might be using a dataset size which is larger than the target of this software --- how many rows and columns are you trying to learn?

There is also max_iterations; I couldn't find an explanation of this option.

I believe max_iterations is not used by the system. You can try max_time instead, which will terminate inference after the given number of seconds.
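
For example (a sketch; this assumes max_time is passed to analyze as a keyword argument alongside n_steps, measured in seconds):

```python
# Stop after roughly one hour of inference, even if n_steps has not been reached.
X_L_new, X_D_new = engine.analyze(M_c, T, X_L, X_D, n_steps=10000, max_time=3600)
```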

Is there any way to see the trace plot? Perhaps 10K transitions is too much.

If you run LocalEngine.analyze with do_diagnostics=True then the latent variables at each step will be saved in a dictionary. The return value of analyze will then be a 3-tuple of the form: X_L_new, X_D_new, diagnostics_new.
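
A sketch of using that for a trace plot (assuming the diagnostics dictionary exposes a per-step 'logscore' entry, as referenced further down in this thread; check the exact keys and array shapes in your version):

```python
import matplotlib.pyplot as plt

X_L_new, X_D_new, diagnostics = engine.analyze(
    M_c, T, X_L, X_D, n_steps=200, do_diagnostics=True)

# One recorded value per MCMC step (per chain); plot the logscore trace to
# eyeball mixing and decide whether more transitions are worthwhile.
plt.plot(diagnostics['logscore'])
plt.xlabel('MCMC step')
plt.ylabel('logscore')
plt.show()
```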

kayhan-batmanghelich commented 7 years ago

Hi @fsaad

Thank you for your reply. Actually, the matrix is not that large: it is 19992x105. Based on what I read in the documentation, it is roughly of the order that CrossCat can handle, but if you think it is not appropriate or too big, please let me know.

Best,

kayhan-batmanghelich commented 7 years ago

Thanks @fsaad, diagnostics_new is exactly what I want. The logscore can be used to monitor the quality of the chain (like an ELBO).

kayhan-batmanghelich commented 7 years ago

Hi @fsaad

It seems that max_time is respected in LocalEngine but not in MultiprocessingEngine. Am I missing something?

fsaad commented 7 years ago

@kayhan-batmanghelich I'm not immediately sure why MultiprocessingEngine is not respecting max_time; please open a separate ticket so we can track the issue, and let me know if it is OK to close this one.

kayhan-batmanghelich commented 7 years ago

@fsaad

Sure, I will open a new ticket. Please close this one.

Thanks