kayhan-batmanghelich opened this issue 7 years ago
Hi there,

There are a few ways to access the posterior row and column partitions. The easiest way is to use the `X_L` and `X_D` dictionaries that summarize a particular CrossCat latent state (which is one posterior sample). These two dictionaries are returned by `Engine.initialize` and `Engine.analyze`, shown below for the `EngineTemplate` base class:

https://github.com/probcomp/crosscat/blob/master/src/EngineTemplate.py#L31
https://github.com/probcomp/crosscat/blob/master/src/EngineTemplate.py#L39

`X_L['column_partition']['assignments']` contains the column partition, and `X_D[view_idx]` contains the row partition in the given view. The MATLAB implementation is quite deprecated.
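As a hedged illustration of the access pattern just described (the values of `X_L` and `X_D` below are made up; only the fields named above are assumed):

```python
# One posterior sample, with made-up assignments for a 5-column table.
X_L = {'column_partition': {'assignments': [0, 0, 1, 1, 0]}}
X_D = [[0, 0, 1, 0], [0, 1, 1, 2]]  # one row-partition list per view

# Column partition: column index -> view index.
col_assignments = X_L['column_partition']['assignments']

# Group the columns belonging to each view.
views = {}
for col, view in enumerate(col_assignments):
    views.setdefault(view, []).append(col)

# Row partition within a particular view.
view_idx = 1
row_assignments = X_D[view_idx]

print(views)            # {0: [0, 1, 4], 1: [2, 3]}
print(row_assignments)  # [0, 1, 1, 2]
```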
Please let us know if you have more questions.
@fsaad
Thank you for your reply. I have a few questions:

1. The lengths of `X_L` and `X_D` equal the number of chains, but the chains are not drawn from the same models because the hyper-parameters are different, correct? How do you specify the hyper-parameters?
2. `X_L` and `X_D` contain one posterior sample. To keep sampling from the posterior, do we need to rerun the model from the prior? I don't see any `from_the_posterior` in `engine.initialize`; it only has a `from_the_prior` option. Sampling from scratch would be time-consuming.
3. I would like to use crosscat inside of another model. Is there any way?

Thanks,
> Length of the X_L and X_D equal the number of chains but the chains are not drawn from the same models because the hyper-parameters are different, correct? How do you specify the hyper-parameters?
Each chain is initialized by forward sampling all the latent variables that comprise the CrossCat prior; these include the column partition, the per-view row partitions, and the associated hyperparameters.
> To keep sampling from the posterior, do we need to rerun the model from the prior?
Each chain begins as an independent realization of all the latent variables from the prior. That is what happens when you call `X_L, X_D = engine.initialize(M_c, M_r, T, seed)`.

You can sample from the posterior by using `engine.analyze`, which takes in `X_L` and `X_D`, runs posterior inference, and returns `X_L_new` and `X_D_new`. You can then pick up inference from where it left off by calling `engine.analyze` again, providing it with `X_L_new` and `X_D_new` (which are not prior samples, so analysis does not start from scratch). You can find an example of this workflow here:
https://github.com/probcomp/crosscat/blob/master/src/tests/crosscat_client_runner.py#L41-L44
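To see why restarting is unnecessary, here is a self-contained toy (not the crosscat API; `mcmc_step`, `analyze`, and the target distribution are placeholders) showing that resuming a Markov chain from its saved state is equivalent to one longer run:

```python
import math
import random

def mcmc_step(state, rng):
    # Random-walk Metropolis step on a standard normal target.
    proposal = state + rng.uniform(-1.0, 1.0)
    accept_prob = min(1.0, math.exp((state**2 - proposal**2) / 2.0))
    return proposal if rng.random() < accept_prob else state

def analyze(state, n_steps, rng):
    # Stand-in for engine.analyze: advance the chain n_steps transitions.
    for _ in range(n_steps):
        state = mcmc_step(state, rng)
    return state

rng = random.Random(0)
state = rng.gauss(0.0, 1.0)       # "initialize": a draw from the prior
state = analyze(state, 100, rng)  # first round of inference
state = analyze(state, 100, rng)  # resumes exactly where it left off
```

Two calls of 100 steps traverse the same states as a single call of 200 steps, which is why passing `X_L_new` and `X_D_new` back into `analyze` does not restart inference.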
@fsaad
Thanks for the quick reply.
Regarding `hyper_param`, I was referring to this:

```python
print "size of the chain : ", len(X_L_list2)
print "X_L_list2[0]['column_hypers'] :", len(X_L_list2[0]['column_hypers'])
print "len(X_L_list2[0]['column_hypers']) :", len(X_L_list2[0]['column_hypers'])
print "X_L_list2[0]['column_hypers'][-1] :", X_L_list2[0]['column_hypers'][-1]
print "X_L_list2[1]['column_hypers'][-1] :", X_L_list2[1]['column_hypers'][-1]
```

```
size of the chain : 25
X_L_list2[0]['column_hypers'] : 64
len(X_L_list2[0]['column_hypers']) : 64
X_L_list2[0]['column_hypers'][-1] : {'mu': 78.60333333333332, 's': 401.091946721303, 'r': 1.0, 'fixed': 0.0, 'nu': 17.521415467935228}
X_L_list2[1]['column_hypers'][-1] : {'mu': 78.60333333333332, 's': 117.4655749680299, 'r': 0.6826384947222036, 'fixed': 0.0, 'nu': 8.164897511055578}
```
Do you see that `s`, `r`, and `nu` are different for chain 0 and chain 1? Are those samples from a distribution on top of the `column_hypers`? If so, where do you specify the hyper-params of the `column_hypers`? I don't see that in the API.
How do you choose one of those chains to draw more posterior samples from? In variational Bayes, we use the Evidence Lower Bound (ELBO) and usually choose the run with the highest ELBO. For MCMC, we usually eyeball the trace plot of the chain to see whether it has converged. How do you suggest selecting one of the chains?
If I understand it correctly, the example shows that to sample a new `X_L` we don't need to make `n_steps` high, right? Unless there is some internal mixing process.

Thank you,
> Do you see s and r and nu are different for chain 0 and chain 1? Are those samples from distribution on the top of the column_hypers?

Yes, each chain will typically, but not necessarily, have different values for column hyperparameters. I do not quite understand the second question, in particular the phrase "from distribution on the top of the `column_hypers`".
> How to choose one of those chains to draw more posterior samples from?
There are at least two mechanisms for obtaining, say, K samples from MCMC inference algorithms: (i) run K independent chains and keep the final sample from each; or (ii) run one long chain and thin it, keeping every m-th sample after burn-in.
Your question seems to suggest a hybrid of these two: that is, you wish to select one of the K independent chains, and then thin it to get a collection of uncorrelated samples. Note that you can just take the hyperparameters from each chain, and treat those as your bundle of posterior samples.
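A minimal sketch of the two mechanisms, with a placeholder transition standing in for a CrossCat Gibbs sweep (none of these names come from the crosscat API):

```python
import random

def step(state, rng):
    # Placeholder MCMC transition (random walk).
    return state + rng.uniform(-1.0, 1.0)

K, burn, thin = 5, 100, 20

# (i) K independent chains: keep the final state of each.
independent = []
for seed in range(K):
    rng = random.Random(seed)
    state = 0.0
    for _ in range(burn):
        state = step(state, rng)
    independent.append(state)

# (ii) one long chain: discard burn-in, then keep every thin-th state.
rng = random.Random(42)
state, thinned = 0.0, []
for i in range(burn + K * thin):
    state = step(state, rng)
    if i >= burn and (i - burn) % thin == 0:
        thinned.append(state)

print(len(independent), len(thinned))  # 5 5
```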
If you really wish to select a single chain, in the current framework of independent MCMC inference I am not aware of any formally correct techniques to do so. One approach which would be theoretically sound is to implement an SMC scheme for CrossCat using, say, K particles, anneal the data in one observation at a time, and then re-sample (with replacement) the chains at each step after some intermediate amount of Gibbs sweeps. However, this version of CrossCat does not have this inference algorithm, and it could be challenging to implement.
As for selecting a single chain using non-formal techniques, there are heuristics galore. Some approaches might be to consider the log-score; to keep a held-out dataset and study the predictive likelihood; to investigate the dependencies found by each chain and select the one which best matches your domain knowledge or expectations; etc.
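For instance, the log-score heuristic could look like this (the scores below are made-up numbers; in practice they would come from the engine's diagnostics):

```python
# Made-up final log-score for each of three chains.
logscores = [-1523.4, -1498.7, -1510.2]

# Pick the chain whose log-score is highest (least negative).
best_chain = max(range(len(logscores)), key=lambda i: logscores[i])
print(best_chain)  # 1
```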
> If I understand it correctly, the example shows that to sample a new X_L we don't need to make n_steps high, right? Unless there is some internal mixing process.
The example that I linked to is purely illustrative, to show how to use the software API. It makes no assumptions or guarantees about inference quality, or about the number of steps you need to run. Typically this will be dominated by the data analysis application at hand.
@fsaad Thanks.
I am trying to sample 10K partitions from the posterior distribution. To give a rough idea of how long it takes for my problem: about 10 hours for 200 transitions. Therefore, running 10K chains is infeasible, so I am thinking of an alternative strategy.

I am still struggling with generating one sample after 1K transitions, because the job has run for almost 3 days and has not finished yet. Two questions:

1. Is there any way to see the trace plot? Perhaps 10K transitions is too much.
2. There is also `max_iterations`; I couldn't find an explanation of this option.

Thanks,
> it takes about 10hr for 200 transitions
My sense is that you might be using a dataset size which is larger than the target of this software --- how many rows and columns are you trying to learn?
> There is also max_iterations, I couldn't find an explanation of this option.
I believe `max_iterations` is not used by the system. You can try `max_seconds` instead, which will terminate inference after the given number of seconds.
> Is there any way to see the trace plot, perhaps 10K transitions is too much.
If you run `LocalEngine.analyze` with `do_diagnostics=True`, then the latent variables at each step will be saved in a dictionary. The return value of `analyze` will then be a 3-tuple of the form `X_L_new, X_D_new, diagnostics_new`.
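A hedged sketch of consuming such a diagnostics dictionary; the key name `logscore` follows the discussion in this thread, but the exact layout (here, one row per transition and one column per chain) and all numbers are assumptions:

```python
diagnostics_new = {
    'logscore': [           # made-up values: 3 transitions x 2 chains
        [-1600.0, -1610.0],
        [-1550.0, -1560.0],
        [-1540.0, -1542.0],
    ],
}

# Transpose into one trace per chain, e.g. to plot or eyeball convergence.
traces = [list(t) for t in zip(*diagnostics_new['logscore'])]
final_scores = [trace[-1] for trace in traces]
print(final_scores)  # [-1540.0, -1542.0]
```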
Hi @fsaad
Thank you for your reply. Actually, the matrix is not that large: it is 19992x105. Based on what I read in the documentation, it is roughly of the order that CrossCat can handle, but if you think it is not appropriate or it is too big, please let me know.

Best,
Thanks @fsaad, `diagnostics_new` is exactly what I want. The `logscore` can be used to monitor the quality of the chain (like the ELBO).
Hi @fsaad
It seems that `max_time` is respected in `LocalEngine` but not in `MultiprocessingEngine`. Am I missing something?
@kayhan-batmanghelich I'm not sure offhand why `MultiprocessingEngine` is not respecting `max_time`; please open a separate ticket so we can track the issue, and let me know if it is OK to close this one.
@fsaad
Sure, I will open a new ticket. Please close this one.
Thanks
Hi,
I was wondering how to sample the posterior partitioning of the rows/columns (cluster assignments). Is this feature supported? If not, is there any way to do that? Theoretically it should be possible, but I couldn't find anything in the documentation. It seems it was possible in the MATLAB version, for example here:
https://github.com/probcomp/crosscat/blob/2de0192bbbc7118a947d1307724336aab6465945/legacy/crosscat_matlab/runModel.m#L274
Thanks,