mattjj / pyslds


Eigenvalues of learnt system do not match those of the generating process #14

Open DanteArucard opened 7 years ago

DanteArucard commented 7 years ago

I modified the simple demo to check how accurate the model reconstruction was.

I found that even with a simple configuration the eigenvalues do not match.

The number of hidden states, K, was 1, and both the test and true models were of class DefaultSLDS.

mattjj commented 7 years ago

Is there a reason you think they should match? Maybe in the infinite data limit, if you estimate the parameters with maximum likelihood, though I'm not 100% sure about that.

DanteArucard commented 7 years ago

If I'm not wrong, with maximum likelihood on a simple system like this the model should converge quite fast, because the dimension of the LTI system is predefined (2). I have a sequence of 8000 timesteps and I run resample_model 4000 times. So I expected that the eigenvalues would get near the true ones, or at least show consistent values over multiple independent estimations. With original eigenvalues [0.5, 0.375], over 2 optimization runs I obtain: run 1: [0.57639038, -0.01080844]; run 2: [0.09078091, 0.59991141].

With original eigenvalues [0.5, 0.5], over 2 optimization runs I obtain: run 1: [0.62606788, 0.12336595]; run 2: [0.15051347, 0.65322825].

It looks like there is a prior on the models that pushes the eigenvalues toward values that differ from the true ones but are related across runs. How do you set the model priors?

My intention is to learn the model of a process and then use the learnt model to make online predictions after observing a sequence of input/output steps. I assume that to obtain reliable predictions the eigenvalues should at least be similar. The sketch below shows roughly how I extract and compare them.
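For reference, a minimal sketch of how I extract the eigenvalues (it assumes the fitted model exposes its per-state dynamics regressions as dynamics_distns, each with a weight matrix A whose first D_latent columns are the state-transition part, which is what pyslds/pybasicbayes appear to do):

```python
import numpy as np

def dynamics_eigenvalues(model, D_latent):
    """Eigenvalues of each discrete state's latent dynamics matrix."""
    return [np.linalg.eigvals(dd.A[:, :D_latent])
            for dd in model.dynamics_distns]

# e.g. compare the fitted model against the generating one:
# print(dynamics_eigenvalues(test_model, 2))   # learned
# print(dynamics_eigenvalues(true_model, 2))   # [0.5, 0.375]
```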

mattjj commented 7 years ago

If you want to use maximum likelihood with no switching, you should probably use pylds and the EM algorithm.
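A minimal sketch of what that would look like with pylds, based on its examples (constructor and attribute names should be double-checked against the pylds README):

```python
import numpy as np
import numpy.random as npr
from pylds.models import DefaultLDS

D_obs, D_latent, T = 1, 2, 8000
data = npr.randn(T, D_obs)            # stand-in; use the y you generated

model = DefaultLDS(D_obs, D_latent)   # a single LDS, no switching
model.add_data(data)

# EM climbs to a local optimum of the likelihood
lls = []
for _ in range(50):
    model.EM_step()
    lls.append(model.log_likelihood())

# Eigenvalues of the learned dynamics matrix
A_hat = model.A[:, :D_latent]
print(np.linalg.eigvals(A_hat))
```

Since EM only finds a local optimum, where you initialize still matters.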

Whether a dynamics estimate is consistent would have to depend on other factors, like whether the system that generated the data is observable.

Priors wouldn't matter with EM, but just look at the implementation of DefaultLDS; the nu_0, S_0, M_0, and K_0 arguments to the Regression constructor are the matrix normal inverse Wishart prior hyperparameters.
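If you wanted to set those yourself, it would look roughly like this (a sketch; the values are placeholders, not the DefaultLDS defaults, and the shapes should be checked against the Regression docstring):

```python
import numpy as np
from pybasicbayes.distributions import Regression

D_latent = 2

# Matrix normal inverse Wishart (MNIW) hyperparameters for the dynamics
# regression x_{t+1} = A x_t + noise:
#   M_0        prior mean of A
#   K_0        column precision; larger values concentrate A around M_0
#   S_0, nu_0  inverse-Wishart scale matrix and degrees of freedom
#              for the state noise covariance
dynamics_distn = Regression(
    nu_0=D_latent + 2,
    S_0=D_latent * np.eye(D_latent),
    M_0=np.zeros((D_latent, D_latent)),
    K_0=np.eye(D_latent))
```

You could then pass a Regression like that in as the dynamics distribution instead of letting DefaultLDS construct one from its defaults.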

DanteArucard commented 7 years ago

Thanks a lot, and sorry for taking up your time.

I actually need the Switching LDS but I wanted to check how the system would behave in the simplest case.

The system I use for generating data in the simple_demo modification should be observable:

```python
import numpy as np
from pyslds.models import DefaultSLDS

K = 1
D_obs = 1
D_latent = 2
D_input = 0
T = 8000

# Simulate from one LDS
true_mu_inits = [np.ones(D_latent) for _ in range(K)]
true_sigma_inits = [0.05 * np.eye(D_latent) for _ in range(K)]

true_As = [np.zeros((D_latent, D_latent)) for _ in range(K)]
true_As[0] = 0.5 * np.array([[1, 1], [0, 0.75]])   # eigenvalues 0.5, 0.375

true_sigma_states = [0.05 * np.eye(D_latent) for _ in range(K)]

true_C = np.ones((D_obs, D_latent))
true_sigma_obs = 0.05 * np.eye(D_obs)

true_model = DefaultSLDS(
    K, D_obs, D_latent, D_input=D_input,
    mu_inits=true_mu_inits, sigma_inits=true_sigma_inits,
    As=true_As, sigma_statess=true_sigma_states,
    Cs=true_C, sigma_obss=true_sigma_obs)
    # kappa=1000.)
```

PS: what are mu_inits, sigma_inits, and init_dynamics_distns? How is the uncertainty on the model and on the initial state encoded?

mattjj commented 7 years ago

No problem, this would be good to figure out :)

That example should indeed be observable.
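(A quick numerical check of that claim, using the standard observability-matrix rank test on the A and C from your snippet:)

```python
import numpy as np

A = 0.5 * np.array([[1., 1.], [0., 0.75]])   # true dynamics
C = np.ones((1, 2))                          # true_C

# Observability matrix [C; CA]; full column rank (2) means (A, C) is observable
O = np.vstack([C @ np.linalg.matrix_power(A, i) for i in range(A.shape[0])])
print(np.linalg.matrix_rank(O))              # -> 2
```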

Another thing to keep in mind is that EM (and other alternating algorithms like Gibbs) can get stuck in local optima, so it depends on where you initialize. One way to initialize is with something like 4SID, which is asymptotically consistent (see also Byron Boots's thesis). I started an implementation to use with pylds / pyslds but never finished it. You could just try initializing near the answer.
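For instance, "initializing near the answer" for the Gibbs sampler could look roughly like this (a sketch; it assumes the model's per-state dynamics regressions are exposed as dynamics_distns, each with a weight matrix attribute A, as in pybasicbayes):

```python
import numpy as np
import numpy.random as npr

def init_near_truth(model, true_A, scale=0.01):
    """Set each discrete state's dynamics matrix to a small perturbation
    of the true A before running Gibbs, so the sampler starts in the
    right basin of attraction."""
    D_latent = true_A.shape[0]
    for dd in model.dynamics_distns:
        dd.A[:, :D_latent] = true_A + scale * npr.randn(D_latent, D_latent)

# e.g. init_near_truth(test_model, 0.5 * np.array([[1., 1.], [0., 0.75]]))
```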

The priors might also be unintentionally strong, but I would have to dig into the code to understand them. T=8000 seems like it would be plenty of data to overwhelm a reasonable prior.

The initial dynamics distributions (one for each state, each parameterized by a mu_init and sigma_init) are distributions for p(x_1 | z_1, parameters), i.e. the continuous state component at time t=1 given the discrete state component at time t=1. Those are Gaussian distributions.

The representation of uncertainty depends on what algorithm you're using. If you're using EM, uncertainty is explicitly represented over the states but not over the parameter estimates. If you're using variational mean field, the uncertainty on the parameters is represented by a variational factor on the parameters of the same form as the prior (e.g. a matrix normal inverse Wishart for each dynamics matrix). If you're using block Gibbs sampling, the uncertainty is encoded in the iterates of the Markov chain, so you can form MCMC estimates from those samples (e.g. to estimate the posterior variance of some parameter, you would compute the sample variance of its Gibbs iterates). Each call to resample_model is one Gibbs sweep.
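For example, forming those MCMC estimates for the eigenvalues could look roughly like this (a sketch, again assuming the per-state dynamics matrices are reachable via dynamics_distns[k].A):

```python
import numpy as np

# test_model: your fitted DefaultSLDS; D_latent = 2 in your example.
# Collect eigenvalues of the first discrete state's dynamics matrix after
# each Gibbs sweep, then summarize with posterior mean and variance.
D_latent = 2
eig_samples = []
for _ in range(2000):
    test_model.resample_model()
    A = test_model.dynamics_distns[0].A[:, :D_latent]
    eig_samples.append(np.sort(np.linalg.eigvals(A)))

eig_samples = np.array(eig_samples)[1000:]   # discard burn-in
print("posterior mean:", eig_samples.mean(axis=0))
print("posterior var: ", eig_samples.var(axis=0))
```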

Hope that helps. I don't have time to dig into the details of your example right now, but let's keep this issue open until you solve it.

DanteArucard commented 7 years ago

Thank you so much,

just for completeness, these are the details I used for the training:

```python
inputs = npr.randn(T, D_input)
y, x, z = true_model.generate(T, inputs=inputs)

test_model = DefaultSLDS(K, D_obs, D_latent, D_input,
                         Cs=true_C, sigma_obss=true_sigma_obs)
test_model.add_data(y, inputs=inputs)

N_samples = 4000

def update(model):
    model.resample_model()
    return model.log_likelihood()

lls = [update(test_model) for _ in progprint_xrange(N_samples)]
```

Thanks a lot,
Dante

DanteArucard commented 7 years ago

As you suggested, the problem is probably the Gibbs sampling approximation. But it is not trivial to extract the mixture parameters from the samples because of label switching.

I tried to run the example meanfield.py but I get this error:

```
$ python meanfield.py
Traceback (most recent call last):
  File "meanfield.py", line 6, in <module>
    from autoregressive.distributions import AutoRegression
ImportError: No module named autoregressive.distributions
```

And I fail to install git://github.com/mattjj/pyhsmm-autoregressive.git (from a git clone), which I guess is what is missing.

The error is:

```
building 'autoregressive.messages' extension
cc -fno-strict-aliasing -fno-common -dynamic -arch i386 -arch x86_64 -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch i386 -arch x86_64 -pipe -I/Library/Python/2.7/site-packages/numpy/core/include -Ideps -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c autoregressive/messages.cpp -o build/temp.macosx-10.11-intel-2.7/autoregressive/messages.o -O2 -std=c++11 -DEIGEN_NO_MALLOC -DNDEBUG -w
In file included from autoregressive/messages.cpp:476:
autoregressive/messages.h:3:10: fatal error: 'omp.h' file not found
#include <omp.h> // omp_get_num_threads, omp_get_thread_num
```

DanteArucard commented 7 years ago

Hi Matthew,

Using code similar to that of simple_demo.py, I compared Gibbs (MC) and mean field (MF), computing the permutation of the hidden states (and hence of their eigenvalues) that had minimum distance ("eig distance") from those of the generating process.

Varying K (number of discrete states) I obtained:

K=1
MF results, iterations=400: optimal eig distance 0.168222341047
MF results, iterations=400: optimal eig distance 0.173537924623
MC results, N_samples=2000: optimal eig distance 0.44206576271
MC results, N_samples=2000: optimal eig distance 0.217700008228

Here MF gets consistently better results than MC.

K=2
MF results, iterations=800: optimal eig distance 0.952534072764
MF results, iterations=800: optimal eig distance 0.952649413722
MC results, N_samples=2000: optimal eig distance 0.17666540212
MC results, N_samples=2000: optimal eig distance 1.05585382571

Here MC occasionally shows much better results than MF.

K=3
MF results, iterations=1200: optimal eig distance 1.1253142908
MF results, iterations=1200: optimal eig distance 1.14856115732
MC results, N_samples=2000: optimal eig distance 0.280453989695
MC results, N_samples=2000: optimal eig distance 0.577290220684

In this case MF is behaving worse than MC.

I would still prefer MF, as it should not be affected by the label switching problems that MC probably is (though I'm not sure that is the case). I thought that an explicit state duration model, or at least a sticky HMM, would help learn a more precise model.

One approach to extracting the models from MC even in the face of label switching (which is also relevant for this issue) would be to align the eigenvalues, as I did: find the permutation of discrete states with minimum eigenvalue distance, and use the corresponding relabelling to combine models extracted from different samples. A sketch of that alignment is below.
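A minimal sketch of that alignment (brute force over permutations, so only sensible for small K; the distance metric here, a sum of absolute differences between sorted eigenvalues, is one possible choice and may differ from what I actually used):

```python
import numpy as np
from itertools import permutations

def eig_distance(eigs_a, eigs_b):
    """Distance between two eigenvalue sets: sum of absolute differences
    after sorting (eigenvalues may be complex; np.sort handles that)."""
    return np.abs(np.sort(eigs_a) - np.sort(eigs_b)).sum()

def best_state_permutation(ref_eigs, sample_eigs):
    """Permutation of a sample's discrete states that best matches a
    reference model, measured by total eigenvalue distance.

    ref_eigs, sample_eigs: length-K lists of per-state eigenvalue arrays.
    """
    K = len(ref_eigs)
    return min(permutations(range(K)),
               key=lambda p: sum(eig_distance(ref_eigs[k], sample_eigs[p[k]])
                                 for k in range(K)))
```

With the permutation in hand, the per-state parameters from each Gibbs sample can be relabelled consistently before averaging them.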

I also still don't understand the poor performance of MC when K is 1. And I think that having MF, or another inference mechanism besides Monte Carlo, would be useful with the sticky models.