T-matrix sampler fails input check for pathologic (but correct) count matrices

franknoe commented 8 years ago

In the T-matrix sampler it is being checked whether the sparsity pattern of the initial transition matrix and the count matrix are consistent. One would think that this is always the case when we estimate the transition matrix ourselves (i.e. when the input parameter for the initial T-matrix is 'None'). However it can happen that for count matrix with very small fractional counts (which do occur e.g. in HMM estimation), the corresponding transition matrix gets zeros at the corresponding elements (probably underflow).

I can handle and avoid this problem in the HMM code, but I think it should be handled in the msmtools because it's an internal inconsistency.

I suggest the following solution: only check for consistency between C and T if T is an input. When T is estimated here, use the sparsity pattern of T. Check for connectivity before starting the sampler, and only raise an exception if the connectivity is lost.

Example: the following input

C = np.array([[  3.37739137e+003,   1.64131982e-116,   8.74913916e-223,   3.90000000e+001,    0.00000000e+000],
              [  4.22935432e-117,   2.91538610e+003,   1.82706999e-002,   4.09817293e+001,    6.18100826e-319],
              [  1.51045213e-224,   1.22409738e-003,   2.05824173e+003,   3.02245113e+000,    4.79763248e+001],
              [  6.16365715e+000,   2.06235135e+001,   8.21282933e+000,   4.90738509e+003,    3.30154263e-142],
              [  0.00000000e+000,   4.20252238e-320,   3.00000000e+000,   3.56721479e-143,    3.44204148e+003]])
msmest.sample_tmatrix(C, nsample=1, nsteps=10, reversible=True)

will fail with

/Users/noe/data/software_projects/msmtools/msmtools/estimation/dense/sampler_rev.pyx in msmtools.estimation.dense.sampler_rev.SamplerRev.check_input (msmtools/estimation/dense/sampler_rev.c:3098)()
    122         iV, jV = np.where( (self.V+self.V.T)>0 )
    123         if not np.array_equal(iC, iV):
--> 124             raise ValueError('Sparsity patterns of C and X are different.')
    125         if not np.array_equal(jC, jV):
    126             raise ValueError('Sparsity patterns of C and X are different.')

ValueError: Sparsity patterns of C and X are different.

Because the estimated transition matrix of C has zeros at elements [1,4] and [4,1](probably from an underflow).

fabian-paul commented 8 years ago

The sparsity-pattern of P is unreliable, because it depends on the maximum number of iterations, the ML estimator is instructed to perform. Does the sparsity-pattern of matrices generated by the sampler depend on the initial guess of P? That is, if some connectivity is lost in one sample, is it lost as well in all subsequent P samples?

marscher commented 5 years ago

fixed by #91

markovmodel / msmtools

T-matrix sampler fails input check for pathologic (but correct) count matrices #70