hoffmangroup / segway

Application for semi-automated genomic annotation.
http://segway.hoffmanlab.org/
GNU General Public License v2.0

Disabling Gaussian splitting in single-Gaussian case #133

Open EricR86 opened 5 years ago

EricR86 commented 5 years ago

Original report (BitBucket issue) by Rachel Chan (Bitbucket: rcwchan).


Since the mixture of Gaussians (GMM) patch, we have seen that Gaussian models can split or vanish components once the model's confidence becomes high enough. The concern is that this behavior may not be backwards compatible with pre-GMM Segway.

However, I am not sure how this behavior was enabled. Page 551 of GMTK's documentation states that one only needs to add the MX table to input.master to enable GMM. Yet Segway's commit history (e.g., for simpleseg's input.master) indicates that the MX table has always been present in Segway. To me, this suggests that Segway was a 1D mixture model all along, and that splitting/vanishing has always been possible.

If so, is splitting/vanishing now enabled because of the changes to the dpmf structure instead? Specifically, Segway used to use a single dpmf constant, 'dpmf_always', across all its Gaussians:

#!C

DPMF_IN_FILE inline
1

0 dpmf_always 1 1.0

MX_IN_FILE inline
8

0 1 mx_seg0_subseg0_testtrack1 1 dpmf_always mc_asinh_norm_seg0_subseg0_testtrack1
1 1 mx_seg0_subseg0_testtrack2 1 dpmf_always mc_asinh_norm_seg0_subseg0_testtrack2
2 1 mx_seg1_subseg0_testtrack1 1 dpmf_always mc_asinh_norm_seg1_subseg0_testtrack1
3 1 mx_seg1_subseg0_testtrack2 1 dpmf_always mc_asinh_norm_seg1_subseg0_testtrack2
4 1 mx_seg2_subseg0_testtrack1 1 dpmf_always mc_asinh_norm_seg2_subseg0_testtrack1
5 1 mx_seg2_subseg0_testtrack2 1 dpmf_always mc_asinh_norm_seg2_subseg0_testtrack2
6 1 mx_seg3_subseg0_testtrack1 1 dpmf_always mc_asinh_norm_seg3_subseg0_testtrack1
7 1 mx_seg3_subseg0_testtrack2 1 dpmf_always mc_asinh_norm_seg3_subseg0_testtrack2

If I understand the GMTK structure correctly, this means that 'dpmf_always' was a single dpmf constant tied across all components (labels). Does this mean that for GMTK to split/vanish Gaussians, it would have had to split/vanish that one dpmf constant, causing all components to split/vanish together? And is it likely that GMTK never reached the confidence to do so, so that this issue only appeared now that we have separate dpmf constants/tables for every mixture, as in the following?

#!C

DPMF_IN_FILE inline
8

0 dpmf_seg0_subseg0_testtrack1 1 DirichletTable dirichlet_num_mix_components  1.0
1 dpmf_seg0_subseg0_testtrack2 1 DirichletTable dirichlet_num_mix_components  1.0
2 dpmf_seg1_subseg0_testtrack1 1 DirichletTable dirichlet_num_mix_components  1.0
3 dpmf_seg1_subseg0_testtrack2 1 DirichletTable dirichlet_num_mix_components  1.0
4 dpmf_seg2_subseg0_testtrack1 1 DirichletTable dirichlet_num_mix_components  1.0
5 dpmf_seg2_subseg0_testtrack2 1 DirichletTable dirichlet_num_mix_components  1.0
6 dpmf_seg3_subseg0_testtrack1 1 DirichletTable dirichlet_num_mix_components  1.0
7 dpmf_seg3_subseg0_testtrack2 1 DirichletTable dirichlet_num_mix_components  1.0

MX_IN_FILE inline
8

0 1 mx_seg0_subseg0_testtrack1 1 dpmf_seg0_subseg0_testtrack1 mc_asinh_norm_seg0_subseg0_testtrack1_component0
1 1 mx_seg0_subseg0_testtrack2 1 dpmf_seg0_subseg0_testtrack2 mc_asinh_norm_seg0_subseg0_testtrack2_component0
2 1 mx_seg1_subseg0_testtrack1 1 dpmf_seg1_subseg0_testtrack1 mc_asinh_norm_seg1_subseg0_testtrack1_component0
3 1 mx_seg1_subseg0_testtrack2 1 dpmf_seg1_subseg0_testtrack2 mc_asinh_norm_seg1_subseg0_testtrack2_component0
4 1 mx_seg2_subseg0_testtrack1 1 dpmf_seg2_subseg0_testtrack1 mc_asinh_norm_seg2_subseg0_testtrack1_component0
5 1 mx_seg2_subseg0_testtrack2 1 dpmf_seg2_subseg0_testtrack2 mc_asinh_norm_seg2_subseg0_testtrack2_component0
6 1 mx_seg3_subseg0_testtrack1 1 dpmf_seg3_subseg0_testtrack1 mc_asinh_norm_seg3_subseg0_testtrack1_component0
7 1 mx_seg3_subseg0_testtrack2 1 dpmf_seg3_subseg0_testtrack2 mc_asinh_norm_seg3_subseg0_testtrack2_component0
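To make the difference between the two layouts concrete, here is a minimal Python sketch that groups mixtures by the dpmf name they reference. It assumes each MX row has the field order index, dimensionality, mixture name, number of components, dpmf name, component names, which is my reading of the rows above and not something I have checked against the GMTK parser. The helper name `mixtures_per_dpmf` is hypothetical, and only the first two rows of each table are used for brevity.

```python
from collections import defaultdict

# First two rows of the old MX table quoted above (single shared dpmf).
OLD_MX = """\
0 1 mx_seg0_subseg0_testtrack1 1 dpmf_always mc_asinh_norm_seg0_subseg0_testtrack1
1 1 mx_seg0_subseg0_testtrack2 1 dpmf_always mc_asinh_norm_seg0_subseg0_testtrack2
"""

# First two rows of the new MX table quoted above (one dpmf per mixture).
NEW_MX = """\
0 1 mx_seg0_subseg0_testtrack1 1 dpmf_seg0_subseg0_testtrack1 mc_asinh_norm_seg0_subseg0_testtrack1_component0
1 1 mx_seg0_subseg0_testtrack2 1 dpmf_seg0_subseg0_testtrack2 mc_asinh_norm_seg0_subseg0_testtrack2_component0
"""


def mixtures_per_dpmf(mx_rows):
    """Map each dpmf name to the list of mixtures that reference it.

    Assumed field order per row (not verified against GMTK itself):
    index, dimensionality, mixture name, number of components,
    dpmf name, component names.
    """
    groups = defaultdict(list)
    for line in mx_rows.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed rows
        mx_name, dpmf_name = fields[2], fields[4]
        groups[dpmf_name].append(mx_name)
    return dict(groups)


old_groups = mixtures_per_dpmf(OLD_MX)
new_groups = mixtures_per_dpmf(NEW_MX)

# Print how many mixtures share each dpmf in the two layouts.
for label, groups in (("old", old_groups), ("new", new_groups)):
    for dpmf_name, mixtures in groups.items():
        print(label, dpmf_name, len(mixtures))
```

This only shows which mixtures are tied to which dpmf entry. It says nothing about whether GMTK actually splits or vanishes a tied dpmf, which is the open question here.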

I would appreciate it if anyone with Segway/GMTK knowledge could weigh in, as I could be totally wrong. Thanks!