hoffmangroup / segway

Application for semi-automated genomic annotation.
http://segway.hoffmanlab.org/
GNU General Public License v2.0

Soft labeling for Semi-supervised mode #38

Closed EricR86 closed 9 years ago

EricR86 commented 9 years ago

Original report (BitBucket issue) by Sakura Tamaki (Bitbucket: Tamaki_Sakura).


Currently, Segway's semi-supervised mode works such that, during training, at any position with a semi-supervised label, the whole probability vanishes unless the segment label at that position is equal to the semi-supervised label.

We should add a parameter such that, when enabled, the probability vanishes only if the segment label at that position is not "close" to the semi-supervised label.

For instance, if the semi-supervised label is 0 at chr1:20000, then in training we would allow chr1:20000 to take any of the labels 0, 1, 2, 3, 4, each with the same positive probability.
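
To make the difference concrete, here is a minimal sketch of the hard constraint versus the proposed soft one, assuming a per-position list of unnormalized segment-label probabilities. The function names, the `width` parameter, and the numbers are hypothetical; none of this is Segway's actual code, which expresses the constraint through GMTK rather than Python.

```python
def hard_mask(probs, supervised_label):
    """Current behaviour: only the supervised label keeps its probability."""
    return [p if label == supervised_label else 0.0
            for label, p in enumerate(probs)]


def soft_mask(probs, supervised_label, width):
    """Proposed behaviour (hypothetical): every label in the half-open range
    [supervised_label, supervised_label + width) keeps its probability."""
    allowed = range(supervised_label, supervised_label + width)
    return [p if label in allowed else 0.0
            for label, p in enumerate(probs)]


# Made-up probabilities for six segment labels at chr1:20000.
probs = [0.1, 0.2, 0.3, 0.1, 0.1, 0.2]
hard = hard_mask(probs, 0)      # only label 0 survives
soft = soft_mask(probs, 0, 5)   # labels 0..4 survive, label 5 is zeroed
```

In both cases the constraint contributes a factor of 1 for allowed labels and 0 for the rest, so it does not favour any one of the allowed labels over another.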

EricR86 commented 9 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


The best way to specify this would be in the supervision BED file, so instead of having 0 as the name, we could have 0:5 to specify anything in the range [0,5) == 0,1,2,3,4.
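
A rough sketch of how such a name field could be parsed, assuming a plain integer means a single label and `start:end` means the half-open range [start, end). The helper `parse_supervision_name` and the example BED line are hypothetical and are not taken from `load_supervision()`.

```python
def parse_supervision_name(name):
    """Return the range of segment labels allowed by a BED name field.

    "0"   -> range(0, 1), i.e. only label 0
    "0:5" -> range(0, 5), i.e. labels 0, 1, 2, 3, 4
    """
    start_text, sep, end_text = name.partition(":")
    start = int(start_text)
    # Without a colon, the name is a single label.
    end = int(end_text) if sep else start + 1
    if end <= start:
        raise ValueError(f"empty label range in supervision name: {name!r}")
    return range(start, end)


# Hypothetical supervision BED line: chrom, chromStart, chromEnd, name.
line = "chr1\t20000\t20001\t0:5"
chrom, chrom_start, chrom_end, name = line.split("\t")[:4]
allowed = parse_supervision_name(name)
```

Keeping the plain-integer form valid means existing supervision files would still parse to the same single-label constraint.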

EricR86 commented 9 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


Current code is in segway.run.Runner.load_supervision().

EricR86 commented 9 years ago

Original comment by Sakura Tamaki (Bitbucket: Tamaki_Sakura).


Oh great, time to enjoy an if statement without the keyword if :(

Is there any recommended reference for GMTK? The reference section of the official manual is empty, and the toolkit overview in the manual is a great introduction but does not include all the features I want.

Basically, I need to find out two things:

  1. Does GMTK support more complicated data structures in its parameters? If not, we might need to discuss how to implement soft assignment and what should be implemented. In the worst case, we need to change our DBN structure.
  2. How do I represent a logical AND operator in a Dense CPT file?

EricR86 commented 9 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


There is a decision tree, map_supervisionLabel_seg_alwaysTrue, that is invoked by the supervisionLabel_seg_alwaysTrue DeterministicCPT to connect the alwaysTrue and supervisionLabel random variables. See supervision.tmpl for the GMTKL structure of these variables.

The basics of the decision tree syntax are described well in the old GMTK manual. I think the operators are not. But the basic decision tree allows ranges, which should make it not too difficult to do what you want to do.

EricR86 commented 9 years ago

Original comment by Sakura Tamaki (Bitbucket: Tamaki_Sakura).


Pull Request #29 Merged

EricR86 commented 9 years ago

Original comment by Sakura Tamaki (Bitbucket: Tamaki_Sakura).