change constants in DocumentClassificationAnnotator to differentiate feature extractors

From the mailing list:

I was looking at the 
org.cleartk.examples.documentclassification.advanced.DocumentClassificationAnnotator 
code and I noticed two different feature extractor keys with the same value 
in lines 120 and 122:

  public static final String ZMUS_EXTRACTOR_KEY = "LengthFeatures";

  public static final String MINMAX_EXTRACTOR_KEY = "LengthFeatures";

Lee's response:

After looking at the code, I think this may actually be a bug.  While this code 
is functionally correct, it is only so by virtue of a side-effect of how the 
stats are computed and how feature encoding is handled.  During training, all 
trainable feature extractors save off a TransformableFeature, which has a name 
and a collection of features extracted by their respective sub-extractors.  
Because the ZmusExtractor and MinMaxExtractor have the same sub-extractors 
(token and sentence counters), and because they have the same name ("Length"), 
we end up with two transformable features with identical names and identical 
sets of sub-features.  Then when it comes time to train these trainable feature 
extractors (i.e. compute the z-score and min/max statistics), I believe the 
extractors actually train on each of the transformable features named 
"Length".  For other statistics this might cause a problem, but in the case of 
mean, standard deviation, min and max the values are unaffected.
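
To illustrate why seeing the same values twice leaves these statistics alone, 
here is a minimal Java sketch.  It does not use ClearTK; the class name, method 
names and sample counts are made up for illustration, and it assumes a 
population standard deviation (dividing by n).  A sample standard deviation 
(dividing by n - 1) would shift slightly, and a statistic such as a sum would 
double.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-alone sketch; none of these names come from ClearTK.
public class DuplicateStatsSketch {

  // Arithmetic mean of the values.
  static double mean(List<Double> xs) {
    return sum(xs) / xs.size();
  }

  // Population standard deviation (divides by n, not n - 1).
  static double populationStdDev(List<Double> xs) {
    double m = mean(xs);
    double squared = 0.0;
    for (double x : xs) {
      squared += (x - m) * (x - m);
    }
    return Math.sqrt(squared / xs.size());
  }

  // Plain sum, included as an example of a statistic that does change.
  static double sum(List<Double> xs) {
    double total = 0.0;
    for (double x : xs) {
      total += x;
    }
    return total;
  }

  // Simulates two identically named features contributing the same values.
  static List<Double> seenTwice(List<Double> xs) {
    List<Double> doubled = new ArrayList<>(xs);
    doubled.addAll(xs);
    return doubled;
  }

  public static void main(String[] args) {
    // Made-up token counts for three documents.
    List<Double> once = Arrays.asList(10.0, 20.0, 60.0);
    List<Double> twice = seenTwice(once);

    System.out.println("mean:   " + mean(once) + " vs " + mean(twice));
    System.out.println("stddev: " + populationStdDev(once) + " vs " + populationStdDev(twice));
    System.out.println("min:    " + Collections.min(once) + " vs " + Collections.min(twice));
    System.out.println("max:    " + Collections.max(once) + " vs " + Collections.max(twice));
    System.out.println("sum:    " + sum(once) + " vs " + sum(twice));
  }
}
```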

I would suggest we find a way to extract this "Length" transformable feature 
only once, but that would require extra management of which feature extractors 
are called during training and testing of the document classifier.  So the 
takeaway is that this works by coincidence, and the right thing to do is to 
change the values of these constants to differentiate these feature extractors.
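
A rough sketch of what the suggested change amounts to follows.  The two 
constant values are hypothetical placeholders (the names actually chosen in 
ClearTK may differ), and groupByName is an invented helper that only mimics a 
name-keyed lookup; it is not ClearTK's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch; not the actual DocumentClassificationAnnotator.
public class ExtractorKeySketch {

  // Placeholder values chosen for illustration only.
  public static final String ZMUS_EXTRACTOR_KEY = "ZmusLengthFeatures";

  public static final String MINMAX_EXTRACTOR_KEY = "MinMaxLengthFeatures";

  // Collects the values recorded under each name, as a name-keyed training
  // pass would see them.  Identical names end up sharing one entry.
  static Map<String, List<Double>> groupByName(
      String firstKey, String secondKey, List<Double> values) {
    Map<String, List<Double>> byName = new LinkedHashMap<>();
    for (String key : new String[] {firstKey, secondKey}) {
      byName.computeIfAbsent(key, k -> new ArrayList<>()).addAll(values);
    }
    return byName;
  }

  public static void main(String[] args) {
    List<Double> lengths = Arrays.asList(10.0, 20.0);

    // Same name for both extractors: one entry holding every value twice.
    System.out.println(groupByName("LengthFeatures", "LengthFeatures", lengths));

    // Distinct names: each extractor keeps its own entry.
    System.out.println(groupByName(ZMUS_EXTRACTOR_KEY, MINMAX_EXTRACTOR_KEY, lengths));
  }
}
```

With distinct names, each extractor's statistics are computed from its own 
entry, so correctness no longer depends on the statistic being insensitive to 
repeated values.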

Original issue reported on code.google.com by steven.b...@gmail.com on 29 Mar 2013 at 12:26

laito / cleartk, issue #356