From the mailing list:
I was looking at the
org.cleartk.examples.documentclassification.advanced.DocumentClassificationAnnotator
code and I noticed two different feature extractor keys with the same value on
lines 120 and 122:
public static final String ZMUS_EXTRACTOR_KEY = "LengthFeatures";
public static final String MINMAX_EXTRACTOR_KEY = "LengthFeatures";
Lee's response:
After looking at the code, I think this may actually be a bug. While this code
is functionally correct, it is only doing so by virtue of a side-effect of how
the stats are computed and how feature encoding is handled. During training
all trainable feature extractors save off a TransformableFeature which has a
name and a collection of features extracted by their respective sub-extractors.
Because the ZmusExtractor and MinMaxExtractor have the same sub-extractors
(token and sentence counters), and because they have the same name ("Length"),
we end up with two Transformable features with identical names and identical
sets of sub-features. Then when it comes time to train these trainable feature
extractors (i.e. compute the z-score and min/max statistics), I believe the
extractors actually train on each of the transformable features named "Length". For
other statistics this might cause a problem, but in the case of mean, standard
deviation, min and max the values are unaffected.
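A quick way to sanity-check the claim above is to duplicate every observation
and compare the statistics. This is only a sketch in plain Java, not ClearTK
code: the class DuplicateStats, its helper methods, and the sample values are
all made up for illustration. It suggests that mean, min, max and the
population standard deviation are unchanged by duplication, while the sample
(n - 1) standard deviation is not. Which form of standard deviation ClearTK
computes is not checked here.

```java
import java.util.Arrays;

// Hypothetical helper, not part of ClearTK: compares statistics computed on a
// data set with the same statistics computed after every value is duplicated.
public class DuplicateStats {

  static double mean(double[] xs) {
    return Arrays.stream(xs).average().orElse(Double.NaN);
  }

  static double min(double[] xs) {
    return Arrays.stream(xs).min().orElse(Double.NaN);
  }

  static double max(double[] xs) {
    return Arrays.stream(xs).max().orElse(Double.NaN);
  }

  // Sum of squared deviations from the mean.
  static double sumSquares(double[] xs) {
    double m = mean(xs);
    return Arrays.stream(xs).map(x -> (x - m) * (x - m)).sum();
  }

  // Population form: divide by n.
  static double populationStd(double[] xs) {
    return Math.sqrt(sumSquares(xs) / xs.length);
  }

  // Sample form: divide by n - 1.
  static double sampleStd(double[] xs) {
    return Math.sqrt(sumSquares(xs) / (xs.length - 1));
  }

  // Returns a new array in which every value of xs appears twice.
  static double[] duplicateEach(double[] xs) {
    double[] out = new double[xs.length * 2];
    for (int i = 0; i < xs.length; i++) {
      out[2 * i] = xs[i];
      out[2 * i + 1] = xs[i];
    }
    return out;
  }

  public static void main(String[] args) {
    double[] once = {2, 4, 4, 4, 5, 5, 7, 9}; // made-up length counts
    double[] twice = duplicateEach(once);

    System.out.println("mean:           " + mean(once) + " vs " + mean(twice));
    System.out.println("min:            " + min(once) + " vs " + min(twice));
    System.out.println("max:            " + max(once) + " vs " + max(twice));
    System.out.println("population std: " + populationStd(once) + " vs " + populationStd(twice));
    System.out.println("sample std:     " + sampleStd(once) + " vs " + sampleStd(twice));
  }
}
```

Any statistic that depends on the raw count or sum (for example a total, or the
n - 1 divisor above) would shift when the same features are seen twice, which
is the kind of problem the paragraph above warns about.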
I would suggest we find a way to extract this "Length" Transformable feature
only once, but that would require extra management of which feature extractors
are called during training and test of the document classifier. So the
takeaway is that this works by coincidence, and the right thing to do is to
change the values of these constants to differentiate these feature extractors.
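As a sketch of that change, the two keys could simply be given distinct
values. The strings below and the wrapper class LengthExtractorKeys are
hypothetical, chosen only for illustration; any two different values would do,
and whether anything else depends on the current "LengthFeatures" value has not
been checked here.

```java
// Hypothetical wrapper class; in the example code these constants live in
// DocumentClassificationAnnotator.
public class LengthExtractorKeys {
  // Hypothetical values: the only requirement is that the two keys differ.
  public static final String ZMUS_EXTRACTOR_KEY = "ZmusLengthFeatures";
  public static final String MINMAX_EXTRACTOR_KEY = "MinMaxLengthFeatures";

  public static void main(String[] args) {
    // Guard against the two keys colliding again.
    if (ZMUS_EXTRACTOR_KEY.equals(MINMAX_EXTRACTOR_KEY)) {
      throw new IllegalStateException("extractor keys must be distinct");
    }
    System.out.println("keys are distinct");
  }
}
```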
Original issue reported on code.google.com by steven.b...@gmail.com on 29 Mar 2013 at 12:26