As discussed on the mailing list, different feature encoders do different
things when encountering duplicate features:
https://groups.google.com/d/topic/cleartk-users/B2cfZSUX7W0/discussion
For example, FeatureVectorFeaturesEncoder adds together the counts for
identical feature names,
NameNumberFeaturesEncoder produces duplicate NameNumber pairs, and
FeatureNodeArrayEncoder throws away all but the last value.
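
To make the difference concrete, here is a minimal sketch of the three behaviors in plain Java. It does not use the ClearTK encoder classes; the class `DuplicateFeatureBehaviors`, its methods, and the name/value pair representation are all hypothetical stand-ins chosen for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these methods are ClearTK APIs.
public class DuplicateFeatureBehaviors {

  // Convenience constructor for a (name, value) pair.
  public static Map.Entry<String, Double> feature(String name, double value) {
    return new SimpleEntry<>(name, value);
  }

  // Behavior described for FeatureVectorFeaturesEncoder: sum values per name.
  public static Map<String, Double> sumDuplicates(List<Map.Entry<String, Double>> features) {
    Map<String, Double> result = new LinkedHashMap<>();
    for (Map.Entry<String, Double> f : features) {
      result.merge(f.getKey(), f.getValue(), Double::sum);
    }
    return result;
  }

  // Behavior described for NameNumberFeaturesEncoder: keep every pair, duplicates included.
  public static List<Map.Entry<String, Double>> keepDuplicates(
      List<Map.Entry<String, Double>> features) {
    return new ArrayList<>(features);
  }

  // Behavior described for FeatureNodeArrayEncoder: later values overwrite earlier ones.
  public static Map<String, Double> keepLast(List<Map.Entry<String, Double>> features) {
    Map<String, Double> result = new LinkedHashMap<>();
    for (Map.Entry<String, Double> f : features) {
      result.put(f.getKey(), f.getValue());
    }
    return result;
  }

  public static void main(String[] args) {
    List<Map.Entry<String, Double>> features = Arrays.asList(
        feature("word=dog", 1.0),
        feature("word=dog", 2.0),
        feature("length", 3.0));
    System.out.println(sumDuplicates(features));
    System.out.println(keepDuplicates(features));
    System.out.println(keepLast(features));
  }
}
```

With the same input, the three methods yield three different encodings of `word=dog`, which is the inconsistency at issue.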
All the feature encoders should do the same thing. A few options:
* Add values together, as in FeatureVectorFeaturesEncoder, though this doesn't
make much sense for Boolean-valued features.
* Throw an exception, requiring the annotator to de-duplicate. This might be
conceptually the simplest thing to do, but might require substantially more
work from the annotator.
In addition to true duplicates, we also need to figure out what we should do
when two features with the same name but *different* values are given.
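
As a rough sketch of how the two cases could be told apart, and of what the "throw an exception" option might look like, consider the following plain-Java helper. Everything here is an assumption made for illustration: `DuplicateFeatureCheck`, `trueDuplicates`, `conflicts`, and `requireUnique` are hypothetical names, not existing ClearTK classes or methods, and features are modeled as simple name/value pairs.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of ClearTK.
public class DuplicateFeatureCheck {

  // Convenience constructor for a (name, value) pair.
  public static Map.Entry<String, Object> feature(String name, Object value) {
    return new SimpleEntry<>(name, value);
  }

  // Group every value seen for each feature name, preserving input order.
  static Map<String, List<Object>> valuesByName(List<Map.Entry<String, Object>> features) {
    Map<String, List<Object>> grouped = new LinkedHashMap<>();
    for (Map.Entry<String, Object> f : features) {
      grouped.computeIfAbsent(f.getKey(), k -> new ArrayList<>()).add(f.getValue());
    }
    return grouped;
  }

  // Names that occur more than once, always with the same value.
  public static Set<String> trueDuplicates(List<Map.Entry<String, Object>> features) {
    Set<String> names = new LinkedHashSet<>();
    for (Map.Entry<String, List<Object>> e : valuesByName(features).entrySet()) {
      List<Object> values = e.getValue();
      if (values.size() > 1 && new HashSet<>(values).size() == 1) {
        names.add(e.getKey());
      }
    }
    return names;
  }

  // Names that occur with at least two different values.
  public static Set<String> conflicts(List<Map.Entry<String, Object>> features) {
    Set<String> names = new LinkedHashSet<>();
    for (Map.Entry<String, List<Object>> e : valuesByName(features).entrySet()) {
      if (new HashSet<>(e.getValue()).size() > 1) {
        names.add(e.getKey());
      }
    }
    return names;
  }

  // The "throw an exception" option: reject any repeated name, whatever its values.
  public static void requireUnique(List<Map.Entry<String, Object>> features) {
    for (Map.Entry<String, List<Object>> e : valuesByName(features).entrySet()) {
      if (e.getValue().size() > 1) {
        throw new IllegalArgumentException("Duplicate feature name: " + e.getKey());
      }
    }
  }

  public static void main(String[] args) {
    List<Map.Entry<String, Object>> features = Arrays.asList(
        feature("isCapitalized", true),
        feature("isCapitalized", true),
        feature("pos", "NN"),
        feature("pos", "VB"));
    System.out.println("true duplicates: " + trueDuplicates(features));
    System.out.println("conflicts: " + conflicts(features));
  }
}
```

Whichever policy is chosen, a shared check like this would let every encoder detect both cases the same way before applying it.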
Original issue reported on code.google.com by steven.b...@gmail.com on 1 Mar 2013 at 9:24