Closed GoogleCodeExporter closed 9 years ago
[Philip]
I don't have an obvious answer. I would want a few more use cases for "feature
groups" before we consider making our feature extraction and encoding API even
more complicated. On the other hand, it seems strange to have the
AnnotationHandler worrying about feature value normalization. This would seem
to blur the line between extraction and encoding.
What harm is there in giving the AnnotationHandler the ability to normalize a
group of features that it knows it wants normalized? Does it make it harder to
swap out one learner for another?
I'm inclined to endorse re-implementation of l2Norm in some place where it is
easy for AnnotationHandlers to make use of it.
Original comment by pvogren@gmail.com
on 18 Feb 2009 at 7:06
[Steve]
Yeah, I wasn't really looking for a new API, just a suggestion on how
best to use our current APIs to work around a problem.
I think for the most part the answer is no. I guess the one thing you'd
lose is the ability to normalize features that are string-valued in the
AnnotationHandler but converted to numeric features by one of the
FeatureEncoders. That seems like an unlikely use case to me - if you
want something normalized, you probably already have it in numeric form.
But maybe I'm just not creative enough in coming up with use cases. ;-)
Original comment by pvogren@gmail.com
on 18 Feb 2009 at 7:10
[Philipp]
From my perspective, this is clearly a problem of feature encoding (i.e. "how
do I present this feature to the classifier" rather than "how do I get the
value of this feature"), and as such it should be handled in a feature encoder,
not an annotation handler.
The solution is pretty simple, I think: in the annotation handler, rather than
just throwing all of the features into a bag, group them into one more complex
feature.
For example, let's say you want to normalize the bag-of-words features,
ignoring any other features that also show up in the feature vector. Instead of
throwing 10000 individual word features into the list of features generated by
the annotation handler, group those 10000 features into a bag-of-words feature
(i.e. write a simple BagOfWords class that encapsulates all the words that show
up).
Then you customize the feature encoding by adding a feature encoder that
dispatches on BagOfWords features, does the normalizing, and creates a long
list of feature vector elements. You also disable global normalization in the
features encoder.
This sounds complicated, but it requires only three things:
1) write a trivial BagOfWords class and modify the feature extraction to wrap
the long list of words in an object of that class
2) write a feature encoder for BagOfWords -- this is where the actual
normalization work is being done
3) write a features encoder factory to use the new feature encoder -- or simply
add it to the default encoder factory, because it doesn't change the default
behavior noticeably
In my opinion this approach is much better, because it makes good use of the
functionality we already have, it makes decisions about encoding where they
ought to be made, it's extremely flexible, it's intuitive (once the idea of
"feature extraction" versus "feature encoding" is understood), and it's easy to
provide a default implementation that just works even without that
understanding.
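For illustration only, steps 1 and 2 might look roughly like the Java sketch below. The types SimpleFeature, BagOfWords and BagOfWordsEncoder are hypothetical stand-ins written for this example, not the actual ClearTK classes, and the "name_word" naming of the encoded elements is likewise an assumption. Step 3 (registering the encoder in a factory) is left out.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a feature: a name plus an arbitrary value object.
final class SimpleFeature {
    final String name;
    final Object value;

    SimpleFeature(String name, Object value) {
        this.name = name;
        this.value = value;
    }
}

// Step 1: a trivial value class that wraps every word the extractor saw.
final class BagOfWords {
    final List<String> words;

    BagOfWords(List<String> words) {
        this.words = words;
    }
}

// Step 2: an encoder that dispatches on BagOfWords values only and scales the
// word counts to unit Euclidean length, independent of any other features.
final class BagOfWordsEncoder {
    boolean encodes(SimpleFeature feature) {
        return feature.value instanceof BagOfWords;
    }

    Map<String, Double> encode(SimpleFeature feature) {
        // Count how often each word occurs.
        Map<String, Double> counts = new LinkedHashMap<>();
        for (String word : ((BagOfWords) feature.value).words) {
            counts.merge(feature.name + "_" + word, 1.0, Double::sum);
        }
        // Divide every count by the Euclidean norm of the count vector.
        double sumOfSquares = 0.0;
        for (double count : counts.values()) {
            sumOfSquares += count * count;
        }
        final double norm = Math.sqrt(sumOfSquares);
        if (norm > 0.0) {
            counts.replaceAll((key, count) -> count / norm);
        }
        return counts;
    }
}
```

The annotation handler would then emit a single SimpleFeature whose value is a BagOfWords, and any other features in the instance pass through their own encoders untouched.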
Discuss :D
Original comment by pvogren@gmail.com
on 18 Feb 2009 at 7:10
[Steve]
> 1) write a trivial BagOfWords class and modify the feature extraction to
> wrap the long list of words in an object of that class
BagOfWords (which I'm going to refer to as FeatureGroup) would have to
be a subclass of Feature since Instances only have a List<Feature>.
That's a little odd, since it would be nonsensical to ask for the name
or value of a FeatureGroup. You'd probably want to override getName()
and getValue() to throw exceptions just so someone didn't accidentally
treat it like a real Feature.
> 2) write a feature encoder for BagOfWords -- this is where the actual
> normalization work is being done
Actually, that's not true of our current setup - normalization is done
in the FeaturesEncoder, not in the FeatureEncoder.
But this could work by creating a FeatureGroupFeatureEncoder which took
as a constructor parameter a FeaturesEncoder, and by making FeatureGroup
implement Iterable<Feature>. Then when FeatureGroupFeatureEncoder was
asked to encode a FeatureGroup, it would simply call the encodeAll()
method of the FeaturesEncoder.
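A rough Java sketch of that arrangement, for illustration only: Feature, FeaturesEncoder and the encodeAll() signature below are simplified assumptions made for this example (encoded output is just a list of strings), not the real ClearTK types.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Simplified stand-in for a feature with a name and a value.
class Feature {
    private final String name;
    private final Object value;

    Feature(String name, Object value) {
        this.name = name;
        this.value = value;
    }

    String getName() { return name; }
    Object getValue() { return value; }
}

// A group that is itself a Feature, so it fits into a List<Feature>.
// Asking it for a name or value is treated as a programming error.
class FeatureGroup extends Feature implements Iterable<Feature> {
    private final List<Feature> members;

    FeatureGroup(List<Feature> members) {
        super(null, null);
        this.members = new ArrayList<>(members);
    }

    @Override
    String getName() {
        throw new UnsupportedOperationException("a FeatureGroup has no name");
    }

    @Override
    Object getValue() {
        throw new UnsupportedOperationException("a FeatureGroup has no value");
    }

    @Override
    public Iterator<Feature> iterator() {
        return members.iterator();
    }
}

// Assumed shape of the encoder that handles (and may normalize) many features.
interface FeaturesEncoder {
    List<String> encodeAll(Iterable<Feature> features);
}

// Encodes a FeatureGroup by handing its members to a wrapped FeaturesEncoder.
class FeatureGroupFeatureEncoder {
    private final FeaturesEncoder delegate;

    FeatureGroupFeatureEncoder(FeaturesEncoder delegate) {
        this.delegate = delegate;
    }

    boolean encodes(Feature feature) {
        return feature instanceof FeatureGroup;
    }

    List<String> encode(Feature feature) {
        return delegate.encodeAll((FeatureGroup) feature);
    }
}
```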
> 3) write a features encoder factory to use the new feature encoder -- or
> simply add it to the default encoder factory, because it doesn't change
> the default behavior noticeably
There's currently no such thing as "the" default encoder factory right
now. We talked about creating one from FileSystemEncoderFactory, but
looking at the code, I'm not entirely sure how that would work - the
various encoder factories all seem to take very different approaches to
initialization.
Original comment by pvogren@gmail.com
on 18 Feb 2009 at 7:17
[Philipp]
>> 1) write a trivial BagOfWords class and modify the feature extraction to
>> wrap the long list of words in an object of that class
> BagOfWords (which I'm going to refer to as FeatureGroup) would have to
> be a subclass of Feature since Instances only have a List<Feature>.
> That's a little odd, since it would be nonsensical to ask for the name
> or value of a FeatureGroup.
That would be very odd indeed. What I mean is, instead of having 10000 Features
with a String value, we have one Feature with a BagOfWords (subclass of Object)
value (containing all the Strings). This is consistent with what we have now:
some features contain String values, some contain Integer values, some contain
Boolean values. The reason why we, together, decided to allow this is that we
recognized that feature encoders might want to handle different kinds of
features in different ways; we also thought it was important to not restrict
the type a value can have to a few arbitrary ones, because who knows what
special kinds of feature encoding people come up with. Your scenario is one
kind of special feature encoding that we hadn't thought of specifically, but
that's easily handled by the framework we came up with.
As soon as you want to treat all the features generated by a bag of words as a
unit of some kind (e.g. by normalizing them in a particular way), the features
aren't just a collection of individual values with context, they are a
hierarchical, complex value structure (i.e. one bag-of-words feature instead of
10000 string features). Thus the annotation handler should pass them on as such
to feature encoding. No change in API and no special cases in any of our
existing code is required.
Since you're coming back to the "FeatureGroup" name: I understand that there are
other scenarios where you might want to normalize a sub-group of the features.
But thinking of it in those terms when writing the annotation handler is bad.
The annotation handler creates / extracts features, it doesn't worry about how
they are encoded. The reason you want to normalize a specific sub-group of the
feature vector is not that they're part of an arbitrary group of features that
the annotation handler designated -- the reason is that they're all part of the
same bag of words, or that they were all created by some other collective
feature extractor, or that they all have something else in common. The
annotation handler has no business deciding what gets normalized or encoded in
a specific way. But if the feature encoding code lacks information to do the
kind of encoding that you need, the annotation handler needs to expose more of
the structure of the extracted features, that's all. We _may_ want to have an
abstract superclass FeatureGroup (subclass of Object) that BagOfWords inherits
from, as do other such feature collections that we come across.
>> 2) write a feature encoder for BagOfWords -- this is where the actual
>> normalization work is being done
> Actually, that's not true of our current setup - normalization is done
> in the FeaturesEncoder, not in the FeatureEncoder.
Yes, obviously that's not how we're doing this now. But our current setup was
also designed with the thought that you'd normalize the feature vector globally,
not taking into account its internal structure. This approach fails in your
scenario; attempting to work around that limitation will require a hack.
>> 3) write a features encoder factory to use the new feature encoder -- or
>> simply add it to the default encoder factory, because it doesn't change
>> the default behavior noticeably
> There's currently no such thing as "the" default encoder factory right
> now. We talked about creating one from FileSystemEncoderFactory, but
> looking at the code, I'm not entirely sure how that would work - the
> various encoder factories all seem to take very different approaches to
> initialization.
No, there's no one default encoder factory now, and there never will be, because
it wouldn't make any sense -- the reason we came up with all of this is that
different classifiers _require_ different encodings. We _do_, however, have a
default encoder factory for each classifier (and in some cases, like SVMlight /
LIBSVM, they share a common superclass). That is what I was talking about.
I realize that I'm always the opposing voice when we're discussing feature
encoding, and it often looks like I make things more complicated than they are.
The reason I feel strongly about this is: When we sat together and worked out
the feature encoding framework we have now, we did a _really_ good job. The way
we broke things up makes a lot of sense, it's extremely flexible and powerful,
at the same time the core idea is very simple and easy to understand, and we
managed to actually completely de-couple feature extraction from classifier
choice -- not in an ad-hoc way that's specific to a couple of standard
scenarios, but in a generalizable and conceptually sound way.
Attempting to "fix" the system by blurring the boundaries between feature
extraction and feature encoding that we created will severely weaken what we
have. The resulting work-arounds are idiosyncratic and don't generalize, but
moreover they are no easier for the beginner to understand than if we do it
RIGHT within the framework we have -- and they always make it much harder for
people who take the time to really understand what the framework does, and who
are trying to use its power for their purposes.
Original comment by pvogren@gmail.com
on 18 Feb 2009 at 7:19
[Steve]
> That would be very odd indeed. What I mean is, instead of having 10000
> Features with a String value, we have one Feature with a BagOfWords
> (subclass of Object) value (containing all the Strings).
Ah. I see. Yeah, that makes sense.
> Since you're coming back to the "FeatureGroup" name: I understand that
> there are other scenarios where you might want to normalize a sub-group
> of the features. But thinking of it in those terms when writing the
> annotation handler is bad. The annotation handler creates / extracts
> features, it doesn't worry about how they are encoded. The reason you
> want to normalize a specific sub-group of the feature vector is not that
> they're part of an arbitrary group of features that the annotation
> handler designated
Actually, that's *exactly* the kind of thing I want to normalize. I want
to be able to specify arbitrary features that are conceptually grouped.
For example, I might want to group together all lexical features or all
syntactic features. And once they're grouped, I might do any number of
things: normalization by group, adding additional weight to one group or
another, etc.
Isn't specifying which features are conceptually part of a unit exactly
the kind of thing that belongs in AnnotationHandler?
> > 2) write a feature encoder for BagOfWords -- this is where the actual
> > normalization work is being done
> >
> > Actually, that's not true of our current setup - normalization is done
> > in the FeaturesEncoder, not in the FeatureEncoder.
> >
> Yes, obviously that's not how we're doing this now. But our current
> setup was also designed with the thought that you'd normalize the
> feature vector globally, not taking into account its internal structure.
> This approach fails in your scenario; attempting to work around that
> limitation will require a hack.
I'm not sure what you're proposing here. Could you elaborate?
> The way we broke things up makes a lot of sense,
Generally.
> it's extremely flexible and powerful,
Absolutely.
> at the same time the core idea is very simple and easy to understand,
I think the fact that we've spent so much time debating how to do things
proves that the core idea is *not* simple or easy to understand. I'm not
saying it's wrong. I'm just saying it's not always intuitive, and there
isn't always one obvious way to do things.
> and we managed to actually completely de-couple feature extraction
> from classifier choice
Also a good thing.
> Attempting to "fix" the system by blurring the boundaries between
> feature extraction and feature encoding that we created will severely
> weaken what we have.
I think you're misinterpreting me here. I'm not trying to blur the
boundaries - I just don't see them as clearly as you do.
Original comment by pvogren@gmail.com
on 18 Feb 2009 at 7:27
[Philipp]
> Actually, that's *exactly* the kind of thing I want to normalize. I want
> to be able to specify arbitrary features that are conceptually grouped.
> For example, I might want to group together all lexical features or all
> syntactic features. And once they're grouped, I might do any number of
> things: normalization by group, adding additional weight to one group or
> another, etc.
Ok... I'm not sure I see why it would make sense to normalize an arbitrary
subset of the features that's not naturally grouped (like a bag of words would
be), but the fact that you're asking for it means I'm wrong about that. I guess
I _can_ see why for experimentation you'd want to, for example, ignore some
features, where the set of features to ignore is not easily apparent from the
way they are extracted. Ok.
> Isn't specifying which features are conceptually part of a unit exactly
> the kind of thing that belongs in AnnotationHandler?
Yes, I agree that the annotation handler is the right place to encode the
"structure" of the features, including which features are conceptually grouped.
I had assumed that that grouping, where relevant, would always correspond to the
way features are extracted and could be encoded that way (see the BagOfWords
example), but obviously I was wrong.
The question is, then: Are the feature groupings that you'll want to use always
strictly hierarchical, never overlapping? I.e., is it impossible for a feature
to belong to more than one group, for the purpose of feature encoding? If we do
not place such a restriction, things get complicated, and I don't think we can
avoid changes in the API. But if we feel comfortable with keeping such
restrictions, the solution is simple, and we've already discussed it in this
thread:
We introduce a FeatureGroup class (extends Object). A FeatureGroup has a name,
and it contains a set (list?) of Features, that's all. On the feature encoding
side we introduce a FeatureGroupEncoder (implements FeatureEncoder). The default
FeatureGroupEncoder works recursively like a FeaturesEncoder: it has a list of
FeatureEncoders and simply encodes the Features in the FeatureGroup one by one.
Then we can introduce other FeatureGroupEncoders that will dispatch only on
FeatureGroups of a given name and do things like normalization (or parameterize
the default FeatureGroupEncoder to be able to do that or whatever).
Required to get this to work with what we have:
- create the (trivial) FeatureGroup class
- the annotation handler can manually wrap lists of features in FeatureGroups
(giving the groups names, so they can be identified during feature encoding)
- create the encoder classes mentioned above, and change our default encoder
factories to include a trivial feature group encoder, which effectively
flattens out all the feature groups
- to customize behavior based on feature groups, write an encoder factory that
includes custom feature group encoders, which dispatch based on the name of a
feature group; each feature group encoder has its own list of feature encoders,
customizing the encoding of individual features in that group, and it may also
do group-wide operations such as normalizations
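As a rough illustration of the list above, here is a Java sketch under some loud assumptions: Feature, FeatureGroup and FeatureGroupEncoder are simplified stand-ins written for this example rather than ClearTK's classes, groups are strictly hierarchical, the per-group list of feature encoders is omitted, and the normalizing variant assumes every member of the group is numeric.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in: a feature is a name plus an arbitrary value.
final class Feature {
    final String name;
    final Object value;

    Feature(String name, Object value) {
        this.name = name;
        this.value = value;
    }
}

// A named group of features, used as the *value* of a single Feature.
final class FeatureGroup {
    final String name;
    final List<Feature> features;

    FeatureGroup(String name, List<Feature> features) {
        this.name = name;
        this.features = features;
    }
}

final class FeatureGroupEncoder {
    private final String groupName;   // null means "accept every group"
    private final boolean l2Normalize;

    FeatureGroupEncoder(String groupName, boolean l2Normalize) {
        this.groupName = groupName;
        this.l2Normalize = l2Normalize;
    }

    // Dispatch on FeatureGroup values, optionally only on a given group name.
    boolean encodes(Feature feature) {
        return feature.value instanceof FeatureGroup
                && (groupName == null
                    || groupName.equals(((FeatureGroup) feature.value).name));
    }

    // Default behavior: flatten the group. With l2Normalize set, also scale the
    // (assumed numeric) members to unit Euclidean length within the group.
    List<Feature> encode(Feature feature) {
        List<Feature> flat = new ArrayList<>();
        flatten((FeatureGroup) feature.value, flat);
        if (!l2Normalize) {
            return flat;
        }
        double sumOfSquares = 0.0;
        for (Feature member : flat) {
            double v = ((Number) member.value).doubleValue();
            sumOfSquares += v * v;
        }
        double norm = Math.sqrt(sumOfSquares);
        List<Feature> scaled = new ArrayList<>();
        for (Feature member : flat) {
            double v = ((Number) member.value).doubleValue();
            scaled.add(new Feature(member.name, norm == 0.0 ? 0.0 : v / norm));
        }
        return scaled;
    }

    private static void flatten(FeatureGroup group, List<Feature> out) {
        for (Feature member : group.features) {
            if (member.value instanceof FeatureGroup) {
                flatten((FeatureGroup) member.value, out);
            } else {
                out.add(member);
            }
        }
    }
}
```

A default factory would register `new FeatureGroupEncoder(null, false)`, which only flattens; a custom factory could add, say, `new FeatureGroupEncoder("lexical", true)` ahead of it.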
> > Yes, obviously that's not how we're doing this now. But our current
> > setup was also designed with the thought that you'd normalize the
> > feature vector globally, not taking into account its internal structure.
> > This approach fails in your scenario; attempting to work around that
> > limitation will require a hack.
> I'm not sure what you're proposing here. Could you elaborate?
I'm saying: The reason I implemented normalization in the FeaturesEncoder was
that I only intended normalization to be done on an entire feature vector,
treating all elements the same (I admit that that was pretty short-sighted of
me). Normalization of subsets of a feature vector should NOT be done in a
FeaturesEncoder. We already have functionality in place that handles special
encodings of individual features: the FeatureEncoders. So, in order to get
normalization working on a subset of the features, we should encapsulate that
subset in one feature (e.g. the above mentioned FeatureGroup object). This
allows us to include a FeatureEncoder that dispatches only on Features that
have a FeatureGroup value (and then possibly only if the group has a certain
name); such a FeatureEncoder can then do its own normalization, which would
normalize all the features _in that feature group_, independent from the rest.
> I think you're misinterpreting me here. I'm not trying to blur the
> boundaries - I just don't see them as clearly as you do.
I didn't mean to imply that you intended to do that, merely that some of the
suggestions that have come up would have that effect. I believe the main problem
(and the reason that, as you say, this is NOT easy to understand) is that we're
still fuzzy on some of the concepts. That's why it's good we have these
discussions.
The main distinction I'm trying to uphold here is the one between feature
extraction and feature encoding, which to me are two separate things. In my
mental model I place feature extraction entirely in the domain of annotation
handlers, and feature encoding in, well, the feature encoding code.
Feature extraction is, to me, the process of analyzing the "subject of analysis"
or SOFA (to use UIMA's terminology), and identifying and collecting the
presumed relevant bits of information, with some limitations on the complexity
of those bits of information (e.g. strings are fine, but an entire parse tree
is too complex). The considerations influencing this process are, first of all,
specific to the task one is trying to accomplish; in our field, a lot of the
time these will be linguistic considerations, or a general intuition about which
bits of information are useful and which aren't. This can be done without any
knowledge about how the bits of information are used in the end -- the
assumption is that the machine learning system figures that out, as that is
what it's designed to do.
Feature encoding, on the other hand, is not at all concerned with the subject of
analysis. It simply sees a collection of "bits of information" of various types
and has to bring them into a form that the machine learning system can use. It
has to struggle with the fact that most machine learning systems can't
understand all types of information that might arrive; and even if the ML system
basically understands the information, presenting it in a different way might
improve overall performance (think of presenting an integer number as one
numeric SVM feature (123:3) vs. as multiple binary ones
(123:0 124:0 125:1 126:0)). The main consideration going into feature encoding
is a deep understanding of the exact ML algorithm that's being used: e.g. what
kind of normalization has which effect, how does the algorithm handle numeric
features when mixed with boolean features, how expensive is it to have a large
number of features, should I give the features long or short names, what's the
best way to encode a string into a numeric vector? I doubt that most potential
users of our system have the expertise to make many informed decisions in the
context of feature encoding, and if we just make sure to provide the most
useful default configuration they'd do best to leave it alone. I recognize of
course that people will want to experiment with it anyway, even though it may
be blind experimentation.
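To make the integer example concrete, here is a small Java sketch of the two encodings. IntegerEncodings and its method names are hypothetical, and the value range [1, 4] used to reproduce the figures above is an assumption; the "index:value" output mimics the SVM-style notation used in the example.

```java
// Two ways to present the same integer value to a classifier.
final class IntegerEncodings {
    // One numeric feature: "<index>:<value>".
    static String asNumeric(int index, int value) {
        return index + ":" + value;
    }

    // One binary feature per possible value in [min, max], starting at
    // baseIndex; only the slot matching the actual value is set to 1.
    static String asBinary(int baseIndex, int value, int min, int max) {
        StringBuilder sb = new StringBuilder();
        for (int v = min; v <= max; v++) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(baseIndex + (v - min)).append(':').append(v == value ? 1 : 0);
        }
        return sb.toString();
    }
}
```

With these definitions, `IntegerEncodings.asNumeric(123, 3)` yields `123:3`, and `IntegerEncodings.asBinary(123, 3, 1, 4)` yields `123:0 124:0 125:1 126:0`. Which of the two works better depends on the learner, which is the point being made about encoding.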
For me, these two are conceptually AND practically distinct. Certainly some
things require simultaneous changes to both, but to me it's usually pretty clear
what functionality should go where. Am I alone in this, and does this clear
distinction not make sense?
Original comment by pvogren@gmail.com
on 18 Feb 2009 at 7:45
[Steve]
On 2/16/2009 11:44 PM, Philipp Wetzler wrote:
[snip description of FeatureGroup, FeatureGroupEncoder, etc.]
This sounds basically fine, but I don't think we need to put it into
ClearTK right now. I'm the only one who currently needs it, and it's
certainly not a feature for a basic user. I recommend that we let me
implement the functionality in my own code, use it for a while, and then
at some later point we discuss whether or not to add it to ClearTK.
> Feature extraction [...] can be done without any knowledge about how
> the bits of information are used in the end -- the assumption is that
> the machine learning system figures that out
> [...]
> The main consideration going into feature encoding is a deep
> understanding of the exact ML algorithm that's being used
This basically sounds like "feature extraction is task dependent and
classifier independent" and "feature encoding is classifier dependent
and (maybe) task independent". Is that right?
> For me, these two are conceptually AND practically distinct. Certainly
> some things require simultaneous changes to both, but to me it's usually
> pretty clear what functionality should go where. Am I alone in this, and
> does this clear distinction not make sense?
Well, I can't speak for 2PO, but I certainly wouldn't say that it's been
clear to me which functionality should go where. Consider the following
two examples of classifier-independent things you might want to do:
(1) Applying a Euclidean norm to feature vectors. This is pretty much
the standard for a TF-IDF document representation, regardless of what
classifier you plan to give that representation to.
(2) Making the training and testing data from two different runs
compatible such that the model trained on the training data can be
tested on the testing data (e.g. the feature names/indices match, etc.)
Both of these things should work for any classifier, so I consider them
classifier-independent. But they're both currently handled in the
feature encoding layer. Why?
Original comment by pvogren@gmail.com
on 18 Feb 2009 at 7:48
[Philipp]
> This basically sounds like "feature extraction is task dependent and
> classifier independent" and "feature encoding is classifier dependent
> and (maybe) task independent". Is that right?
Effectively, that's how it seems to work out.
> [...] Consider the following
> two examples of classifier-independent things you might want to do:
>
> (1) Applying a Euclidean norm to feature vectors. This is pretty much
> the standard for a TF-IDF document representation, regardless of what
> classifier you plan to give that representation to.
I believe you when you say that that's the standard thing people do, even for
classifiers that don't profit from it -- but that doesn't mean it makes any
sense whatsoever. When I consider how to take a list of TF-IDF values and put
them into an SVM training data file, it makes a lot of sense to consider
normalization schemes, because they _will_ make an immediate and predictable
difference (assuming that I know the SVM implementation well enough). I'm
curious what justification people have for normalizing their features without
taking the classifier into consideration -- I'd really like to know, because I
imagine there is a reason that I'm simply not aware of.
So, going by my current understanding, this kind of normalization only becomes
meaningful in the context of a specific classifier. That doesn't mean you can't
do it on _every_ classifier, but to make an informed decision about using this
normalization you look at the classifier, not the task. So unless there is
another reason to do normalization, this would be feature encoding.
If there is another reason, of course, and the normalization is NOT being done
for the sake of the classifier, then that normalization should be done during
feature extraction.
> (2) Making the training and testing data from two different runs
> compatible such that the model trained on the training data can be
> tested on the testing data (e.g. the feature names/indices match, etc.)
>
> Both of these things should work for any classifier, so I consider them
> classifier-independent. But they're both currently handled in the
> feature encoding layer. Why?
Actually, (2) is _not_ necessary for every classifier -- there are various
classifiers that do their own mapping (i.e. the training data we generate simply
uses names instead of indices), right? So that alone settles it. Reason 2: the
mapping requires knowing how features are encoded (e.g. numbers: one feature
index or many? it affects the mapping). How features are encoded is definitely
classifier dependent (meaning an informed decision takes the classifier into
account). Reason 3: the only reason we care about having feature indices is
that because of the way ML algorithms work most training data formats require
us to use them -- we only have them to accommodate classifier limitations.
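For readers unfamiliar with the mapping under discussion, here is a minimal Java sketch of a name-to-index table that grows during training and is frozen for testing. FeatureIndex is a hypothetical class written for this example, not ClearTK code; the 1-based numbering and the choice to return -1 for unseen names are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Maps feature names to integer indices for index-based training formats.
final class FeatureIndex {
    private final Map<String, Integer> indices = new LinkedHashMap<>();
    private boolean frozen = false;

    // During training, an unseen name is assigned the next free index.
    // After freeze(), an unseen name maps to -1 so the caller can drop it.
    int indexOf(String name) {
        Integer existing = indices.get(name);
        if (existing != null) {
            return existing;
        }
        if (frozen) {
            return -1;
        }
        int next = indices.size() + 1; // 1-based by assumption
        indices.put(name, next);
        return next;
    }

    // Call once training data has been written; test data reuses the table.
    void freeze() {
        frozen = true;
    }
}
```

A classifier whose data format accepts feature names directly would not need such a table at all, which is the first reason given above.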
Looking at the explanations I wrote I guess I'd say: task-dependent vs.
classifier-dependent, yes. But dependent not in the sense of "I can't do the
same for different classifiers (or tasks)", but "to make an _informed_ choice
about how to do it, I need to primarily consider the classifier (or task)". Of
course a user can still pick an arbitrary normalization scheme because they
read in that one paper that "they normalized the data, and it improved
accuracy" (never mind that their whole setup was completely different). Users
can do that, wherever we put that functionality. But if we're not careful about
where things should go it makes life difficult for users who know what they're
doing and want to customize the system to do what they _know_ will work.
Original comment by pvogren@gmail.com
on 18 Feb 2009 at 8:00
[Steve]
> >
> > This basically sounds like "feature extraction is task dependent and
> > classifier independent" and "feature encoding is classifier dependent
> > and (maybe) task independent". Is that right?
> >
> Effectively, that's how it seems to work out.
Well that's a good rule of thumb that we should document somewhere.
> > [...] Consider the following
> > two examples of classifier-independent things you might want to do:
> > (1) Applying a Euclidean norm to feature vectors. This is pretty much
> > the standard for a TF-IDF document representation, regardless of what
> > classifier you plan to give that representation to.
> >
> I believe you when you say that that's the standard thing people do,
> even for classifiers that don't profit from it -- but that doesn't mean
> it makes any sense whatsoever.
"Finally, ltc weighting handles differences in document length by cosine
normalizing the feature vectors (normalizing them to have a Euclidean
norm of 1.0)."
-- David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li
RCV1: A New Benchmark Collection for Text Categorization Research
They don't mention anything about specific classifiers here, and it
sounds like a task-based reasoning. But more importantly, I suspect they
do this because it's what everyone else has done in the past, and they
want to be able to compare results.
I think trying to stop people from doing normalization for any
classifier they want is a *big* mistake. ClearTK should make whatever it
has available, and let people mix-and-match as they like, regardless of
whether or not *we* think it makes sense.
<digression>
I feel pretty strongly about this, given my experiences distributing the
argparse Python library. Argparse started as an extension of optparse,
and optparse makes claims like:
"Some other option syntaxes that the world has seen include...
"-pf"..."-file"..."+rgb"..."/file"... These option syntaxes are not
supported by optparse, and they never will be. This is deliberate: the
first three are non-standard on any environment..."
This is foolish. Just because you don't like a particular thing is no
justification to keep other people from doing it. Give them some credit
- they probably have their reasons. For example, some of my argparse
users explained that they had to maintain backwards compatibility with
an existing command line interface. With argparse, they can do that
because it doesn't tell them how to design their own command lines. With
optparse, they can't.
</digression>
For ClearTK, I argue that people may have their own reasons for
normalizing for any classifier, and we shouldn't keep them from doing
that just because we think it's the wrong thing to do. Give them some
credit - they probably have their reasons.
In general, I think that anything that *can* be done for any classifier,
regardless of whether or not we think it *should* be done, should be
available in the feature extraction layer.
> > (2) Making the training and testing data from two different runs
> > compatible such that the model trained on the training data can be
> > tested on the testing data (e.g. the feature names/indices match, etc.)
> >
> Actually, (2) is _not_ necessary for every classifier
That's not right. Making compatible training and testing data *is*
necessary for every classifier. True, for some classifiers there's some
code to do this, and for some classifiers it's a no-op. But the task of
generating matching training and testing data is common to all classifiers.
Let me ask the question a different way. Right now, the creation of
encoders is done in DataWriter_ImplBase, and it has a couple of options:
(1) If an EncoderFactory is specified it creates that object
(2) If an EncoderFactory is not specified it creates the default object
Why doesn't it make sense to add a third:
(3) If (somehow) requested, it loads the object from a file
All of these tasks are "get me some encoders" tasks. Why do the first
two belong in DataWriter_ImplBase, but the third belongs in an
EncoderFactory?
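To make the question concrete, the three cases could be pictured as a single resolution step, sketched here in Java. Everything in this block is hypothetical: Encoders, EncoderResolution and resolve() are invented for illustration and do not reflect how DataWriter_ImplBase is actually written, and the precedence (explicit factory, then stored file, then default) is an assumption.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for whatever object bundles the encoders.
interface Encoders {
    String describe();
}

final class EncoderResolution {
    // Each argument stands for one "get me some encoders" strategy; a null
    // argument means that strategy was not requested.
    static Encoders resolve(Supplier<Encoders> explicitFactory,
                            Supplier<Encoders> fileLoader,
                            Supplier<Encoders> defaultFactory) {
        if (explicitFactory != null) {
            return explicitFactory.get();   // case (1): factory specified
        }
        if (fileLoader != null) {
            return fileLoader.get();        // case (3): load from a file
        }
        return defaultFactory.get();        // case (2): fall back to default
    }
}
```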
Steve
P.S. Remember that part of the point of this discussion is to explain
why the boundaries aren't as clear for others as they are for you. Can
you at least see why there's some confusion as to what goes where?
Original comment by pvogren@gmail.com
on 18 Feb 2009 at 8:06
[Philipp]
> They don't mention anything about specific classifiers here, and it
> sounds like a task-based reasoning. But more importantly, I suspect they
> do this because it's what everyone else has done in the past, and they
> want to be able to compare results.
>
> I think trying to stop people from doing normalization for any
> classifier they want is a *big* mistake. ClearTK should make whatever it
> has available, and let people mix-and-match as they like, regardless of
> whether or not *we* think it makes sense.
As I _tried_ to say in my response, I _do not_ advocate keeping people from
doing whatever they want. And the quote you've given above only says that
normalizing compensates for document length, not why document length would be
an issue otherwise. I'm actually interpreting this to be classifier-based
reasoning (I know it would be an issue for some _classifiers_, but I'm also
pretty sure there are some where it wouldn't), but the quote doesn't actually
say.
I am in no way saying we shouldn't let people mix and match all the
functionality we
have, in whatever way they like -- as you say, I'm sure they have their reason.
I'm
not saying we should make it difficult, either. I'm just saying we should
structure
our code so that functionality that's necessary in order to accommodate
classifiers
is kept on one side, whereas functionality that arises from the task itself,
ignoring
the classifier, is kept on the other.
> In general, I think that anything that *can* be done for any classifier,
> regardless of whether or not we think it *should* be done, should be
> available in the feature extraction layer.
There's very little that _can't_ be done for every classifier. If we follow
this rule, we might as well scrap the whole feature encoding layer and pack
it all into feature extraction. The resulting output of feature extraction
will, technically, be usable with any classifier, but actually it will be
necessary to hand-optimize feature extraction for different classifiers to
make best use of their capabilities. In our current model, when switching
the classifier the feature extraction code can be left alone.
E.g., for pretty much all classifiers you *can* l2-normalize the features,
so let's say we put that functionality into feature extraction, and because
I'm using SVMlight and l2-normalization helps with that, I'll turn it on.
But then I switch to a different SVM implementation, and the documentation
explains that, due to the different algorithm they use, performance will be
better if I scale all features to within [0, 1]. I _could_ just ignore that
advice, because, after all, I *can* still use l2-normalization. But in
practice, because I care about getting good performance, I'll have to
change feature extraction to accommodate a new classifier.
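To make the SVMlight example concrete, here is a minimal sketch of the two scalings written as interchangeable encoders over the same extracted values. Everything in it is an assumption made for illustration: `VectorEncoder`, `L2NormEncoder` and `UnitIntervalEncoder` are hypothetical names, not ClearTK classes, and the per-feature minimum and maximum are assumed to have been collected from training data beforehand.

```java
import java.util.Arrays;

/** Sketch only: all types here are hypothetical, not ClearTK classes. */
class ScalingSketch {

    /** Hypothetical encoder: turns extracted values into what one classifier prefers. */
    interface VectorEncoder {
        double[] encode(double[] values);
    }

    /** Divides each value by the Euclidean (L2) norm of the whole vector. */
    static class L2NormEncoder implements VectorEncoder {
        @Override
        public double[] encode(double[] values) {
            double sumOfSquares = 0.0;
            for (double v : values) {
                sumOfSquares += v * v;
            }
            double norm = Math.sqrt(sumOfSquares);
            double[] out = new double[values.length];
            for (int i = 0; i < values.length; i++) {
                // An all-zero vector stays all-zero instead of dividing by zero.
                out[i] = norm == 0.0 ? 0.0 : values[i] / norm;
            }
            return out;
        }
    }

    /** Rescales each feature to [0, 1] using per-feature ranges (assumed known from training). */
    static class UnitIntervalEncoder implements VectorEncoder {
        private final double[] min;
        private final double[] max;

        UnitIntervalEncoder(double[] min, double[] max) {
            this.min = min;
            this.max = max;
        }

        @Override
        public double[] encode(double[] values) {
            double[] out = new double[values.length];
            for (int i = 0; i < values.length; i++) {
                double range = max[i] - min[i];
                double scaled = range == 0.0 ? 0.0 : (values[i] - min[i]) / range;
                // Clamp values that fall outside the range seen in training.
                out[i] = Math.max(0.0, Math.min(1.0, scaled));
            }
            return out;
        }
    }

    public static void main(String[] args) {
        double[] extracted = {3.0, 4.0}; // the extraction output is the same in both cases
        VectorEncoder first = new L2NormEncoder();
        VectorEncoder second =
            new UnitIntervalEncoder(new double[] {0.0, 0.0}, new double[] {6.0, 8.0});
        System.out.println(Arrays.toString(first.encode(extracted)));
        System.out.println(Arrays.toString(second.encode(extracted)));
    }
}
```

In this arrangement, switching SVM implementations means choosing a different encoder; the code that produced `extracted` is not touched.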
I'm just going to quickly mention the possibility of a classifier that only
takes boolean features, not numeric ones. With our current split that can be
accommodated easily.
I don't understand why this is any problem at all. No one is advocating
stopping people from doing anything. If anything, I'm advocating a framework
that encourages people to evaluate their choices in the proper context --
without forcing them to do so.
> Let me ask the question a different way. Right now, the creation of
> encoders is done in DataWriter_ImplBase, and it has a couple of options:
>
> (1) If an EncoderFactory is specified it creates that object
> (2) If an EncoderFactory is not specified it creates the default object
>
> Why doesn't it make sense to add a third:
>
> (3) If (somehow) requested, it loads the object from a file
>
> All of these tasks are "get me some encoders" tasks. Why do the first
> two belong in DataWriter_ImplBase, but the third belongs in an
> EncoderFactory?
As I have said before, I do not think that (2) belongs in
DataWriter_ImplBase; consequently (3) doesn't either. I suggested before
that we remove (2) and instead always specify a default factory in our
descriptor files (see our last big thread on this subject).
DataWriter_ImplBase shouldn't be concerned with how (or where) an encoder
is created; it should delegate that task to a factory.
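A minimal sketch of that delegation, assuming hypothetical stand-in types throughout (`FeatureEncoder`, `EncoderFactory`, `DefaultEncoderFactory`, `RestoredEncoderFactory` and `DataWriter` are illustrative names, not the ClearTK classes). The writer only asks a factory for encoders; the default case and the loaded-from-a-file case each become one more factory. How a factory would actually read encoders from a file is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: every type here is a hypothetical stand-in, not the ClearTK API. */
class FactoryDelegationSketch {

    interface FeatureEncoder {
        String describe();
    }

    /** The single place that decides how (or where) encoders are created. */
    interface EncoderFactory {
        List<FeatureEncoder> createEncoders();
    }

    /** Stand-in for the default factory that a descriptor file would name explicitly. */
    static class DefaultEncoderFactory implements EncoderFactory {
        @Override
        public List<FeatureEncoder> createEncoders() {
            List<FeatureEncoder> encoders = new ArrayList<>();
            encoders.add(() -> "default string encoder");
            return encoders;
        }
    }

    /** Stand-in for option (3): hands back encoders that were restored elsewhere. */
    static class RestoredEncoderFactory implements EncoderFactory {
        private final List<FeatureEncoder> restored;

        RestoredEncoderFactory(List<FeatureEncoder> restored) {
            this.restored = restored;
        }

        @Override
        public List<FeatureEncoder> createEncoders() {
            return new ArrayList<>(restored);
        }
    }

    /** The writer never branches on where encoders come from; it only asks the factory. */
    static class DataWriter {
        private final List<FeatureEncoder> encoders;

        DataWriter(EncoderFactory factory) {
            this.encoders = factory.createEncoders();
        }

        int encoderCount() {
            return encoders.size();
        }
    }

    public static void main(String[] args) {
        DataWriter writer = new DataWriter(new DefaultEncoderFactory());
        System.out.println(writer.encoderCount());
    }
}
```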
> P.S. Remember that part of the point of this discussion is to explain
> why the boundaries aren't as clear for others as they are for you. Can
> you at least see why there's some confusion as to what goes where?
I can see *that* there is confusion. I'm really trying to, but even with all
those examples I honestly don't seem to get *why* -- not sure what that
says. I also don't get why it is an issue. Do you consider it problematic
that, in order to get their desired behavior, people will need to make some
changes to feature encoding, in addition to whatever they're doing in
feature extraction, instead of having to do the same amount of work all in
feature extraction? Even at the cost of giving up (at the least) some degree
of classifier transparency?
Original comment by pvogren@gmail.com
on 18 Feb 2009 at 8:12
[Steve]
> I'm just saying we should structure our code so that
> functionality that's necessary in order to accommodate classifiers is
> kept on one side, whereas functionality that arises from the task
> itself, ignoring the classifier, is kept on the other.
I'm still unable in practice to make the task/classifier distinction in
the same way you do. If I'm doing a task where the standard
representation is TF-IDF with Euclidean normalization, I think of that
as part of the task because it's part of the representation of the
feature space [1]. But you think of it as part of the classifier (I
gather) because it may be more or less effective depending on the
classifier.
[1] Note that to me this is different from the SVM having to encode
feature names as numbers. I can just as easily normalize while the
feature names are still strings.
> > P.S. Remember that part of the point of this discussion is to explain
> > why the boundaries aren't as clear for others as they are for you. Can
> > you at least see why there's some confusion as to what goes where?
> >
> I can see *that* there is confusion. I'm really trying to, but even with
> all those examples I honestly don't seem to get *why* -- not sure what
> that says. I also don't get why it is an issue.
It's an issue because, as a user of ClearTK, I don't know where best to
put things. This discussion started because there were two approaches to
implementing the kind of feature groupings I needed: normalization
during feature extraction, and normalization during feature encoding.
Both would achieve my goals equally well, and both seem to be about the
same amount of code. My first intuition is to do it during feature
extraction because it's part of the task representation, but your first
intuition is to do it during feature encoding.
Original comment by pvogren@gmail.com
on 18 Feb 2009 at 8:15
[Philipp]
> I'm still unable in practice to make the task/classifier distinction in
> the same way you do. If I'm doing a task where the standard
> representation is TF-IDF with Euclidean normalization, I think of that
> as part of the task because it's part of the representation of the
> feature space [1]. But you think of it as part of the classifier (I
> gather) because it may be more or less effective depending on the
> classifier.
Maybe I should define what I mean by "task": Let's say you're doing document
classification, for example by topic. When I say "task", I mean just that:
deciding if document X is topic A or topic B. I'm not talking about what
people usually do for this kind of problem; I'm not talking about a "task"
at a conference that gives you specific framing conditions; I'm not talking
about reproducing the approach of someone else; I'm not talking about using
a specific "feature space". All of those things are important, but they're
not part of what I call the "task".
But for document classification, certain pieces of information are known to
be useful independent of all external framing conditions. The presence of
certain words is known to be a useful bit of information. It's also known
that the frequency of a word in the document divided by its frequency in the
corpus is useful (and it's a distinct piece of information). These are
useful because they carry information about the topic; they are
task-specific. Multiplying the number I use to represent that information by
0.5 does NOT give me any more information about the topic. Certainly many
researchers also used normalization schemes on the resulting data, and it
improved their performance. But that improvement is not because
"normalization is a good thing to do for document classification", but
because "normalization is a good thing to do for many ML algorithms". Can
you see that at all?
Of course people will want to reproduce what other researchers did, or do
what's considered good practice. They can do that by customizing feature
encoding along with feature extraction. The "representation of the feature
space" is a result of both of them, combined.
> It's an issue because, as a user of ClearTK, I don't know where best to
> put things. This discussion started because there were two approaches to
> implementing the kind of feature groupings I needed: normalization
> during feature extraction, and normalization during feature encoding.
> Both would achieve my goals equally well, and both seem to be about the
> same amount of code. My first intuition is to do it during feature
> extraction because it's part of the task representation, but your first
> intuition is to do it during feature encoding.
Ok, yes, that is an issue. I'm not sure how to deal with it.
Obviously I have a different background from you two. I started out in ML,
and I first applied it to a completely separate kind of problem (computer
vision) before coming to NLP. I guess it's not surprising that my mind would
break down the problem in a different way, and it seems clear by now that
I'm unable to explain that way to you. On the other hand, I can't let it go,
because it's clear to me that cleanly breaking things into extraction and
encoding is much, much better, and will make things far easier in the
future; I do not want to go back to the old way. So to summarize, I don't
know what to do about it.
Original comment by pvogren@gmail.com
on 18 Feb 2009 at 8:19
Original comment by pvogren@gmail.com
on 18 Feb 2009 at 8:20
[Steve]
> But for document classification, certain pieces of information are known
> to be useful independent of all external framing conditions. The
> presence of certain words is known to be a useful bit of information.
> It's also known that the frequency of a word in the document divided by
> its frequency in the corpus is useful (and it's a distinct piece of
> information).
It's also known that it's important to account for differences in
document length. (Hence the normalization.)
Why is document length a classifier-specific thing?
Original comment by pvogren@gmail.com
on 18 Feb 2009 at 8:23
[Philipp]
Can you give me any reason why accounting for document length is important
(I mean a detailed explanation, showing how the way you account for it
affects the final outcome) that does NOT involve an ML classifier?
Original comment by pvogren@gmail.com
on 18 Feb 2009 at 8:24
[Steve]
Probably no better than anyone can explain a theoretical motivation for
TF-IDF. But here goes:
A word occurring once in a 10-word document is more important than a
word occurring once in a 100-word document because in the 10-word
document, it makes up a larger part of the document content.
Original comment by pvogren@gmail.com
on 18 Feb 2009 at 8:25
[Philipp]
So why, instead of normalization, don't you just include another feature
that says "this document is 100 words long"? It would certainly be easier,
and the information content is the same (or even higher).
AFAIK people choose to do normalization instead because that way ML systems
are much less easily confused.
Original comment by pvogren@gmail.com
on 18 Feb 2009 at 8:27
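The two representations being compared in the last two comments can be put side by side using the 10-word and 100-word documents from the example. This is only an illustrative sketch: the class and method names are made up, and dividing the count by the document length is just one simple way of accounting for length.

```java
/** Sketch only: two ways to expose document length to a classifier. */
class DocumentLengthSketch {

    /** Length-normalized frequency: the share of the document made up by the word. */
    static double relativeFrequency(int count, int documentLength) {
        if (documentLength <= 0) {
            throw new IllegalArgumentException("document must contain at least one word");
        }
        return (double) count / documentLength;
    }

    /** Alternative: leave the count untouched and report the length as a second feature. */
    static double[] countPlusLength(int count, int documentLength) {
        return new double[] {count, documentLength};
    }

    public static void main(String[] args) {
        // One occurrence in a 10-word document versus one in a 100-word document.
        System.out.println(relativeFrequency(1, 10));
        System.out.println(relativeFrequency(1, 100));
        // The same two documents with the length kept as a separate feature.
        System.out.println(java.util.Arrays.toString(countPlusLength(1, 10)));
        System.out.println(java.util.Arrays.toString(countPlusLength(1, 100)));
    }
}
```

The first form folds the length into the value itself; the second keeps the raw count and leaves it to the classifier to relate the two numbers.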
[Steve]
Sure, there's a hundred different ways to encode any feature. Why use
TF-IDF? Why not use a TF feature and an IDF feature for every word? "The
information content is the same (or even higher)"
Original comment by pvogren@gmail.com
on 18 Feb 2009 at 8:27
[Philipp]
That's correct. Our TF-IDF extractor does mix concepts and is a bit
classifier specific. I've thought about trying to rewrite it, but haven't
had time to really think it through yet.
Yes, there are a hundred different ways to encode any feature. And that's
exactly why feature encoding shouldn't be done together with feature
extraction. Feature extraction is about gathering information, not about how
to represent it. The "classifier dependent / classifier independent"
distinction only arises from that because classifiers tend to be very picky
about the way a feature is _encoded_, while they don't care at all what
_information_ a feature carries.
Original comment by pvogren@gmail.com
on 18 Feb 2009 at 8:28
[Steve]
Some other things that should be feature encodings under this logic:
* SyntacticPathExtractor - converting the parts of the path into a
"XX::YY;;XX" string is representing information, not gathering it
* SubCategorizationExtractor - combining the parent and child nodes into
a "XX -> YY ZZ" string is representing information, not gathering it
* NGramExtractor - joining the pieces of the ngram into a "xx|yy|zz"
string is representing information, not gathering it
Is this really what you're proposing?
Original comment by pvogren@gmail.com
on 18 Feb 2009 at 8:29
[Philipp]
I did always feel it was awkward to flatten out that information -- what if
someone came up with a classifier that could generalize over _parts_ of a
syntactic path (i.e. you give the classifier two paths, and the classifier
sees "the first three path elements are the same" and generalizes over
that)? There are even now things like SVMstruct, and custom kernels and
such, so it is conceivable for such a classifier to exist.
So strictly, yes, that is what I'm proposing. Those extractors should create
a feature with a custom value that encapsulates the parts, and a feature
encoder should take that value and encode it in such a way that the
classifier can use it.
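As a rough sketch of what such a split could look like for the syntactic path case: the extractor would hand over a structured value, and separate encoders would decide how to present it. All names below are hypothetical, and the reading of "::" as joining the upward part of the path and ";;" as introducing each downward step is an assumption based only on the example strings in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch only: SyntacticPath and both encoders are hypothetical, not ClearTK classes. */
class SyntacticPathSketch {

    /** Structured feature value: labels on the way up, then labels on the way down. */
    static class SyntacticPath {
        final List<String> up;
        final List<String> down;

        SyntacticPath(List<String> up, List<String> down) {
            this.up = new ArrayList<>(up);
            this.down = new ArrayList<>(down);
        }
    }

    /** Encoder for classifiers that only accept flat string features. */
    static String encodeAsString(SyntacticPath path) {
        // Assumed separators: "::" between upward labels, ";;" before each downward label.
        StringBuilder sb = new StringBuilder(String.join("::", path.up));
        for (String label : path.down) {
            sb.append(";;").append(label);
        }
        return sb.toString();
    }

    /** Encoder for a hypothetical classifier that can compare individual path elements. */
    static List<String> encodeAsParts(SyntacticPath path) {
        List<String> parts = new ArrayList<>(path.up);
        parts.addAll(path.down);
        return parts;
    }

    public static void main(String[] args) {
        SyntacticPath path = new SyntacticPath(Arrays.asList("NP", "S"), Arrays.asList("VP"));
        System.out.println(encodeAsString(path));
        System.out.println(encodeAsParts(path));
    }
}
```

The extractor in this picture only builds the `SyntacticPath`; whether it ends up as one string or as separate elements is decided by whichever encoder is configured.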
Now, I'm not saying we can't cheat a little bit, especially if it's in
isolated cases (special purpose extractors that aren't used everywhere). But
complex extractors that are used in lots of places, or functionality that is
universal (like normalization schemes) should be done right.
Original comment by pvogren@gmail.com
on 18 Feb 2009 at 8:31
[Steve]
Well, I at least now see where you're going: anything that's just taking
a piece of information from the CAS is feature extraction, doing
anything at all with that information is feature encoding.
That said, I'm probably always going to "cheat" in my own code and put
most functionality into feature extractors because there's only one
class to implement instead of three, and when you're done it works for
all classifiers instead of just one. But I'm fine with keeping my more
practical (but less pure) code out of ClearTK.
Steve
P.S. I think my current plan will probably be to create an
EuclideanNormExtractor which takes as constructor parameters other
SimpleFeatureExtractors. When .extract() is called, it will collect all
their Features, apply normalization and return the resulting
List<Feature>. This way, I can group features arbitrarily for
normalization by simply creating more than one EuclideanNormExtractor.
Original comment by pvogren@gmail.com
on 18 Feb 2009 at 8:56
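The plan in the P.S. above can be sketched roughly as follows. The `Feature` class and `SimpleFeatureExtractor` interface here are simplified stand-ins written for this example (the real ClearTK types work on the CAS, not on a plain string), so treat this as an illustration of the wrapping idea only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch only: Feature and SimpleFeatureExtractor are simplified stand-ins. */
class EuclideanNormSketch {

    /** Stand-in for a feature: a name and a numeric value. */
    static class Feature {
        final String name;
        final double value;

        Feature(String name, double value) {
            this.name = name;
            this.value = value;
        }
    }

    /** Stand-in for the extractor interface; a plain string replaces the CAS arguments. */
    interface SimpleFeatureExtractor {
        List<Feature> extract(String text);
    }

    /** Wraps other extractors and L2-normalizes the features they produce as one group. */
    static class EuclideanNormExtractor implements SimpleFeatureExtractor {
        private final List<SimpleFeatureExtractor> extractors;

        EuclideanNormExtractor(SimpleFeatureExtractor... extractors) {
            this.extractors = Arrays.asList(extractors);
        }

        @Override
        public List<Feature> extract(String text) {
            // Collect every feature from the wrapped extractors.
            List<Feature> collected = new ArrayList<>();
            for (SimpleFeatureExtractor extractor : extractors) {
                collected.addAll(extractor.extract(text));
            }
            // Compute the Euclidean norm over the whole group.
            double sumOfSquares = 0.0;
            for (Feature f : collected) {
                sumOfSquares += f.value * f.value;
            }
            double norm = Math.sqrt(sumOfSquares);
            // Rescale; an all-zero group is left at zero rather than divided by zero.
            List<Feature> normalized = new ArrayList<>();
            for (Feature f : collected) {
                normalized.add(new Feature(f.name, norm == 0.0 ? 0.0 : f.value / norm));
            }
            return normalized;
        }
    }

    public static void main(String[] args) {
        SimpleFeatureExtractor a = t -> Arrays.asList(new Feature("word_a", 3.0));
        SimpleFeatureExtractor b = t -> Arrays.asList(new Feature("word_b", 4.0));
        for (Feature f : new EuclideanNormExtractor(a, b).extract("ignored")) {
            System.out.println(f.name + " " + f.value);
        }
    }
}
```

Creating a second `EuclideanNormExtractor` around a different set of extractors gives a second, independently normalized group, which is the arbitrary grouping the P.S. describes.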
[Philipp]
It's really one class versus two (feature extractor and feature encoder),
plus one added line in the encoder factory. And since feature encoders can
be used for more than one type of classifier, the result does work with most
classifiers (and where it doesn't, it's trivial to get it to work to the
extent that your approach does). With Eclipse's help in writing Java
boilerplate for you, you end up writing the same amount of code in either
case. They're both equally practical in that sense.
It would be helpful at this point to have 2P's input. It appeared before,
however, that his perspective was similar to yours. That being the case, it
makes much more sense for you two to structure feature extraction / feature
encoding / whatever you want to call it the way that seems right to you. I
can maintain my own set of changes to implement it the way I prefer it.
Original comment by pvogren@gmail.com
on 18 Feb 2009 at 9:10
[Steve's summary]
Steve's view: feature extractors are for classifier independent code
--------------------------------------------------------------------
Anything involving features that is classifier independent belongs in
the feature extraction layer. Things that are classifier dependent (e.g.
the string to number conversions of SVMs) belong in the feature encoding
layer.
Example: Converting a syntactic path to "NP::S;;VP" is classifier
independent, so it belongs in feature extraction.
Example: Euclidean normalization can be applied to features for any type
of classifier, so it belongs in feature extraction.
Feature extractors are easier to create and use because you only need to
create a single class (e.g. EuclideanNormalizationFeatureExtractor) and
use it in your AnnotationHandler.
Feature extractors also have the advantage of working for any classifier.
3P's view: feature extractors are only for selecting pieces of the CAS
----------------------------------------------------------------------
The only thing that feature extractors should do is look at the CAS and
select pieces of it. Anything that modifies, combines, etc. the pieces
of the CAS belongs in the feature encoding layer.
Example: Extracting a path of NP, S and VP nodes from the CAS belongs in
feature extraction, but converting those objects to the string
"NP::S;;VP" is a representation issue so it belongs in feature encoding.
Example: Euclidean normalization is a transformation of information
extracted from the CAS, so it belongs in feature encoding.
Feature encoders are easy enough to use. You just need to create a new
feature encoder class (e.g. EuclideanNormalizationFeatureEncoder),
create a new encoder factory class which inherits from an existing
encoder factory (e.g. SVMEncoderFactory) and adds a single call to
addEncoder(), and then specify your new encoder factory using the
"EncoderFactoryClass" parameter to DataWriter_ImplBase.
Feature encoders aren't totally classifier independent, but in many
cases, your code would work for multiple classifiers (e.g. all SVMs, and
more if we can merge ContextValue and FeatureVector).
Steve
Original comment by pvogren@gmail.com
on 18 Feb 2009 at 9:11
[Philipp's summary]
There's no misrepresentation; I'm just going to rephrase a bit where I think
the terminology is unclear. For one thing, let's not use the "classifier
dependent" / "classifier independent" terms, because we both use them in
different ways. I'm trying to be fair in representing both sides; let me
know if you disagree with the way I'm phrasing things.
Steve's view
------------
Anything involving features that can in principle be applied to all
classifiers belongs in the feature extraction layer. Things that only apply
to specific classifiers (and can't reasonably be applied to others), such as
the string to number conversions of SVMs, belong in the feature encoding
layer.
Example: Creating a syntactic path feature such as the string "NP::S;;VP"
can be done (and is potentially useful) for any classifier, so it belongs in
feature extraction.
Example: Euclidean normalization can be applied to features for any type of
classifier, so it belongs in feature extraction.
Feature extractors are easier to create and can be immediately applied to
any classifier. In exchange they commit to one specific representation of
the feature, which may not give best results with all classifiers, and which
can only be optimized for a different classifier by changing the code in the
feature extractor.
Philipp's view
--------------
Anything involving features that is potentially affected by the choice of
classifier should go into feature encoding. Things that *can* be applied to
any classifier, but have potentially different effects, should also go into
feature encoding. Only things that are not related to classifier choice in
any way should go into feature extraction.
Example: Extracting a path of NP, S and VP nodes from the CAS belongs in
feature extraction, but converting those objects to the string "NP::S;;VP"
is only one possible representation; some classifiers may allow a more
powerful representation, so the choice to create that string should be made
in feature encoding.
Example: Euclidean normalization is a transformation of information
extracted from the CAS; there is an infinite number of such transformations
that could conceivably be applied, and the choice of classifier dictates
which ones promise good results and which ones don't. Thus it belongs in
feature encoding.
Creating a new kind of feature extractor in this model requires a bit more
work. The feature extractor itself is much simpler. But you also create a
new feature encoder class (e.g. EuclideanNormalizationFeatureEncoder), which
does the main work. Then you modify your encoder factory (or subclass a
default one, if you're not using a custom one yet) and add a single call to
addEncoder() with the new encoder as argument. The factory class is passed
to the DataWriter as a parameter as always.
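The subclass-and-addEncoder() step might look roughly like this. Apart from the method name addEncoder() and the example encoder class name, which come from this thread, everything is an assumption: `BaseEncoderFactory` is a hypothetical stand-in for an existing factory, the `FeatureEncoder` interface is reduced to a name, and passing the factory class to the DataWriter through its parameter is not shown.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: stand-in types that mimic the shape described above, not the ClearTK API. */
class EncoderFactorySketch {

    /** Reduced to a name so the example stays short. */
    interface FeatureEncoder {
        String name();
    }

    /** Hypothetical stand-in for an existing factory with its default encoders. */
    static class BaseEncoderFactory {
        private final List<FeatureEncoder> encoders = new ArrayList<>();

        BaseEncoderFactory() {
            addEncoder(() -> "string-to-number");
        }

        protected final void addEncoder(FeatureEncoder encoder) {
            encoders.add(encoder);
        }

        List<FeatureEncoder> createEncoders() {
            return new ArrayList<>(encoders);
        }
    }

    /** The new encoder class; the actual normalization logic is omitted here. */
    static class EuclideanNormalizationFeatureEncoder implements FeatureEncoder {
        @Override
        public String name() {
            return "euclidean-normalization";
        }
    }

    /** The "one added line": a subclass that registers the new encoder. */
    static class NormalizingEncoderFactory extends BaseEncoderFactory {
        NormalizingEncoderFactory() {
            addEncoder(new EuclideanNormalizationFeatureEncoder());
        }
    }

    public static void main(String[] args) {
        for (FeatureEncoder encoder : new NormalizingEncoderFactory().createEncoders()) {
            System.out.println(encoder.name());
        }
    }
}
```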
This does not automatically let you use the feature extractor for all
classifiers. To make it work with another classifier, you might have to
subclass another encoder factory. If the classifier works in a very
different way, you might have to write another feature encoder. In exchange
the user of this feature extractor can use it in their annotation handlers
with no consideration to the type of classifier used. When switching to a
new classifier, it's always possible to achieve optimum performance with
that classifier by only customizing feature encoding, not the annotation
handler. It's also easier to experiment with different ways of representing
a feature by using different feature encoders.
Original comment by pvogren@gmail.com
on 18 Feb 2009 at 9:16
[Philip Ogren]
I can see valid points on both sides of the argument. However, I think that
Philipp has made a clearer case for his approach. Let me start by going
through our working examples:
- syntactic path example. For one, it is no extra work for the encoder to
receive a Feature whose value is a syntactic path object and do the default
thing, which is to convert it to a string - presumably the syntactic path
object knows how to do this for the encoder anyways. For two, flattening the
path to a string complicates feature proliferation - if there are
"sub-features" to be had from the syntactic path then getting them out of a
string representation is a pain (but this is an aside). For three, suppose
there is some SVM kernel that can really take advantage of structured
features - why make the annotation handler worry about this? I would argue
that we did the syntactic path features wrong. If you look at the
WindowFeature - we don't just automatically create a name/value pair where
the value encodes all of the pertinent information - we actually pass a
value that has all of the information explicitly represented.
- Euclidean normalization - I think Philipp's arguments are more compelling
here and I like his three step solution for accomplishing what you need. I
think it is out of place for the AnnotationHandler to be deciding which
normalization technique to use - let the encoder factory set this up.
- binning values (my example) - When I use maxent, it is convenient to bin
numeric values (e.g. high, medium, low), since maxent doesn't really handle
them in the same way that SVMs do. Why should the annotation handler care
which classifier is being used and how best to bin feature values? It
shouldn't.
- tf/idf - Philipp and I discussed this and decided that it makes sense for
the annotation handler to count up term frequencies and that's it. IDF
values are going to come from some precomputed value and they can be used
just as easily in the feature encoder as in the annotation handler. And
aren't there like 15 ways to calculate TF/IDF? Of course, how the various
ways of calculating TF/IDF should be abstracted out remains to be worked
out - but it seems to me that deciding how term frequency information is
presented to the classifier is a job for the encoder not the extractor.
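For the binning example in the list above, a minimal sketch might look like this. The thresholds 0.33 and 0.66, the `name=bin` string format and all method names are assumptions chosen for illustration; nothing here is taken from ClearTK or from a particular maxent package.

```java
/** Sketch only: binning as an encoding decision; thresholds and names are made up. */
class BinningSketch {

    /** Maps a numeric value to a coarse label using two cut points. */
    static String bin(double value, double lowBelow, double highFrom) {
        if (value < lowBelow) {
            return "low";
        }
        if (value < highFrom) {
            return "medium";
        }
        return "high";
    }

    /** Encoding for a classifier that wants nominal features: name and bin in one string. */
    static String encodeForMaxent(String featureName, double value) {
        return featureName + "=" + bin(value, 0.33, 0.66);
    }

    /** Encoding for a classifier that handles numbers directly: the value passes through. */
    static double encodeForSvm(double value) {
        return value;
    }

    public static void main(String[] args) {
        double extracted = 0.5; // the annotation handler only supplies this number
        System.out.println(encodeForMaxent("score", extracted));
        System.out.println(encodeForSvm(extracted));
    }
}
```

The annotation handler supplies the same number in both cases; only the encoder knows whether it will be binned.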
Architecturally - I think Philipp's proposal is the right one and we should
go down that route. The distinction between feature extraction and feature
encoding is clear and it will be a much more powerful and flexible approach.
One of my mental hangups is that I have this nagging intuition that Steve's
approach "would just be easier". Another is getting used to the idea of
having many different feature encoding scenarios for a particular annotation
handler. After all, we started off with no feature encoding, assuming that
once we had the features it was just a matter of creating the right file
format. However, I think that when we get used to writing EncoderFactories
to go with different annotation handler / classifier combinations, this will
all start to feel quite natural. Solutions for how to make expected behavior
the default or easy-to-use will become obvious, I think. I don't think that
Philipp's assessment that creating new feature encoders is going to be
harder than creating new feature extractors is correct. In most cases we can
treat the encoder factory as similar to a configuration file, and a new
factory class will be the only thing required of a developer for a new
feature encoding strategy.
Original comment by pvogren@gmail.com
on 19 Feb 2009 at 12:52
Well, 2 against 1, so that settles it. In ClearTK, feature extractors will
only select pieces of the CAS, and will never combine or transform these in
any way (e.g. they will never format objects as strings or apply
normalization to feature values).
One of you should put together a tutorial/explanation of what the different
layers are for, and what should go where. It would be good to warn people
that for most complex tasks, they'll end up writing both an
AnnotationHandler and an EncoderFactory. That's not unreasonable -
"AnnotationHandler" and "EncoderFactoryClass" are both parameters for
DataWriters.
I personally don't like having to write and synchronize two classes for
every task I do. So in my own code, I'm probably going to continue to do
everything at the feature extraction level. But I'll make sure to keep that
code out of ClearTK.
Feel free to rewrite my TF/IDF code to move parts into the encoding layer.
It's not really clear to me how you'd do that, so I'll be interested to see
what you guys come up with.
Original comment by steven.b...@gmail.com
on 19 Feb 2009 at 1:10
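One possible reading of the TF/IDF split discussed earlier in the thread (the annotation handler only counts terms, the encoder applies precomputed IDF values) is sketched below. This is not the rewrite the thread participants ended up with; the class, the method names, the plain count-times-IDF weighting and the choice to give unseen terms a weight of 0 are all assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: one hypothetical way to split TF/IDF between extraction and encoding. */
class TfIdfSplitSketch {

    /** Extraction side: only counts how often each term occurs in one document. */
    static Map<String, Integer> countTerms(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    /** Encoding side: multiplies each count by a precomputed IDF value. */
    static Map<String, Double> encode(Map<String, Integer> counts, Map<String, Double> idf) {
        Map<String, Double> weighted = new HashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            // Assumption: terms without a precomputed IDF value get weight 0.
            double weight = idf.getOrDefault(entry.getKey(), 0.0);
            weighted.put(entry.getKey(), entry.getValue() * weight);
        }
        return weighted;
    }

    public static void main(String[] args) {
        Map<String, Double> idf = new HashMap<>();
        idf.put("cat", 2.0);
        idf.put("the", 0.5);
        Map<String, Integer> counts = countTerms(Arrays.asList("the", "cat", "the"));
        System.out.println(encode(counts, idf));
    }
}
```

A different TF/IDF variant would then be a different `encode` implementation over the same counts.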
Too bad we didn't have the dev list running before this thread was started.
Duh! I think I will post a note to the list just so that this thread is
searchable from the list archive.
Philipp has opened a real issue related to the first posts of this issue at #74.
Original comment by pvogren@gmail.com
on 18 Mar 2009 at 10:06
Original issue reported on code.google.com by
pvogren@gmail.com
on 18 Feb 2009 at 5:54