noushadali / cleartk

Automatically exported from code.google.com/p/cleartk

feature extraction vs. feature encoding #55

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
[Steve]

Consider this scenario:

* I'm working on a document classification task where I want to combine,
say, word-based features with citation-based features.

* I want to apply Euclidean normalization to each group of features,
that is, I want to normalize the word-based features separately from the
citation-based features.

How should I go about this?

- Normalization is currently done in FeaturesEncoders, but
FeaturesEncoders just see an Iterable of Features - they have no concept
of groups.

- I know which features are in which groups in the AnnotationHandler, so
maybe I should do the normalization by hand there? I'd basically have to
re-implement FeatureVector.l2Norm() to work over Feature objects though...
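
A minimal sketch of the "normalize by hand in the AnnotationHandler" option, assuming numeric feature values. The `Feature` class here is a hypothetical stand-in for ClearTK's own class, and `l2Normalize` is an illustrative helper, not an existing ClearTK method:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: L2-normalize one group of numeric features at a time. */
final class GroupNormalizer {

    /** Hypothetical stand-in for a ClearTK Feature: a name plus a value. */
    static final class Feature {
        final String name;
        final Object value;
        Feature(String name, Object value) { this.name = name; this.value = value; }
    }

    /** Returns copies of the features scaled so the group has Euclidean norm 1. */
    static List<Feature> l2Normalize(List<Feature> group) {
        double sumOfSquares = 0.0;
        for (Feature f : group) {
            double v = ((Number) f.value).doubleValue();
            sumOfSquares += v * v;
        }
        double norm = Math.sqrt(sumOfSquares);
        List<Feature> result = new ArrayList<>();
        for (Feature f : group) {
            double v = ((Number) f.value).doubleValue();
            // An all-zero group has no direction; leave its values unchanged.
            result.add(new Feature(f.name, norm == 0.0 ? v : v / norm));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Feature> words = Arrays.asList(
                new Feature("word_svm", 3.0), new Feature("word_kernel", 4.0));
        List<Feature> citations = Arrays.asList(new Feature("cites_paperA", 2.0));
        // Each group is normalized separately before the two are combined.
        List<Feature> all = new ArrayList<>(l2Normalize(words));
        all.addAll(l2Normalize(citations));
        for (Feature f : all) {
            System.out.println(f.name + " = " + f.value);
        }
    }
}
```

The cost of this approach is that the extraction code now decides how values are scaled, which is the concern raised in the replies.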

Anyone have any better ideas?

Original issue reported on code.google.com by pvogren@gmail.com on 18 Feb 2009 at 5:54

GoogleCodeExporter commented 9 years ago
[Philip]
I don't have an obvious answer. I would want a few more use cases for "feature
groups" before we consider making our feature extraction and encoding API even
more complicated. On the other hand, it seems strange to have the
AnnotationHandler worrying about feature value normalization. This would seem
to blur the line between extraction and encoding.

What harm is there in giving the AnnotationHandler the ability to normalize a
group of features that it knows it wants normalized? Does it make it harder to
swap out one learner for another?

I'm inclined to endorse re-implementation of l2Norm in some place where it is
easy for AnnotationHandlers to make use of it.

Original comment by pvogren@gmail.com on 18 Feb 2009 at 7:06

GoogleCodeExporter commented 9 years ago
[Steve]

Yeah, I wasn't really looking for a new API, just a suggestion on how
best to use our current APIs to work around a problem.

I think for the most part the answer is no. I guess the one thing you'd
lose is the ability to normalize features that are string-valued in the
AnnotationHandler but converted to numeric features by one of the
FeatureEncoders. That seems like an unlikely use case to me - if you
want something normalized, you probably already have it in numeric form.
But maybe I'm just not creative enough in coming up with use cases.  ;-) 

Original comment by pvogren@gmail.com on 18 Feb 2009 at 7:10

GoogleCodeExporter commented 9 years ago
[Philipp]
From my perspective, this is clearly a problem of feature encoding (i.e. "how do
I present this feature to the classifier" rather than "how do I get the value of
this feature"), and as such it should be handled in a feature encoder, not an
annotation handler.

The solution is pretty simple, I think: in the annotation handler, rather than
just throwing all of the features into a bag, group them into one more complex
feature. For example, let's say you want to normalize the bag-of-words features,
ignoring any other features that also show up in the feature vector. Instead of
throwing 10000 individual word features into the list of features generated by
the annotation handler, group those 10000 features into a bag-of-words feature
(i.e. write a simple BagOfWords class that encapsulates all the words that show
up).

Then you customize the feature encoding by adding a feature encoder that
dispatches on BagOfWords features, does the normalizing, and creates a long list
of feature vector elements. You also disable global normalization in the
features encoder.

This sounds complicated, but it requires only three things:

1) write a trivial BagOfWords class and modify the feature extraction to wrap
the long list of words in an object of that class

2) write a feature encoder for BagOfWords -- this is where the actual
normalization work is being done

3) write a features encoder factory to use the new feature encoder -- or simply
add it to the default encoder factory, because it doesn't change the default
behavior noticeably
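
Steps 1 and 2 can be sketched roughly as follows. This is an illustration under stated assumptions, not ClearTK code: `Feature`, `NameNumber`, and the `encodes`/`encode` method pair are simplified, hypothetical stand-ins for the real classes, and the factory wiring from step 3 is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class BagOfWordsSketch {

    /** Hypothetical stand-in for ClearTK's Feature: a name plus an arbitrary value. */
    static final class Feature {
        final String name;
        final Object value;
        Feature(String name, Object value) { this.name = name; this.value = value; }
    }

    /** Step 1: a trivial value class holding word counts for one document. */
    static final class BagOfWords {
        final Map<String, Integer> counts = new LinkedHashMap<>();
        BagOfWords(List<String> words) {
            for (String word : words) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    /** Hypothetical encoded element: one named entry of the final feature vector. */
    static final class NameNumber {
        final String name;
        final double number;
        NameNumber(String name, double number) { this.name = name; this.number = number; }
    }

    /** Step 2: dispatches on BagOfWords values and normalizes only those counts. */
    static final class BagOfWordsEncoder {
        boolean encodes(Feature feature) {
            return feature.value instanceof BagOfWords;
        }

        List<NameNumber> encode(Feature feature) {
            BagOfWords bag = (BagOfWords) feature.value;
            double sumOfSquares = 0.0;
            for (int count : bag.counts.values()) {
                sumOfSquares += (double) count * count;
            }
            double norm = Math.sqrt(sumOfSquares);
            List<NameNumber> encoded = new ArrayList<>();
            for (Map.Entry<String, Integer> entry : bag.counts.entrySet()) {
                double scaled = norm == 0.0 ? 0.0 : entry.getValue() / norm;
                encoded.add(new NameNumber(feature.name + "_" + entry.getKey(), scaled));
            }
            return encoded;
        }
    }

    public static void main(String[] args) {
        // The annotation handler emits ONE feature whose value is the whole bag.
        Feature words = new Feature("word", new BagOfWords(Arrays.asList("svm", "kernel", "svm")));
        for (NameNumber element : new BagOfWordsEncoder().encode(words)) {
            System.out.println(element.name + ":" + element.number);
        }
    }
}
```

The extraction side only wraps the words; every decision about scaling stays inside the encoder.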

In my opinion this approach is much better, because it makes good use of the
functionality we already have, it makes decisions about encoding where they
ought to be made, it's extremely flexible, it's intuitive (once the idea of
"feature extraction" versus "feature encoding" is understood), and it's easy to
provide a default implementation that just works even without that
understanding.

Discuss :D

Original comment by pvogren@gmail.com on 18 Feb 2009 at 7:10

GoogleCodeExporter commented 9 years ago
[Steve]

> 1) write a trivial BagOfWords class and modify the feature extraction to
> wrap the long list of words in an object of that class

BagOfWords (which I'm going to refer to as FeatureGroup) would have to
be a subclass of Feature since Instances only have a List<Feature>.
That's a little odd, since it would be nonsensical to ask for the name
or value of a FeatureGroup. You'd probably want to override getName()
and getValue() to throw exceptions just so someone didn't accidentally
treat it like a real Feature.

> 2) write a feature encoder for BagOfWords -- this is where the actual
> normalization work is being done

Actually, that's not true of our current setup - normalization is done
in the FeaturesEncoder, not in the FeatureEncoder.

But this could work by creating a FeatureGroupFeatureEncoder which took
as a constructor parameter a FeaturesEncoder, and by making FeatureGroup
implement Iterable<Feature>. Then when FeatureGroupFeatureEncoder was
asked to encode a FeatureGroup, it would simply call the encodeAll()
method of the FeaturesEncoder.
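
As a rough sketch of this variant, assuming numeric member features: `FeatureGroup` is a `Feature` subclass that refuses `getName()` and `getValue()`, and the group encoder simply delegates to a wrapped `FeaturesEncoder`. All types below are simplified, hypothetical stand-ins, and `encodeAll()` returning a `double[]` is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class DelegatingGroupSketch {

    /** Hypothetical stand-in for ClearTK's Feature. */
    static class Feature {
        private final String name;
        private final Object value;
        Feature(String name, Object value) { this.name = name; this.value = value; }
        String getName() { return name; }
        Object getValue() { return value; }
    }

    /** A group that is itself a Feature, so it fits into a List<Feature>. */
    static final class FeatureGroup extends Feature implements Iterable<Feature> {
        private final List<Feature> members;
        FeatureGroup(List<Feature> members) {
            super(null, null);
            this.members = new ArrayList<>(members);
        }
        // A group has no single name or value; fail loudly if treated as a plain feature.
        @Override String getName() { throw new UnsupportedOperationException("FeatureGroup has no name"); }
        @Override Object getValue() { throw new UnsupportedOperationException("FeatureGroup has no value"); }
        @Override public Iterator<Feature> iterator() { return members.iterator(); }
    }

    /** Hypothetical FeaturesEncoder: encodes a whole collection at once. */
    interface FeaturesEncoder {
        double[] encodeAll(Iterable<Feature> features);
    }

    /** Encodes numeric features in iteration order, then L2-normalizes the result. */
    static final class NormalizingFeaturesEncoder implements FeaturesEncoder {
        @Override public double[] encodeAll(Iterable<Feature> features) {
            List<Double> values = new ArrayList<>();
            double sumOfSquares = 0.0;
            for (Feature feature : features) {
                double v = ((Number) feature.getValue()).doubleValue();
                values.add(v);
                sumOfSquares += v * v;
            }
            double norm = Math.sqrt(sumOfSquares);
            double[] encoded = new double[values.size()];
            for (int i = 0; i < encoded.length; i++) {
                encoded[i] = norm == 0.0 ? 0.0 : values.get(i) / norm;
            }
            return encoded;
        }
    }

    /** Handles FeatureGroups by handing their members to the wrapped FeaturesEncoder. */
    static final class FeatureGroupFeatureEncoder {
        private final FeaturesEncoder delegate;
        FeatureGroupFeatureEncoder(FeaturesEncoder delegate) { this.delegate = delegate; }
        boolean encodes(Feature feature) { return feature instanceof FeatureGroup; }
        double[] encode(Feature feature) { return delegate.encodeAll((FeatureGroup) feature); }
    }

    public static void main(String[] args) {
        FeatureGroup group = new FeatureGroup(Arrays.asList(
                new Feature("tf_svm", 3.0), new Feature("tf_kernel", 4.0)));
        FeatureGroupFeatureEncoder encoder =
                new FeatureGroupFeatureEncoder(new NormalizingFeaturesEncoder());
        System.out.println(Arrays.toString(encoder.encode(group)));
    }
}
```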

> 3) write a features encoder factory to use the new feature encoder -- or
> simply add it to the default encoder factory, because it doesn't change
> the default behavior noticeably

There's currently no such thing as "the" default encoder factory right
now. We talked about creating one from FileSystemEncoderFactory, but
looking at the code, I'm not entirely sure how that would work - the
various encoder factories all seem to take very different approaches to
initialization.

Original comment by pvogren@gmail.com on 18 Feb 2009 at 7:17

GoogleCodeExporter commented 9 years ago
[Philipp]
>> 1) write a trivial BagOfWords class and modify the feature extraction to
>> wrap the long list of words in an object of that class

> BagOfWords (which I'm going to refer to as FeatureGroup) would have to
> be a subclass of Feature since Instances only have a List<Feature>.
> That's a little odd, since it would be nonsensical to ask for the name
> or value of a FeatureGroup.

That would be very odd indeed. What I mean is, instead of having 10000 Features
with a String value, we have one Feature with a BagOfWords (subclass of Object)
value (containing all the Strings). This is consistent with what we have now:
some features contain String values, some contain Integer values, some contain
Boolean values. The reason why we, together, decided to allow this is that we
recognized that feature encoders might want to handle different kinds of
features in different ways; we also thought it was important to not restrict the
type a value can have to a few arbitrary ones, because who knows what special
kinds of feature encoding people come up with. Your scenario is one kind of
special feature encoding that we hadn't thought of specifically, but that's
easily handled by the framework we came up with.

As soon as you want to treat all the features generated by a bag of words as a
unit of some kind (e.g. by normalizing them in a particular way), the features
aren't just a collection of individual values with context, they are a
hierarchical, complex value structure (i.e. one bag-of-words feature instead of
10000 string features). Thus the annotation handler should pass them on as such
to feature encoding. No change in API and no special cases in any of our
existing code is required.

Since you're coming back to the "FeatureGroup" name: I understand that there are
other scenarios where you might want to normalize a sub-group of the features.
But thinking of it in those terms when writing the annotation handler is bad.
The annotation handler creates / extracts features, it doesn't worry about how
they are encoded. The reason you want to normalize a specific sub-group of the
feature vector is not that they're part of an arbitrary group of features that
the annotation handler designated -- the reason is that they're all part of the
same bag of words, or that they were all created by some other collective
feature extractor, or that they all have something else in common. The
annotation handler has no business deciding what gets normalized or encoded in a
specific way. But if the feature encoding code lacks information to do the kind
of encoding that you need, the annotation handler needs to expose more of the
structure of the extracted features, that's all. We _may_ want to have an
abstract superclass FeatureGroup (subclass of Object) that BagOfWords inherits
from, as do other such feature collections that we come across.

>> 2) write a feature encoder for BagOfWords -- this is where the actual
>> normalization work is being done

> Actually, that's not true of our current setup - normalization is done
> in the FeaturesEncoder, not in the FeatureEncoder.

Yes, obviously that's not how we're doing this now. But our current setup was
also designed with the thought that you'd normalize the feature vector globally,
not taking into account its internal structure. This approach fails in your
scenario; attempting to work around that limitation will require a hack.

>> 3) write a features encoder factory to use the new feature encoder -- or
>> simply add it to the default encoder factory, because it doesn't change
>> the default behavior noticeably

> There's currently no such thing as "the" default encoder factory right
> now. We talked about creating one from FileSystemEncoderFactory, but
> looking at the code, I'm not entirely sure how that would work - the
> various encoder factories all seem to take very different approaches to
> initialization.

No, there's no one default encoder factory now, and there never will be, because
it wouldn't make any sense -- the reason we came up with all of this is that
different classifiers _require_ different encodings. We _do_, however, have a
default encoder factory for each classifier (and in some cases, like SVMlight /
LIBSVM, they share a common superclass). That is what I was talking about.

I realize that I'm always the opposing voice when we're discussing feature
encoding, and it often looks like I make things more complicated than they are.
The reason I feel strongly about this is: When we sat together and worked out
the feature encoding framework we have now, we did a _really_ good job. The way
we broke things up makes a lot of sense, it's extremely flexible and powerful,
at the same time the core idea is very simple and easy to understand, and we
managed to actually completely de-couple feature extraction from classifier
choice -- not in an ad-hoc way that's specific to a couple of standard
scenarios, but in a generalizable and conceptually sound way. Attempting to
"fix" the system by blurring the boundaries between feature extraction and
feature encoding that we created will severely weaken what we have. The
resulting work-arounds are idiosyncratic and don't generalize, but moreover they
are no easier for the beginner to understand than if we do it RIGHT within the
framework we have -- and they always make it much harder for people who take the
time to really understand what the framework does, and who are trying to use its
power for their purposes.

Original comment by pvogren@gmail.com on 18 Feb 2009 at 7:19

GoogleCodeExporter commented 9 years ago
[Steve]

> That would be very odd indeed. What I mean is, instead of having 10000
> Features with a String value, we have one Feature with a BagOfWords
> (subclass of Object) value (containing all the Strings).

Ah. I see. Yeah, that makes sense.

> Since you're coming back to the "FeatureGroup" name: I understand that
> there are other scenarios where you might want to normalize a sub-group
> of the features. But thinking of it in those terms when writing the
> annotation handler is bad. The annotation handler creates / extracts
> features, it doesn't worry about how they are encoded. The reason you
> want to normalize a specific sub-group of the feature vector is not that
> they're part of an arbitrary group of features that the annotation
> handler designated

Actually, that's *exactly* the kind of thing I want to normalize. I want
to be able to specify arbitrary features that are conceptually grouped.
For example, I might want to group together all lexical features or all
syntactic features. And once they're grouped, I might do any number of
things: normalization by group, adding additional weight to one group or
another, etc.

Isn't specifying which features are conceptually part of a unit exactly
the kind of thing that belongs in AnnotationHandler?

> > 2) write a feature encoder for BagOfWords -- this is where the actual
> > normalization work is being done
> > 
> >     Actually, that's not true of our current setup - normalization is done
> >     in the FeaturesEncoder, not in the FeatureEncoder.
> > 
> Yes, obviously that's not how we're doing this now. But our current
> setup was also designed with the thought that you'd normalize the
> feature vector globally, not taking into account its internal structure.
> This approach fails in your scenario; attempting to work around that
> limitation will require a hack.

I'm not sure what you're proposing here. Could you elaborate?

> The way we broke things up makes a lot of sense,

Generally.

> it's extremely flexible and powerful,

Absolutely.

> at the same time the core idea is very simple and easy to understand,

I think the fact that we've spent so much time debating how to do things
proves that the core idea is *not* simple or easy to understand. I'm not
saying it's wrong. I'm just saying it's not always intuitive, and there
isn't always one obvious way to do things.

> and we managed to actually completely de-couple feature extraction
> from classifier choice

Also a good thing.

> Attempting to "fix" the system by blurring the boundaries between
> feature extraction and feature encoding that we created will severely
> weaken what we have.

I think you're misinterpreting me here. I'm not trying to blur the
boundaries - I just don't see them as clearly as you do.

Original comment by pvogren@gmail.com on 18 Feb 2009 at 7:27

GoogleCodeExporter commented 9 years ago
[Philipp]

> Actually, that's *exactly* the kind of thing I want to normalize. I want
> to be able to specify arbitrary features that are conceptually grouped.
> For example, I might want to group together all lexical features or all
> syntactic features. And once they're grouped, I might do any number of
> things: normalization by group, adding additional weight to one group or
> another, etc.

Ok... I'm not sure I see why it would make sense to normalize an arbitrary
subset of the features that's not naturally grouped (like a bag of words would
be), but the fact that you're asking for it means I'm wrong about that. I guess
I _can_ see why for experimentation you'd want to, for example, ignore some
features, where the set of features to ignore is not easily apparent from the
way they are extracted. Ok.

> Isn't specifying which features are conceptually part of a unit exactly
> the kind of thing that belongs in AnnotationHandler?

Yes, I agree that the annotation handler is the right place to encode the
"structure" of the features, including which features are conceptually grouped.
I had assumed that that grouping, where relevant, would always correspond to the
way features are extracted and could be encoded that way (see the BagOfWords
example), but obviously I was wrong.

The question is, then: Are the feature groupings that you'll want to use always
strictly hierarchical, never overlapping? I.e., is it impossible for a feature
to belong to more than one group, for the purpose of feature encoding? If we do
not place such a restriction, things get complicated, and I don't think we can
avoid changes in the API. But if we feel comfortable with keeping such
restrictions, the solution is simple, and we've already discussed it in this
thread:

We introduce a FeatureGroup class (extends Object). A FeatureGroup has a name,
and it contains a set (list?) of Features, that's all. On the feature encoding
side we introduce a FeatureGroupEncoder (implements FeatureEncoder). The default
FeatureGroupEncoder works recursively like a FeaturesEncoder: it has a list of
FeatureEncoders and simply encodes the Features in the FeatureGroup one by one.
Then we can introduce other FeatureGroupEncoders that will dispatch only on
FeatureGroups of a given name and do things like normalization (or parameterize
the default FeatureGroupEncoder to be able to do that or whatever).

Required to get this to work with what we have:
- create the (trivial) FeatureGroup class
- the annotation handler can manually wrap lists of features in FeatureGroups
(giving the groups names, so they can be identified during feature encoding)
- create the encoder classes mentioned above, and change our default encoder
factories to include a trivial feature group encoder, which effectively flattens
out all the feature groups
- to customize behavior based on feature groups, write an encoder factory that
includes custom feature group encoders, which dispatch based on the name of a
feature group; each feature group encoder has its own list of feature encoders,
customizing the encoding of individual features in that group, and it may also
do group-wide operations such as normalizations
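
One way the pieces listed above might fit together, as a sketch only: all class names follow the proposal in this comment, but the signatures (`encodes`, `encode`, the `Encoded` element type, `NumberEncoder`) are hypothetical simplifications, and the encoder factory is omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NamedGroupSketch {

    /** Hypothetical stand-in for ClearTK's Feature. */
    static final class Feature {
        final String name;
        final Object value;
        Feature(String name, Object value) { this.name = name; this.value = value; }
    }

    /** The trivial FeatureGroup value: a name plus a list of member features. */
    static final class FeatureGroup {
        final String name;
        final List<Feature> features;
        FeatureGroup(String name, List<Feature> features) { this.name = name; this.features = features; }
    }

    /** Hypothetical encoded element of the final feature vector. */
    static final class Encoded {
        final String name;
        final double number;
        Encoded(String name, double number) { this.name = name; this.number = number; }
    }

    interface FeatureEncoder {
        boolean encodes(Feature feature);
        List<Encoded> encode(Feature feature);
    }

    /** Encodes a single numeric feature as-is. */
    static final class NumberEncoder implements FeatureEncoder {
        @Override public boolean encodes(Feature f) { return f.value instanceof Number; }
        @Override public List<Encoded> encode(Feature f) {
            List<Encoded> out = new ArrayList<>();
            out.add(new Encoded(f.name, ((Number) f.value).doubleValue()));
            return out;
        }
    }

    /** Default group encoder: flattens the group by encoding members one by one. */
    static class FeatureGroupEncoder implements FeatureEncoder {
        private final List<FeatureEncoder> memberEncoders;
        FeatureGroupEncoder(List<FeatureEncoder> memberEncoders) { this.memberEncoders = memberEncoders; }

        @Override public boolean encodes(Feature feature) { return feature.value instanceof FeatureGroup; }

        @Override public List<Encoded> encode(Feature feature) {
            List<Encoded> out = new ArrayList<>();
            for (Feature member : ((FeatureGroup) feature.value).features) {
                boolean handled = false;
                for (FeatureEncoder encoder : memberEncoders) {
                    if (encoder.encodes(member)) {   // first matching encoder wins
                        out.addAll(encoder.encode(member));
                        handled = true;
                        break;
                    }
                }
                if (!handled) throw new IllegalArgumentException("no encoder for " + member.name);
            }
            return out;
        }
    }

    /** Dispatches only on groups with a given name and L2-normalizes their encoding. */
    static final class NormalizingGroupEncoder extends FeatureGroupEncoder {
        private final String groupName;
        NormalizingGroupEncoder(String groupName, List<FeatureEncoder> memberEncoders) {
            super(memberEncoders);
            this.groupName = groupName;
        }
        @Override public boolean encodes(Feature feature) {
            return super.encodes(feature) && groupName.equals(((FeatureGroup) feature.value).name);
        }
        @Override public List<Encoded> encode(Feature feature) {
            List<Encoded> raw = super.encode(feature);
            double sumOfSquares = 0.0;
            for (Encoded e : raw) sumOfSquares += e.number * e.number;
            double norm = Math.sqrt(sumOfSquares);
            List<Encoded> out = new ArrayList<>();
            for (Encoded e : raw) out.add(new Encoded(e.name, norm == 0.0 ? 0.0 : e.number / norm));
            return out;
        }
    }

    public static void main(String[] args) {
        List<FeatureEncoder> members = Arrays.<FeatureEncoder>asList(new NumberEncoder());
        Feature lexical = new Feature("group", new FeatureGroup("lexical",
                Arrays.asList(new Feature("tf_svm", 3.0), new Feature("tf_kernel", 4.0))));
        for (Encoded e : new NormalizingGroupEncoder("lexical", members).encode(lexical)) {
            System.out.println(e.name + ":" + e.number);
        }
    }
}
```

Nested groups would work the same way if a group encoder were added to its own member-encoder list; that recursion is not shown here.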

> > Yes, obviously that's not how we're doing this now. But our current
> > setup was also designed with the thought that you'd normalize the
> > feature vector globally, not taking into account its internal structure.
> > This approach fails in your scenario; attempting to work around that
> > limitation will require a hack.

> I'm not sure what you're proposing here. Could you elaborate?

I'm saying: The reason I implemented normalization in the FeaturesEncoder was
that I only intended normalization to be done on an entire feature vector,
treating all elements the same (I admit that that was pretty short-sighted of
me). Normalization of subsets of a feature vector should NOT be done in a
FeaturesEncoder. We already have functionality in place that handles special
encodings of individual features: the FeatureEncoders. So, in order to get
normalization working on a subset of the features, we should encapsulate that
subset in one feature (e.g. the above mentioned FeatureGroup object). This
allows us to include a FeatureEncoder that dispatches only on Features that have
a FeatureGroup value (and then possibly only if the group has a certain name);
such a FeatureEncoder can then do its own normalization, which would normalize
all the features _in that feature group_, independent from the rest.

> I think you're misinterpreting me here. I'm not trying to blur the
> boundaries - I just don't see them as clearly as you do.

I didn't mean to imply that you intended to do that, merely that some of the
suggestions that have come up would have that effect. I believe the main problem
(and the reason that, as you say, this is NOT easy to understand) is that we're
still fuzzy on some of the concepts. That's why it's good we have these
discussions.

The main distinction I'm trying to uphold here is the one between feature
extraction and feature encoding, which to me are two separate things. In my
mental model I place feature extraction entirely in the domain of annotation
handlers, and feature encoding in, well, the feature encoding code.

Feature extraction is, to me, the process of analyzing the "subject of analysis"
or SOFA (to use UIMA's terminology), and identifying and collecting the presumed
relevant bits of information, with some limitations on the complexity of those
bits of information (e.g. strings are fine, but an entire parse tree is too
complex). The considerations influencing this process are, first of all,
specific to the task one is trying to accomplish; in our field, a lot of the
time these will be linguistic considerations, or a general intuition about which
bits of information are useful and which aren't. This can be done without any
knowledge about how the bits of information are used in the end -- the
assumption is that the machine learning system figures that out, as that is what
it's designed to do.

Feature encoding, on the other hand, is not at all concerned with the subject of
analysis. It simply sees a collection of "bits of information" of various types
and has to bring them into a form that the machine learning system can use. It
has to struggle with the fact that most machine learning systems can't
understand all types of information that might arrive; and even if the ML system
basically understands the information, presenting it in a different way might
improve overall performance (think of presenting an integer number as one
numeric SVM feature (123:3) vs. as multiple binary ones (123:0 124:0 125:1
126:0)). The main consideration going into feature encoding is a deep
understanding of the exact ML algorithm that's being used: e.g. what kind of
normalization has which effect, how does the algorithm handle numeric features
when mixed with boolean features, how expensive is it to have a large number of
features, should I give the features long or short names, what's the best way to
encode a string into a numeric vector? I doubt that most potential users of our
system have the expertise to make many informed decisions in the context of
feature encoding, and if we just make sure to provide the most useful default
configuration they'd do best to leave it alone. I recognize of course that
people will want to experiment with it anyway, even though it may be blind
experimentation.
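
To make the (123:3) vs. (123:0 124:0 125:1 126:0) example concrete, here is a small sketch of both encodings. It assumes the integer ranges over 1..4 and that the binary features start at index 123; neither is stated above, and the method names are purely illustrative.

```java
final class IntegerEncodingSketch {

    /** One numeric feature: "index:value". */
    static String asNumeric(int index, int value) {
        return index + ":" + value;
    }

    /** One binary feature per possible value in [min, max], starting at firstIndex. */
    static String asBinary(int firstIndex, int min, int max, int value) {
        if (value < min || value > max) {
            throw new IllegalArgumentException("value out of range: " + value);
        }
        StringBuilder line = new StringBuilder();
        for (int candidate = min; candidate <= max; candidate++) {
            if (line.length() > 0) {
                line.append(' ');
            }
            // Exactly one index gets a 1: the one that corresponds to the actual value.
            line.append(firstIndex + (candidate - min)).append(':').append(candidate == value ? 1 : 0);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        System.out.println(asNumeric(123, 3));
        System.out.println(asBinary(123, 1, 4, 3));
    }
}
```

Which of the two works better depends on the learner, which is why the choice sits on the encoding side.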

For me, these two are conceptually AND practically distinct. Certainly some
things require simultaneous changes to both, but to me it's usually pretty clear
what functionality should go where. Am I alone in this, and does this clear
distinction not make sense?

Original comment by pvogren@gmail.com on 18 Feb 2009 at 7:45

GoogleCodeExporter commented 9 years ago
[Steve]
On 2/16/2009 11:44 PM, Philipp Wetzler wrote:

[snip description of FeatureGroup, FeatureGroupEncoder, etc.]

This sounds basically fine, but I don't think we need to put it into
ClearTK right now. I'm the only one who currently needs it, and it's
certainly not a feature for a basic user. I recommend that we let me
implement the functionality in my own code, use it for a while, and then
at some later point we discuss whether or not to add it to ClearTK.

> Feature extraction [...] can be done without any knowledge about how
> the bits of information are used in the end -- the assumption is that
> the machine learning system figures that out
> [...]
> The main consideration going into feature encoding is a deep
> understanding of the exact ML algorithm that's being used

This basically sounds like "feature extraction is task dependent and
classifier independent" and "feature encoding is classifier dependent
and (maybe) task independent". Is that right?

> For me, these two are conceptually AND practically distinct. Certainly
> some things require simultaneous changes to both, but to me it's usually
> pretty clear what functionality should go where. Am I alone in this, and
> does this clear distinction not make sense?

Well, I can't speak for 2PO, but I certainly wouldn't say that it's been
clear to me which functionality should go where. Consider the following
two examples of classifier-independent things you might want to do:

(1) Applying a Euclidean norm to feature vectors. This is pretty much
the standard for a TF-IDF document representation, regardless of what
classifier you plan to give that representation to.

(2) Making the training and testing data from two different runs
compatible such that the model trained on the training data can be
tested on the testing data (e.g. the feature names/indices match, etc.)

Both of these things should work for any classifier, so I consider them
classifier-independent. But they're both currently handled in the
feature encoding layer. Why?
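
For reference, the compatibility requirement in (2) can be sketched as a name-to-index map that grows during training and is frozen before testing. This is an assumption about how such a mapping could look, not a description of ClearTK's encoders; `FeatureIndexSketch` and its methods are hypothetical, and persisting the map between runs is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FeatureIndexSketch {

    private final Map<String, Integer> indexByName = new LinkedHashMap<>();
    private boolean frozen = false;

    /** Call after training so that test-time lookups can no longer add names. */
    void freeze() {
        frozen = true;
    }

    /** Returns the 1-based index for a name, or -1 for an unseen name once frozen. */
    int indexOf(String name) {
        Integer existing = indexByName.get(name);
        if (existing != null) {
            return existing;
        }
        if (frozen) {
            return -1;
        }
        int next = indexByName.size() + 1;
        indexByName.put(name, next);
        return next;
    }

    public static void main(String[] args) {
        FeatureIndexSketch index = new FeatureIndexSketch();
        System.out.println(index.indexOf("word_svm"));     // assigned during training
        System.out.println(index.indexOf("word_kernel"));  // assigned during training
        index.freeze();
        System.out.println(index.indexOf("word_svm"));     // same index at test time
        System.out.println(index.indexOf("word_unseen"));  // unseen at training time
    }
}
```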

Original comment by pvogren@gmail.com on 18 Feb 2009 at 7:48

GoogleCodeExporter commented 9 years ago
> This basically sounds like "feature extraction is task dependent and
> classifier independent" and "feature encoding is classifier dependent
> and (maybe) task independent". Is that right?

Effectively, that's how it seems to work out.

> [...] Consider the following
> two examples of classifier-independent things you might want to do:
> 
> (1) Applying a Euclidean norm to feature vectors. This is pretty much
> the standard for a TF-IDF document representation, regardless of what
> classifier you plan to give that representation to.

I believe you when you say that that's the standard thing people do, even for
classifiers that don't profit from it -- but that doesn't mean it makes any
sense whatsoever. When I consider how to take a list of TF-IDF values and put
them into an SVM training data file, it makes a lot of sense to consider
normalization schemes, because they _will_ make an immediate and predictable
difference (assuming that I know the SVM implementation well enough). I'm
curious what justification people have for normalizing their features without
taking the classifier into consideration -- I'd really like to know, because I
imagine there is a reason that I'm simply not aware of.

So, going by my current understanding, this kind of normalization only becomes
meaningful in the context of a specific classifier. That doesn't mean you can't
do it on _every_ classifier, but to make an informed decision about using this
normalization you look at the classifier, not the task. So unless there is
another reason to do normalization, this would be feature encoding.

If there is another reason, of course, and the normalization is NOT being done
for the sake of the classifier, then that normalization should be done during
feature extraction.

> (2) Making the training and testing data from two different runs
> compatible such that the model trained on the training data can be
> tested on the testing data (e.g. the feature names/indices match, etc.)
>
> Both of these things should work for any classifier, so I consider them
> classifier-independent. But they're both currently handled in the
> feature encoding layer. Why?

Actually, (2) is _not_ necessary for every classifier -- there are various
classifiers that do their own mapping (i.e. the training data we generate simply
uses names instead of indices), right? So that alone settles it. Reason 2: the
mapping requires knowing how features are encoded (e.g. numbers: one feature
index or many? it affects the mapping). How features are encoded is definitely
classifier dependent (meaning an informed decision takes the classifier into
account). Reason 3: the only reason we care about having feature indices is that
because of the way ML algorithms work most training data formats require us to
use them -- we only have them to accommodate classifier limitations.

Looking at the explanations I wrote I guess I'd say: task-dependent vs.
classifier-dependent, yes. But dependent not in the sense of "I can't do the
same for different classifiers (or tasks)", but "to make an _informed_ choice
about how to do it, I need to primarily consider the classifier (or task)". Of
course a user can still pick an arbitrary normalization scheme because they read
in that one paper that "they normalized the data, and it improved accuracy"
(never mind that their whole setup was completely different). Users can do that,
wherever we put that functionality. But if we're not careful about where things
should go it makes life difficult for users who know what they're doing and want
to customize the system to do what they _know_ will work.

Original comment by pvogren@gmail.com on 18 Feb 2009 at 8:00

GoogleCodeExporter commented 9 years ago
[Steve] 
> > 
> >     This basically sounds like "feature extraction is task dependent and
> >     classifier independent" and "feature encoding is classifier dependent
> >     and (maybe) task independent". Is that right?
> > 
> Effectively, that's how it seems to work out.

Well that's a good rule of thumb that we should document somewhere.

> > [...] Consider the following
> > two examples of classifier-independent things you might want to do:
> > (1) Applying a Euclidean norm to feature vectors. This is pretty much
> > the standard for a TF-IDF document representation, regardless of what
> > classifier you plan to give that representation to.
> > 
> I believe you when you say that that's the standard thing people do,
> even for classifiers that don't profit from it -- but that doesn't mean
> it makes any sense whatsoever.

"Finally, ltc weighting handles differences in document length by cosine
normalizing the feature vectors (normalizing them to have a Euclidean
norm of 1.0)."
  -- David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li
     RCV1: A New Benchmark Collection for Text Categorization Research

They don't mention anything about specific classifiers here, and it
sounds like a task-based reasoning. But more importantly, I suspect they
do this because it's what everyone else has done in the past, and they
want to be able to compare results.

I think trying to stop people from doing normalization for any
classifier they want is a *big* mistake. ClearTK should make whatever it
has available, and let people mix-and-match as they like, regardless of
whether or not *we* think it makes sense.

<digression>
I feel pretty strongly about this, given my experiences distributing the
argparse Python library. Argparse started as an extension of optparse,
and optparse makes claims like:

  "Some other option syntaxes that the world has seen include...
  "-pf"..."-file"..."+rgb"..."/file"... These option syntaxes are not
  supported by optparse, and they never will be. This is deliberate: the
  first three are non-standard on any environment...

This is foolish. Just because you don't like a particular thing is no
justification to keep other people from doing it. Give them some credit
- they probably have their reasons. For example, some of my argparse
users explained that they had to maintain backwards compatibility with
an existing command line interface. With argparse, they can do that
because it doesn't tell them how to design their own command lines. With
optparse, they can't.
</digression>

For ClearTK, I argue that people may have their own reasons for
normalizing for any classifier, and we shouldn't keep them from doing
that just because we think it's the wrong thing to do. Give them some
credit - they probably have their reasons.

In general, I think that anything that *can* be done for any classifier,
regardless of whether or not we think it *should* be done, should be
available in the feature extraction layer.

> > (2) Making the training and testing data from two different runs
> > compatible such that the model trained on the training data can be
> > tested on the testing data (e.g. the feature names/indices match, etc.)
> > 
> Actually, (2) is _not_ necessary for every classifier

That's not right. Making compatible training and testing data *is*
necessary for every classifier. True, for some classifiers there's some
code to do this, and for some classifiers it's a no-op. But the task of
generating matching training and testing data is common to all classifiers.

Let me ask the question a different way. Right now, the creation of
encoders is done in DataWriter_ImplBase, and it has a couple of options:

(1) If an EncoderFactory is specified it creates that object
(2) If an EncoderFactory is not specified it creates the default object

Why doesn't it make sense to add a third:

(3) If (somehow) requested, it loads the object from a file

All of these tasks are "get me some encoders" tasks. Why do the first
two belong in DataWriter_ImplBase, but the third belongs in an
EncoderFactory?
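
To make the three cases concrete, here is a minimal sketch of them as a single lookup. Everything in it is hypothetical (`EncoderLookup`, `Encoder`, the two `Supplier` parameters, and the order in which the cases are checked); it is not the DataWriter_ImplBase code, only an illustration of why all three look like the same kind of task.

```java
import java.util.function.Supplier;

// Hypothetical sketch, not ClearTK code: three ways to "get me some encoders".
final class EncoderLookup {
    /** Stand-in for a features encoder; the real interface is richer. */
    interface Encoder {
        String describe();
    }

    /**
     * Returns an encoder from a saved source if one is given (case 3),
     * otherwise from the specified factory (case 1), otherwise a default (case 2).
     * The precedence chosen here is an assumption made for the example.
     */
    static Encoder getEncoder(Supplier<Encoder> specifiedFactory,
                              Supplier<Encoder> savedEncoderLoader) {
        if (savedEncoderLoader != null) {
            return savedEncoderLoader.get();   // case (3): load from a file
        }
        if (specifiedFactory != null) {
            return specifiedFactory.get();     // case (1): factory was specified
        }
        return () -> "default";                // case (2): default object
    }

    public static void main(String[] args) {
        System.out.println(getEncoder(null, null).describe());
        System.out.println(getEncoder(() -> () -> "custom", null).describe());
        System.out.println(getEncoder(null, () -> () -> "loaded").describe());
    }
}
```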

Steve

P.S. Remember that part of the point of this discussion is to explain
why the boundaries aren't as clear for others as they are for you. Can
you at least see why there's some confusion as to what goes where?

Original comment by pvogren@gmail.com on 18 Feb 2009 at 8:06

GoogleCodeExporter commented 9 years ago
[Philipp]
> They don't mention anything about specific classifiers here, and it
> sounds like a task-based reasoning. But more importantly, I suspect they
> do this because it's what everyone else has done in the past, and they
> want to be able to compare results.
> 
> I think trying to stop people from doing normalization for any
> classifier they want is a *big* mistake. ClearTK should make whatever it
> has available, and let people mix-and-match as they like, regardless of
> whether or not *we* think it makes sense.

As I _tried_ to say in my response, I _do not_ advocate keeping people from
doing whatever they want. And the quote you've given above only says that
normalizing compensates for document length, not why document length would be
an issue otherwise. I'm actually interpreting this to be classifier-based
reasoning (I know it would be an issue for some _classifiers_, but I'm also
pretty sure there are some where it wouldn't), but the quote doesn't actually
say.

I am in no way saying we shouldn't let people mix and match all the
functionality we have, in whatever way they like -- as you say, I'm sure they
have their reason. I'm not saying we should make it difficult, either. I'm
just saying we should structure our code so that functionality that's
necessary in order to accommodate classifiers is kept on one side, whereas
functionality that arises from the task itself, ignoring the classifier, is
kept on the other.

> In general, I think that anything that *can* be done for any classifier,
> regardless of whether or not we think it *should* be done, should be
> available in the feature extraction layer.

There's very little that _can't_ be done for every classifier. If we follow
this rule, we might as well scrap the whole feature encoding layer and pack it
all into feature extraction. The resulting output of feature extraction will,
technically, be usable with any classifier, but actually it will be necessary
to hand-optimize feature extraction for different classifiers to make best use
of their capabilities. In our current model, when switching the classifier the
feature extraction code can be left alone.

E.g., for pretty much all classifiers you *can* l2-normalize the features, so
let's say we put that functionality into feature extraction, and because I'm
using SVMlight and l2-normalization helps with that I'll turn it on. But then
I'm switching to a different SVM implementation, and the documentation
explains that, due to the different algorithm they use, performance will be
better if I scale all features to within [0, 1]. I _could_ just ignore that
advice, because, after all, I *can* still use l2-normalization. But in
practice, because I care about getting good performance, I'll have to change
feature extraction to accommodate a new classifier.
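
As a rough illustration of the two scalings in this example, here is a small sketch. The class and method names are made up, and the per-feature minimum/maximum arrays are assumed to have been collected from training data beforehand; nothing here is taken from ClearTK or from any SVM package.

```java
// Hypothetical sketch: two scalings that different classifiers might prefer.
final class ScalingDemo {
    /** Divide every value by the vector's Euclidean (L2) norm. */
    static double[] l2Normalize(double[] values) {
        double sumOfSquares = 0.0;
        for (double v : values) {
            sumOfSquares += v * v;
        }
        double norm = Math.sqrt(sumOfSquares);
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // An all-zero vector is left as zeros to avoid dividing by zero.
            out[i] = norm == 0.0 ? 0.0 : values[i] / norm;
        }
        return out;
    }

    /** Rescale each feature into [0, 1] using that feature's own min and max. */
    static double[] scaleToUnitInterval(double[] values, double[] mins, double[] maxs) {
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            double range = maxs[i] - mins[i];
            // A constant feature (range 0) is mapped to 0 by convention here.
            out[i] = range == 0.0 ? 0.0 : (values[i] - mins[i]) / range;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] raw = {3.0, 4.0};
        System.out.println(java.util.Arrays.toString(l2Normalize(raw)));
        System.out.println(java.util.Arrays.toString(
                scaleToUnitInterval(raw, new double[] {0.0, 0.0}, new double[] {10.0, 8.0})));
    }
}
```

The same raw values go in either way; only the choice of scaling changes, which is the part that would have to move if the classifier changes.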

I'm just going to quickly mention the possibility of a classifier that only
takes boolean features, not numeric ones. With our current split that can be
accommodated easily.

I don't understand why this is any problem at all. No one is advocating
stopping people from doing anything. If anything, I'm advocating a framework
that encourages people to evaluate their choices in the proper context --
without forcing them to do so.

> Let me ask the question a different way. Right now, the creation of
> encoders is done in DataWriter_ImplBase, and it has a couple of options:
> 
> (1) If an EncoderFactory is specified it creates that object
> (2) If an EncoderFactory is not specified it creates the default object
> 
> Why doesn't it make sense to add a third:
> 
> (3) If (somehow) requested, it loads the object from a file
> 
> All of these tasks are "get me some encoders" tasks. Why do the first
> two belong in DataWriter_ImplBase, but the third belongs in an
> EncoderFactory?

As I have said before, I do not think that (2) belongs in DataWriter_ImplBase;
consequently (3) doesn't either. I suggested before to remove (2) and instead
always give a default factory in our descriptor files (see our last big thread
on this subject). DataWriter_ImplBase shouldn't be concerned with how (or
where) an encoder is created, it should delegate that task to a factory.

> P.S. Remember that part of the point of this discussion is to explain
> why the boundaries aren't as clear for others as they are for you. Can
> you at least see why there's some confusion as to what goes where?

I can see *that* there is confusion. I'm really trying to, but even with all
those examples I honestly don't seem to get *why* -- not sure what that says.
I also don't get why it is an issue. Do you consider it problematic that, in
order to get their desired behavior, people will need to make some changes to
feature encoding, in addition to whatever they're doing in feature extraction,
instead of having to do the same amount of work all in feature extraction?
Even at the cost of giving up (at the least) some degree of classifier
transparency?

Original comment by pvogren@gmail.com on 18 Feb 2009 at 8:12

GoogleCodeExporter commented 9 years ago
[Steve]
> I'm just saying we should structure our code so that
> functionality that's necessary in order to accommodate classifiers is
> kept on one side, whereas functionality that arises from the task
> itself, ignoring the classifier, is kept on the other.

I'm still unable in practice to make the task/classifier distinction in
the same way you do. If I'm doing a task where the standard
representation is TF-IDF with Euclidean normalization, I think of that
as part of the task because it's part of the representation of the
feature space [1]. But you think of it as part of the classifier (I
gather) because it may be more or less effective depending on the
classifier.

[1] Note that to me this is different from the SVM having to encode
feature names as numbers. I can just as easily normalize while the
feature names are still strings.

> > P.S. Remember that part of the point of this discussion is to explain
> > why the boundaries aren't as clear for others as they are for you. Can
> > you at least see why there's some confusion as to what goes where?
> > 
> I can see *that* there is confusion. I'm really trying to, but even with
> all those examples I honestly don't seem to get *why* -- not sure what
> that says. I also don't get why it is an issue.

It's an issue because, as a user of ClearTK, I don't know where best to
put things. This discussion started because there were two approaches to
implementing the kind of feature groupings I needed: normalization
during feature extraction, and normalization during feature encoding.
Both would achieve my goals equally well, and both seem to be about the
same amount of code. My first intuition is to do it during feature
extraction because it's part of the task representation, but your first
intuition is to do it during feature encoding.

Original comment by pvogren@gmail.com on 18 Feb 2009 at 8:15

GoogleCodeExporter commented 9 years ago
> I'm still unable in practice to make the task/classifier distinction in
> the same way you do. If I'm doing a task where the standard
> representation is TF-IDF with Euclidean normalization, I think of that
> as part of the task because it's part of the representation of the
> feature space [1]. But you think of it as part of the classifier (I
> gather) because it may be more or less effective depending on the
> classifier.

Maybe I should define what I mean by "task": Let's say you're doing document
classification, for example by topic. When I say "task", I mean just that:
deciding if document X is topic A or topic B. I'm not talking about what
people usually do for this kind of problem; I'm not talking about a "task" at
a conference that gives you specific framing conditions; I'm not talking about
reproducing the approach of someone else; I'm not talking about using a
specific "feature space". All of those things are important, but they're not
part of what I call the "task".

But for document classification, certain pieces of information are known to be
useful independent of all external framing conditions. The presence of certain
words is known to be a useful bit of information. It's also known that the
frequency of a word in the document divided by its frequency in the corpus is
useful (and it's a distinct piece of information). These are useful, because
they carry information about the topic, they are task-specific. Multiplying
the number I use to represent that information by 0.5 does NOT give me any
more information about the topic. Certainly many researchers also used
normalization schemes on the resulting data, and it improved their
performance. But that improvement is not because "normalization is a good
thing to do for document classification", but because "normalization is a good
thing to do for many ML algorithms". Can you see that at all?

Of course people will want to reproduce what other researchers did, or do
what's considered good practice. So they can do that by customizing feature
encoding along with feature extraction. The "representation of the feature
space" is a result of both of them, combined.

> It's an issue because, as a user of ClearTK, I don't know where best to
> put things. This discussion started because there were two approaches to
> implementing the kind of feature groupings I needed: normalization
> during feature extraction, and normalization during feature encoding.
> Both would achieve my goals equally well, and both seem to be about the
> same amount of code. My first intuition is to do it during feature
> extraction because it's part of the task representation, but your first
> intuition is to do it during feature encoding.

Ok, yes, that is an issue. I'm not sure how to deal with it.

Obviously I have a different background from you two. I started out in ML, and
I first applied it to a completely separate kind of problem (computer vision)
before coming to NLP. I guess it's not surprising that my mind would break
down the problem in a different way, and it seems clear by now that I'm unable
to explain that way to you. On the other hand, I can't let it go, because it's
clear to me that cleanly breaking things into extraction and encoding is much,
much better, and will make things far easier in the future; I do not want to
go back to the old way. So to summarize, I don't know what to do about it.

Original comment by pvogren@gmail.com on 18 Feb 2009 at 8:19

GoogleCodeExporter commented 9 years ago

Original comment by pvogren@gmail.com on 18 Feb 2009 at 8:20

GoogleCodeExporter commented 9 years ago
> But for document classification, certain pieces of information are known
> to be useful independent of all external framing conditions. The
> presence of certain words is known to be a useful bit of information.
> It's also known that the frequency of a word in the document divided by
> its frequency in the corpus is useful (and it's a distinct piece of
> information).

It's also known that it's important to account for differences in
document length. (Hence the normalization.)

Why is document length a classifier specific thing?

Original comment by pvogren@gmail.com on 18 Feb 2009 at 8:23

GoogleCodeExporter commented 9 years ago
[Philipp]
Can you give me any reason why accounting for document length is important (I
mean a detailed explanation, showing how the way you account for it affects
the final outcome) that does NOT involve an ML classifier?

Original comment by pvogren@gmail.com on 18 Feb 2009 at 8:24

GoogleCodeExporter commented 9 years ago
[Steve]
Probably no better than anyone can explain a theoretical motivation for
TF-IDF. But here goes:

  A word occurring once in a 10 word document is more important than a
  word occurring once in a 100 word document because in the 10 word
  document, it makes up a larger part of the document content.
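
As a worked illustration of that intuition under Euclidean normalization, assume (purely for the sake of the example) that every word in each document occurs exactly once, so each raw count c_v is 1 and a document of n words has n features:

```latex
x_w = \frac{c_w}{\sqrt{\sum_{v} c_v^{2}}}
\qquad\Longrightarrow\qquad
n = 10:\; x_w = \frac{1}{\sqrt{10}} \approx 0.316,
\qquad
n = 100:\; x_w = \frac{1}{\sqrt{100}} = 0.1
```

The raw count is 1 in both documents, but the normalized value is roughly three times larger in the shorter one.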

Original comment by pvogren@gmail.com on 18 Feb 2009 at 8:25

GoogleCodeExporter commented 9 years ago
[Philipp]
So why, instead of normalization, don't you just include another feature that
says "this document is 100 words long"? It would certainly be easier, and the
information content is the same (or even higher).

AFAIK people choose to do normalization instead because that way ML systems
are much less easily confused.

Original comment by pvogren@gmail.com on 18 Feb 2009 at 8:27

GoogleCodeExporter commented 9 years ago
[Steve]
Sure, there's a hundred different ways to encode any feature. Why use
TF-IDF? Why not use a TF feature and an IDF feature for every word? "The
information content is the same (or even higher)"

Original comment by pvogren@gmail.com on 18 Feb 2009 at 8:27

GoogleCodeExporter commented 9 years ago
[Philipp]

That's correct. Our TF-IDF extractor does mix concepts and is a bit classifier
specific. I've thought about trying to rewrite it, but haven't had time to
really think it through yet.

Yes, there are a hundred different ways to encode any feature. And that's
exactly why feature encoding shouldn't be done together with feature
extraction. Feature extraction is about gathering information, not about how
to represent it. The "classifier dependent / classifier independent"
distinction only arises from that because classifiers tend to be very picky
about the way a feature is _encoded_, while they don't care at all what
_information_ a feature carries.

Original comment by pvogren@gmail.com on 18 Feb 2009 at 8:28

GoogleCodeExporter commented 9 years ago
[Steve]
Some other things that should be feature encodings under this logic:

* SyntacticPathExtractor - converting the parts of the path into a
  "XX::YY;;XX" string is representing information, not gathering it

* SubCategorizationExtractor - combining the parent and child nodes into
  a "XX -> YY ZZ" string is representing information, not gathering it

* NGramExtractor - joining the pieces of the ngram into a "xx|yy|zz"
  string is representing information, not gathering it

Is this really what you're proposing?

Original comment by pvogren@gmail.com on 18 Feb 2009 at 8:29

GoogleCodeExporter commented 9 years ago
[Philipp]
I did always feel it was awkward to flatten out that information -- what if
someone came up with a classifier that could generalize over _parts_ of a
syntactic path (i.e. you give the classifier two paths, and the classifier
sees "the first three path elements are the same" and generalizes over that).
There are even now things like SVMstruct, and custom kernels and such, so it
is conceivable for such a classifier to exist.

So strictly, yes, that is what I'm proposing. Those extractors should create a
feature with a custom value that encapsulates the parts, and a feature encoder
should take that value and encode it in such a way that the classifier can use
it.
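
A minimal sketch of what such a split could look like, with made-up names throughout: `PathValue` stands in for the custom feature value, and the two `encode...` methods stand in for two different feature encoders. The reading of "::" as an upward step and ";;" as a downward step is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: one structured value, two possible encodings.
final class PathEncodingDemo {
    /** Structured value: node labels in order, plus the index of the highest node. */
    static final class PathValue {
        final List<String> labels;
        final int peakIndex;

        PathValue(List<String> labels, int peakIndex) {
            this.labels = labels;
            this.peakIndex = peakIndex;
        }
    }

    /** Encoder 1: flatten the whole path into a single string feature. */
    static String encodeAsString(PathValue path) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < path.labels.size(); i++) {
            if (i > 0) {
                // Steps up to the peak use "::", steps after it use ";;".
                sb.append(i <= path.peakIndex ? "::" : ";;");
            }
            sb.append(path.labels.get(i));
        }
        return sb.toString();
    }

    /** Encoder 2: one feature per path element, for learners that can use the parts. */
    static List<String> encodeAsParts(PathValue path) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < path.labels.size(); i++) {
            parts.add("path[" + i + "]=" + path.labels.get(i));
        }
        return parts;
    }

    public static void main(String[] args) {
        PathValue path = new PathValue(Arrays.asList("NP", "S", "VP"), 1);
        System.out.println(encodeAsString(path));
        System.out.println(encodeAsParts(path));
    }
}
```

The extractor would only ever build the `PathValue`; which of the two encodings gets used is decided elsewhere.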

Now, I'm not saying we can't cheat a little bit, especially if it's in
isolated cases (special purpose extractors that aren't used everywhere). But
complex extractors that are used in lots of places, or functionality that is
universal (like normalization schemes) should be done right.

Original comment by pvogren@gmail.com on 18 Feb 2009 at 8:31

GoogleCodeExporter commented 9 years ago
[Steve]
Well, I at least now see where you're going: anything that's just taking
a piece of information from the CAS is feature extraction, doing
anything at all with that information is feature encoding.

That said, I'm probably always going to "cheat" in my own code and put
most functionality into feature extractors because there's only one
class to implement instead of three, and when you're done it works for
all classifiers instead of just one. But I'm fine with keeping my more
practical (but less pure) code out of ClearTK.

Steve

P.S. I think my current plan will probably be to create an
EuclideanNormExtractor which takes as constructor parameters other
SimpleFeatureExtractors. When .extract() is called, it will collect all
their Features, apply normalization and return the resulting
List<Feature>. This way, I can group features arbitrarily for
normalization by simply creating more than one EuclideanNormExtractor.
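
Roughly, the plan could be sketched like this. The `Feature` class and `SimpleFeatureExtractor` interface below are simplified stand-ins written for the example (the real extractors work on the CAS, not on a String), so treat the signatures as assumptions rather than the ClearTK API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a wrapping extractor that normalizes one feature group.
final class EuclideanNormDemo {
    /** Simplified feature: a name and a numeric value. */
    static final class Feature {
        final String name;
        final double value;

        Feature(String name, double value) {
            this.name = name;
            this.value = value;
        }
    }

    /** Simplified extractor; the input type is a placeholder for the CAS view. */
    interface SimpleFeatureExtractor {
        List<Feature> extract(String input);
    }

    /** Collects the features of its delegates and L2-normalizes them as one group. */
    static final class EuclideanNormExtractor implements SimpleFeatureExtractor {
        private final List<SimpleFeatureExtractor> delegates;

        EuclideanNormExtractor(SimpleFeatureExtractor... delegates) {
            this.delegates = Arrays.asList(delegates);
        }

        @Override
        public List<Feature> extract(String input) {
            List<Feature> collected = new ArrayList<>();
            double sumOfSquares = 0.0;
            for (SimpleFeatureExtractor delegate : delegates) {
                for (Feature f : delegate.extract(input)) {
                    collected.add(f);
                    sumOfSquares += f.value * f.value;
                }
            }
            double norm = Math.sqrt(sumOfSquares);
            List<Feature> normalized = new ArrayList<>();
            for (Feature f : collected) {
                // An all-zero group is passed through as zeros.
                normalized.add(new Feature(f.name, norm == 0.0 ? 0.0 : f.value / norm));
            }
            return normalized;
        }
    }

    public static void main(String[] args) {
        SimpleFeatureExtractor words = in -> Arrays.asList(new Feature("word=a", 3.0));
        SimpleFeatureExtractor more = in -> Arrays.asList(new Feature("word=b", 4.0));
        for (Feature f : new EuclideanNormExtractor(words, more).extract("doc")) {
            System.out.println(f.name + " " + f.value);
        }
    }
}
```

Word-based and citation-based features would each get their own `EuclideanNormExtractor`, so each group is normalized separately.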

Original comment by pvogren@gmail.com on 18 Feb 2009 at 8:56

GoogleCodeExporter commented 9 years ago
[Philipp]
It's really one class instead of two (feature extractor / feature encoder)
plus adding one line to the encoder factory. And since feature encoders can be
used for not only one type of classifier, the result does work with most
classifiers (and where it doesn't it's trivial to get it to work to the extent
that your approach does). With eclipse's help in writing Java boilerplate for
you, you end up writing the same amount of code in either case. They're both
equally practical in that sense.

It would be helpful at this point to have 2P's input. It appeared before,
however, that his perspective was similar to yours. That being the case, it
makes much more sense for you two to structure feature extraction / feature
encoding / whatever you want to call it the way that seems right to you. I can
maintain my own set of changes to implement it the way I prefer it.

Original comment by pvogren@gmail.com on 18 Feb 2009 at 9:10

GoogleCodeExporter commented 9 years ago
[Steve's summary]

Steve's view: feature extractors are for classifier independent code
--------------------------------------------------------------------
Anything involving features that is classifier independent belongs in
the feature extraction layer. Things that are classifier dependent (e.g.
the string to number conversions of SVMs) belong in the feature encoding
layer.

Example: Converting a syntactic path to "NP::S;;VP" is classifier
independent, so it belongs in feature extraction.

Example: Euclidean normalization can be applied to features for any type
of classifier, so it belongs in feature extraction.

Feature extractors are easier to create and use because you only need to
create a single class (e.g. EuclideanNormalizationFeatureExtractor) and
use it in your AnnotationHandler.

Feature extractors also have the advantage of working for any classifier.

3P's view: feature extractors are only for selecting pieces of the CAS
----------------------------------------------------------------------
The only thing that feature extractors should do is look at the CAS and
select pieces of it. Anything that modifies, combines, etc. the pieces
of the CAS belongs in the feature encoding layer.

Example: Extracting a path of NP, S and VP nodes from the CAS belongs in
feature extraction, but converting those objects to the string
"NP::S;;VP" is a representation issue so it belongs in feature encoding.

Example: Euclidean normalization is a transformation of information
extracted from the CAS, so it belongs in feature encoding.

Feature encoders are easy enough to use. You just need to create a new
feature encoder class (e.g. EuclideanNormalizationFeatureEncoder),
create a new encoder factory class which inherits from an existing
encoder factory (e.g. SVMEncoderFactory) and adds a single call to
addEncoder(), and then specify your new encoder factory using the
"EncoderFactoryClass" parameter to DataWriter_ImplBase.

Feature encoders aren't totally classifier independent, but in many
cases, your code would work for multiple classifiers (e.g. all SVMs, and
more if we can merge ContextValue and FeatureVector).
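
For comparison, the encoder route might look something like the sketch below. `BaseEncoderFactory` is a made-up stand-in for an existing factory such as SVMEncoderFactory, encoders are reduced to functions over a `double[]`, and only the "subclass and add one addEncoder() call" shape is taken from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch of the "subclass a factory, add one encoder" route.
final class EncoderFactoryDemo {
    /** Stand-in for an existing encoder factory with an ordered list of encoders. */
    static class BaseEncoderFactory {
        private final List<UnaryOperator<double[]>> encoders = new ArrayList<>();

        protected void addEncoder(UnaryOperator<double[]> encoder) {
            encoders.add(encoder);
        }

        /** Applies every registered encoder in order to a copy of the values. */
        double[] encode(double[] values) {
            double[] out = values.clone();
            for (UnaryOperator<double[]> encoder : encoders) {
                out = encoder.apply(out);
            }
            return out;
        }
    }

    /** The new factory: inherits everything and adds a single encoder. */
    static class NormalizingEncoderFactory extends BaseEncoderFactory {
        NormalizingEncoderFactory() {
            addEncoder(EncoderFactoryDemo::l2Normalize);
        }
    }

    /** The new encoder: Euclidean normalization of one feature vector. */
    static double[] l2Normalize(double[] values) {
        double sumOfSquares = 0.0;
        for (double v : values) {
            sumOfSquares += v * v;
        }
        double norm = Math.sqrt(sumOfSquares);
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = norm == 0.0 ? 0.0 : values[i] / norm;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] encoded = new NormalizingEncoderFactory().encode(new double[] {3.0, 4.0});
        System.out.println(java.util.Arrays.toString(encoded));
    }
}
```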

Steve

Original comment by pvogren@gmail.com on 18 Feb 2009 at 9:11

GoogleCodeExporter commented 9 years ago
[Philipp's summary]

There's no misrepresentation, I'm just going to rephrase a bit where I think
the terminology is unclear. For one thing, let's not use the "classifier
dependent" / "classifier independent" terms, because we both use them in
different ways. I'm trying to be fair in representing both sides, let me know
if you disagree with the way I'm phrasing things.

Steve's view
-------------------
Anything involving features that can in principle be applied to all
classifiers belongs in the feature extraction layer. Things that only apply to
specific classifiers (and can't reasonably be applied to others), such as the
string to number conversions of SVMs, belong in the feature encoding layer.

Example: Creating a syntactic path feature such as the string "NP::S;;VP" can
be done (and is potentially useful) for any classifier, so it belongs in
feature extraction.

Example: Euclidean normalization can be applied to features for any type of
classifier, so it belongs in feature extraction.

Feature extractors are easier to create and can be immediately applied to any
classifier. In exchange they commit to one specific representation of the
feature, which may not give best results with all classifiers, and which can
only be optimized to a different classifier by changing the code in the
feature extractor.

Philipp's view:
--------------------- 
Anything involving features that is potentially affected by the choice of
classifier should go into feature encoding. Things that *can* be applied to
any classifier, but have potentially different effects, should also go into
feature encoding. Only things that are not related to classifier choice in any
way should go into feature extraction.

Example: Extracting a path of NP, S and VP nodes from the CAS belongs in
feature extraction, but converting those objects to the string "NP::S;;VP" is
only one possible representation; some classifiers may allow a more powerful
representation, so the choice to create that string should be made in feature
encoding.

Example: Euclidean normalization is a transformation of information extracted
from the CAS; there is an infinite number of such transformations that could
conceivably be applied, and the choice of classifier dictates which ones
promise good results and which ones don't. Thus it belongs in feature
encoding.

Creating a new kind of feature extractor in this model requires a bit more
work. The feature extractor itself is much simpler. But you also create a new
feature encoder class (e.g. EuclideanNormalizationFeatureEncoder), which does
the main work. Then you modify your encoder factory (or subclass a default
one, if you're not using a custom one yet) and add a single call to
addEncoder() with the new encoder as argument. The factory class is passed to
the DataWriter as a parameter as always.

This does not automatically let you use the feature extractor for all
classifiers. To make it work with another classifier, you might have to
subclass another encoder factory. If the classifier works in a very different
way, you might have to write another feature encoder. In exchange the user of
this feature extractor can use it in their annotation handlers with no
consideration to the type of classifier used. When switching to a new
classifier, it's always possible to achieve optimum performance with that
classifier by only customizing feature encoding, not the annotation handler.
It's also easier to experiment with different ways of representing a feature
by using different feature encoders.

Original comment by pvogren@gmail.com on 18 Feb 2009 at 9:16

GoogleCodeExporter commented 9 years ago
[Philip Ogren]

I can see valid points on both sides of the argument. However, I think that
Philipp has made a clearer case for his approach. Let me start by going
through our working examples:

- syntactic path example. For one, it is no extra work for the encoder for it
to receive a Feature whose value was a syntactic path object and do the
default thing which is to convert it to a string - presumably the syntactic
path object knows how to do this for the encoder anyways. For two, doing this
sort of thing complicates feature proliferation - if there are "sub-features"
to be had from the syntactic path then getting them out of a string
representation is a pain (but this is an aside). For three, suppose there is
some svm kernel that can really take advantage of structured features - why
make the annotation handler worry about this? I would argue that we did the
syntactic path features wrong. If you look at the WindowFeature - we don't
just automatically create a name/value pair where the value encodes all of the
pertinent information - we actually pass a value that has all of the
information explicitly represented.

- Euclidean normalization - I think Philipp's arguments are more compelling
here and I like his three step solution for accomplishing what you need. I
think it is out of place for the AnnotationHandler to be deciding which
normalization technique to use - let the encoder factory set this up.

- binning values (my example) - When I use maxent - since it doesn't really
handle numeric values in the same way that svm's do it is convenient to bin
them (e.g. high, medium, low). Why should the annotation handler care which
classifier is being used and how best to bin feature values - it shouldn't.

- tf/idf - Philipp and I discussed this and decided that it makes sense for
the annotation handler to count up term frequencies and that's it. IDF values
are going to come from some precomputed value and they can be used just as
easily in the feature encoder as in the annotation handler. And aren't there
like 15 ways to calculate TF/IDF? Of course, how the various ways of
calculating TF/IDF should be abstracted out is still an open question - but it
seems to me that deciding how term frequency information is presented to the
classifier is a job for the encoder, not the extractor.
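
A minimal sketch of the tf/idf split described in the last bullet, using invented names: `countTerms` plays the part of the annotation handler and `encode` the part of the feature encoder. The weighting used (raw count times a precomputed IDF, with 0.0 for terms missing from the table) is just one of the many possible variants and is an assumption made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: term counting on the extraction side, IDF weighting on the encoding side.
final class TfIdfSplitDemo {
    /** Extraction side: count how often each token occurs, nothing more. */
    static Map<String, Integer> countTerms(String[] tokens) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    /** Encoding side: multiply each count by a precomputed IDF value. */
    static Map<String, Double> encode(Map<String, Integer> counts, Map<String, Double> idf) {
        Map<String, Double> weighted = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            // Terms without a precomputed IDF get weight 0.0 in this sketch.
            double weight = idf.getOrDefault(entry.getKey(), 0.0);
            weighted.put(entry.getKey(), entry.getValue() * weight);
        }
        return weighted;
    }

    public static void main(String[] args) {
        Map<String, Double> idf = new HashMap<>();
        idf.put("parser", 2.0);
        idf.put("the", 0.5);
        Map<String, Integer> counts = countTerms(new String[] {"the", "parser", "the"});
        System.out.println(counts);
        System.out.println(encode(counts, idf));
    }
}
```

Swapping in a different TF/IDF variant would then mean replacing `encode` only; the counting code stays the same.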

Architecturally - I think Philipp's proposal is the right one and we should go
down that route. The distinction between feature extraction and feature
encoding is clear and it will be a much more powerful and flexible approach.
One of my mental hangups is that I have this nagging intuition that Steve's
approach "would just be easier". Another is getting used to the idea of having
many different feature encoding scenarios for a particular annotation handler.
After all, we started off with no feature encoding, assuming that once we had
the features it was just a matter of creating the right file format. However,
I think that when we get used to writing EncoderFactories to go with different
annotation handler / classifier combinations, this will all start to feel
quite natural. Solutions for how to make expected behavior the default or
easy-to-use will become obvious, I think. I don't think that Philipp's
assessment that creating new feature encoders is going to be harder than
creating new feature extractors is correct. In most cases we can treat the
encoder factory as similar to a configuration file, and a new factory class
will be the only thing required of a developer for a new feature encoding
strategy.

Original comment by pvogren@gmail.com on 19 Feb 2009 at 12:52

GoogleCodeExporter commented 9 years ago
Well, 2 against 1, so that settles it. In ClearTK, feature extractors will
only select pieces of the CAS, and will never combine or transform these in
any way (e.g. they will never format objects as strings or apply normalization
to feature values).

One of you should put together a tutorial/explanation of what the different
layers are for, and what should go where. It would be good to warn people that
for most complex tasks, they'll end up writing both an AnnotationHandler and
an EncoderFactory. That's not unreasonable - "AnnotationHandler" and
"EncoderFactoryClass" are both parameters for DataWriters.

I personally don't like having to write and synchronize two classes for every
task I do. So in my own code, I'm probably going to continue to do everything
at the feature extraction level. But I'll make sure to keep that code out of
ClearTK.

Feel free to rewrite my TF/IDF code to move parts into the encoding layer.
It's not really clear to me how you'd do that, so I'll be interested to see
what you guys come up with.

Original comment by steven.b...@gmail.com on 19 Feb 2009 at 1:10

GoogleCodeExporter commented 9 years ago
Too bad we didn't have the dev list running before this thread was started.
Duh! I think I will post a note to the list just so that this thread is
searchable from the list archive.

Philipp has opened a real issue related to the first posts of this issue at #74.

Original comment by pvogren@gmail.com on 18 Mar 2009 at 10:06