Extract subset of features available in a feature extractor

The majority of feature extractors extract more than one feature.

However, we might not need all of them: e.g. POSFeatureExtractor extracts POS tags for both source and target, and we might decide to use only the target tags. Features are grouped together because they use the same resources. Joint extraction sometimes saves time, but I don't think that matters for most extractors.

We can:

split feature extractors so that each returns only one value
inform a feature extractor about the subset of features we want to extract.
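
To make the two options concrete, here is a minimal sketch. The class names, the `get_features` / `get_feature_names` methods, the `features` argument and the dictionary-based context are all assumptions made for illustration; they are not taken from the existing extractors.

```python
# Hypothetical sketch: none of these classes are the project's real extractors.

class TargetPOSExtractor:
    """Option 1: one extractor per feature."""

    def get_feature_names(self):
        return ["target_pos"]

    def get_features(self, context):
        return [context["target_pos"][context["index"]]]


class POSExtractor:
    """Option 2: one extractor that accepts a subset of feature names."""

    ALL_FEATURES = ("source_pos", "target_pos")

    def __init__(self, features=None):
        # None means "extract everything", so existing configs keep working.
        selected = self.ALL_FEATURES if features is None else tuple(features)
        unknown = set(selected) - set(self.ALL_FEATURES)
        if unknown:
            raise ValueError("unknown features: %s" % sorted(unknown))
        self.features = selected

    def get_feature_names(self):
        return list(self.features)

    def get_features(self, context):
        # Each value is computed lazily, only if its name was requested.
        values = {
            "source_pos": lambda: context["source_pos"][context["source_index"]],
            "target_pos": lambda: context["target_pos"][context["index"]],
        }
        return [values[name]() for name in self.features]


# Toy context: POS tags for a two-word source and target sentence.
context = {
    "index": 1,
    "source_index": 0,
    "source_pos": ["PRP", "VBZ"],
    "target_pos": ["DT", "NN"],
}

print(TargetPOSExtractor().get_features(context))
print(POSExtractor(features=["target_pos"]).get_features(context))
print(POSExtractor().get_features(context))
```

With the second option the config stays the same size, but the user has to know the valid feature names, which is why the sketch rejects unknown names early.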

First approach: pros:

easier to find which extractor to use
looks cleaner

cons:

config might become bigger
some extractors can become inefficient (I don't know which of them though)

Second approach: pros:

easier to implement now

cons:

user doesn't know which features are extracted by an extractor
requires many small, uniform changes to all feature extractors, which might introduce new bugs.

good point. since the features are named, it's already easy to extract a subset of them from a structure with named columns like a pandas dataframe [1]. that's what I do in some experiments.
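
a minimal sketch of that kind of selection by name, assuming the extracted features have already been collected into rows; the feature names and values here are made up for illustration:

```python
import pandas as pd

# Made-up feature names and values, standing in for one extractor's output.
feature_names = ["source_pos", "target_pos", "token_length"]
rows = [
    ["PRP", "NN", 3],
    ["VBZ", "VB", 5],
]

features = pd.DataFrame(rows, columns=feature_names)

# Keep only the columns we care about; the others are simply left out.
subset = features[["target_pos", "token_length"]]
print(subset)
```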

i think in future versions we can extend the feature extractor API to allow specifying a subset, but it's not critical because the features can already be selected by name.
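
until then, selection by name could also live in a thin wrapper, so that no extractor has to change. this is only a sketch: `SubsetExtractor`, `ToyPOSExtractor` and the `get_features` / `get_feature_names` methods are hypothetical names, assumed here for illustration.

```python
class SubsetExtractor:
    """Hypothetical wrapper that keeps only the named features."""

    def __init__(self, extractor, keep):
        names = extractor.get_feature_names()
        missing = [name for name in keep if name not in names]
        if missing:
            raise ValueError("not produced by this extractor: %s" % missing)
        self.extractor = extractor
        self.keep = list(keep)
        # Positions of the kept features in the wrapped extractor's output.
        self.positions = [names.index(name) for name in self.keep]

    def get_feature_names(self):
        return list(self.keep)

    def get_features(self, context):
        # The wrapped extractor still computes everything; we only filter.
        values = self.extractor.get_features(context)
        return [values[i] for i in self.positions]


class ToyPOSExtractor:
    """Stand-in extractor that returns two named features."""

    def get_feature_names(self):
        return ["source_pos", "target_pos"]

    def get_features(self, context):
        return [context["source_pos"], context["target_pos"]]


wrapped = SubsetExtractor(ToyPOSExtractor(), keep=["target_pos"])
print(wrapped.get_feature_names())
print(wrapped.get_features({"source_pos": "PRP", "target_pos": "NN"}))
```

the wrapper saves no extraction time, since the unused features are still computed, but it keeps the change in one place.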

[1] http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.html

In reply to varvara-l's message of Fri, Mar 27, 2015 at 6:27 PM (https://github.com/qe-team/marmot/issues/33).
