qe-team / marmot

MARMOT - the open source framework for feature extraction and machine learning, designed to estimate the quality of Machine Translation output
ISC License

Extract subset of features available in a feature extractor #33

Open varvara-l opened 9 years ago

varvara-l commented 9 years ago

The majority of feature extractors extract more than one feature.

However, we might not need all of these features: e.g. POSFeatureExtractor extracts POS tags for both source and target, and we might decide to use only the target tags. Different features are grouped together because they use the same resources. Sometimes extracting them jointly saves some time, but I think that's not an issue for most extractors.

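We can:

  • split feature extractors so that each returns only one value
  • inform a feature extractor about the subset of features we want to extract.

First approach: pros:

  • easier to find which extractor to use
  • looks cleaner

cons:

  • config might become bigger
  • some extractors can become inefficient (I don't know which of them though)

Second approach: pros:

  • easier to implement now

cons:

  • user doesn't know which features are extracted by an extractor
  • requires making many small uniform changes to all feature extractors, might create new bugs.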
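As a rough illustration of the second approach (telling an extractor which subset of features to return), here is a minimal sketch. `ToyPOSFeatureExtractor`, its `features` argument, the method names, the feature names and the dict-shaped context object are all hypothetical and are not taken from the MARMOT code; only the idea of passing a subset comes from this issue.

```python
class ToyPOSFeatureExtractor:
    """Hypothetical stand-in for an extractor that returns several named features."""

    # Canonical order of everything this toy extractor can produce (assumed names).
    ALL_FEATURES = ("source_pos", "target_pos")

    def __init__(self, features=None):
        # None means "extract everything", so configs without a subset keep working.
        selected = self.ALL_FEATURES if features is None else tuple(features)
        unknown = set(selected) - set(self.ALL_FEATURES)
        if unknown:
            raise ValueError("unknown features: %s" % sorted(unknown))
        # Keep the canonical order so output columns stay stable.
        self.features = tuple(f for f in self.ALL_FEATURES if f in selected)

    def get_feature_names(self):
        # Tells the user which features this instance will actually return.
        return list(self.features)

    def get_features(self, context_obj):
        # Each feature is wrapped in a callable so unrequested ones are never computed.
        # context_obj is assumed to be a dict holding the tags for one token.
        computations = {
            "source_pos": lambda: context_obj["source_pos"],
            "target_pos": lambda: context_obj["target_pos"],
        }
        return [computations[name]() for name in self.features]


# Usage: ask only for the target-side tag.
context = {"source_pos": "NN", "target_pos": "VB"}
extractor = ToyPOSFeatureExtractor(features=["target_pos"])
names = extractor.get_feature_names()
values = extractor.get_features(context)
```

Reporting the selected names through `get_feature_names` is one way to soften the "user doesn't know which features are extracted" drawback, since the active subset can be inspected at runtime.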

chrishokamp commented 9 years ago

good point. since the features are named, it's already easy to extract a subset of them from a structure with named columns like a pandas dataframe [1]. that's what I do in some experiments.

i think in future versions we can add to the feature extractor API to allow specifying a subset, but it's not critical because the features can be selected by name.

[1] http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.html
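To make the select-by-name workaround concrete, here is a small sketch using a pandas DataFrame. The column names and rows are made-up placeholders, not real MARMOT output; the only assumption is that extracted features end up in a table with one named column per feature.

```python
import pandas as pd

# Made-up feature table: one row per token, one named column per feature.
feature_names = ["token", "source_pos", "target_pos"]
rows = [
    ["Haus", "NN", "NN"],
    ["geht", "VB", "VBZ"],
]
features = pd.DataFrame(rows, columns=feature_names)

# Keep only the columns we care about by listing their names.
target_only = features[["token", "target_pos"]]

# Alternatively, keep every column whose name contains a given substring.
target_prefixed = features.filter(like="target_")
```

This leaves the extractors untouched: everything is still extracted, and the unwanted columns are simply dropped afterwards.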

On Fri, Mar 27, 2015 at 6:27 PM, varvara-l notifications@github.com wrote:

The majority of feature extractors extract more than one feature.

However, we might not need all of these features: e.g. POSFeatureExtractor extracts POS tags for both source and target, and we might decide to use only the target tags. Different features are grouped together because they use the same resources. Sometimes extracting them jointly saves some time, but I think that's not an issue for most extractors.

We can:

  • split feature extractors so that each returns only one value
  • inform a feature extractor about the subset of features we want to extract.

First approach: pros:

  • easier to find which extractor to use
  • looks cleaner

cons:

  • config might become bigger
  • some extractors can become inefficient (I don't know which of them though)

Second approach: pros:

  • easier to implement now

cons:

  • user doesn't know which features are extracted by an extractor
  • requires making many small uniform changes to all feature extractors, might create new bugs.
