qe-team / marmot

MARMOT - the open source framework for feature extraction and machine learning, designed to estimate the quality of Machine Translation output
ISC License

Discussion: adding more training data #28

Open chrishokamp opened 9 years ago

chrishokamp commented 9 years ago

We don't currently provide a clean way to augment the training data, but we should be able to easily test whether adding data helps. An example use case would be using other terp-aligned word-level datasets (e.g. wmt15) to test whether performance improves on wmt14.

chrishokamp commented 9 years ago

When this is supported, each dataset will need a key describing its name or role, e.g. 'train' or 'test'. The parts of the pipeline that do different things with different parts of the data, such as learning, predicting, and evaluating, should look up the datasets they need by these keys.

varvara-l commented 9 years ago

We already have this. The "datasets" field in the config has "training" and "test" fields, and representation generators are declared under them. Any new fields can be added as well.

There is no fancy handling of this in the code, because we just tell the system that training data is something defined in config['datasets']['training'].
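For illustration, the layout described above could look roughly like this once the config has been loaded into a Python dict. Only the `datasets`, `training`, and `test` keys come from this comment; the generator entries (`func`, `args`, the module paths, and the extra `dev` field) are placeholders invented for the example and are not meant to reflect marmot's actual schema.

```python
# Illustrative sketch only. The 'func'/'args' entries, the module paths and
# the 'dev' field are hypothetical placeholders, not marmot's real schema.
config = {
    'datasets': {
        'training': [
            {'func': 'my_project.generators.wmt14_train',
             'args': ['data/wmt14/train']},
        ],
        'test': [
            {'func': 'my_project.generators.wmt14_test',
             'args': ['data/wmt14/test']},
        ],
        # Any additional field can sit next to 'training' and 'test'.
        'dev': [
            {'func': 'my_project.generators.wmt14_dev',
             'args': ['data/wmt14/dev']},
        ],
    }
}

# "Training data" is simply whatever is declared under this key.
training_generators = config['datasets']['training']
```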

chrishokamp commented 9 years ago

The issue is that we cannot retrieve the feature sets by name later in the pipeline. We only support one generator each for training and test, so users cannot, for example, combine the wmt14 and wmt15 datasets into one training dataset without writing a bit of code. I don't know what the right way to handle this is, but it feels like we should at least be able to combine several training datasets from different sources.
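As a rough sketch of the "bit of code" a user would need today, the snippet below merges several named datasets into a single training set. It assumes each dataset is a dict mapping a field name to a list of sentences, which is an assumption made for the example; `combine_datasets` and the field names `target` and `tags` are hypothetical and not part of marmot.

```python
from collections import defaultdict


def combine_datasets(datasets):
    """Concatenate several named datasets field by field.

    `datasets` maps a dataset name to a dict of {field name: list of sentences}.
    This data layout is an assumption for the sketch, not marmot's format.
    """
    combined = defaultdict(list)
    expected_fields = None
    for name, data in datasets.items():
        fields = set(data)
        if expected_fields is None:
            expected_fields = fields
        elif fields != expected_fields:
            raise ValueError('dataset %r has fields %s, expected %s'
                             % (name, sorted(fields), sorted(expected_fields)))
        # Every field of one dataset must describe the same sentences.
        if len({len(rows) for rows in data.values()}) > 1:
            raise ValueError('dataset %r has fields of unequal length' % name)
        for field, rows in data.items():
            combined[field].extend(rows)
    return dict(combined)


# Toy data standing in for two word-level QE datasets.
wmt14 = {'target': [['das', 'ist', 'gut']],
         'tags': [['OK', 'OK', 'BAD']]}
wmt15 = {'target': [['ein', 'Test'], ['noch', 'einer']],
         'tags': [['OK', 'BAD'], ['OK', 'OK']]}

training = combine_datasets({'wmt14': wmt14, 'wmt15': wmt15})
```

Keeping the dataset names as keys until the merge is what would let later pipeline steps retrieve or exclude a source by name.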

Also, in a real testing scenario, the user won't have the gold-standard labels and will probably evaluate their models using cross-validation, so we should account for that in the 'learning' step.
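To make the cross-validation point concrete, here is a minimal, dependency-free sketch of how the 'learning' step could split the labelled training data into folds when no gold-standard test labels exist. `k_fold_splits` is a hypothetical helper written for this example; it does not shuffle, and a real implementation might instead rely on an existing library.

```python
def k_fold_splits(n_items, n_folds):
    """Yield (train_indices, held_out_indices) pairs for k-fold cross-validation.

    Folds are contiguous and unshuffled; the first `n_items % n_folds` folds
    get one extra item so that every item is held out exactly once.
    """
    if not 2 <= n_folds <= n_items:
        raise ValueError('need 2 <= n_folds <= n_items')
    indices = list(range(n_items))
    base, extra = divmod(n_items, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, held_out
        start += size


# Ten labelled training sentences, three folds.
splits = list(k_fold_splits(10, 3))
```

Each model would be trained on `train` and scored on `held_out`, and the scores averaged across folds.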
