qe-team / marmot

MARMOT - the open source framework for feature extraction and machine learning, designed to estimate the quality of Machine Translation output
ISC License

Discussion: Training data selection based on features of the test set #27

Open chrishokamp opened 9 years ago

chrishokamp commented 9 years ago

We haven't really covered this topic yet, but it could make a big difference. The WMT14 data contains many sentences that don't seem to have any connection to the test set, and removing these instances is likely to improve performance.

In general, it is a good idea to select a training set that has good coverage of the features found in the test set.