ghpaetzold / questplusplus

Pipelined quality estimation.

Feature responses generation #37

Open kaayushi opened 7 years ago

kaayushi commented 7 years ago

How can we generate the responses for the training set of an arbitrary language pair? They are required as an input for the machine learning module.

carolscarton commented 7 years ago

Hi, I am not sure I understood your question. Do you want to extract features for a new language pair?

kaayushi commented 7 years ago

No. The machine learning module takes two files as input. One of them is the feature file for the training data and the other is the response file for the training data. So how do we generate those responses?

carolscarton commented 7 years ago

Hi, could you please clarify what you mean by "responses"?

Yes, the machine learning module expects two files as input for training. For instance, from learning/config/svr.cfg:

x_train: data/features/wmt2012_qe_baseline/training.qe.baseline.tsv
y_train: data/features/wmt2012_qe_baseline/training.effort

The x_train file contains the features and the y_train file contains the labels that we want to learn.
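To make the relationship between the two files concrete, here is a minimal sketch in Python. It assumes (this is not stated in the thread) that the feature file holds one tab-separated row of numeric features per sentence and that the label file holds one numeric label per line, in the same sentence order. The function names and the toy data are hypothetical and are not part of QuEst++.

```python
import io


def load_features(handle):
    """Parse tab-separated feature rows into lists of floats (assumed format)."""
    rows = []
    for line in handle:
        line = line.strip()
        if line:
            rows.append([float(value) for value in line.split("\t")])
    return rows


def load_labels(handle):
    """Parse one numeric label per non-empty line (assumed format)."""
    return [float(line) for line in handle if line.strip()]


def check_alignment(features, labels):
    """Row i of the features must describe the same sentence as label i."""
    if len(features) != len(labels):
        raise ValueError(
            f"{len(features)} feature rows but {len(labels)} labels"
        )
    return len(labels)


# Toy stand-ins for the x_train and y_train files (made-up values).
x_train = io.StringIO("17.0\t0.5\t3.2\n9.0\t0.8\t1.1\n")
y_train = io.StringIO("2.5\n4.0\n")

features = load_features(x_train)
labels = load_labels(y_train)
n_rows = check_alignment(features, labels)
```

With real data you would pass open file handles instead of the `io.StringIO` objects; the check is only meant to catch a label file that does not line up with the feature file before training starts.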

kaayushi commented 7 years ago

Okay. So what do those labels mean exactly? And how can I generate them? @carolscarton

carolscarton commented 7 years ago

Hi, the y_train file contains the quality label that you expect your model to learn. These quality labels in QE are usually some kind of human annotation: post-editing time, post-editing effort, HTER, etc. This process is the same as any supervised machine learning approach and, therefore, it is expected that you provide such labels. QuEst++ cannot generate them for you.

In order to define the labels you will need to have a scenario for which you want to learn some measurement of quality. For instance, if you want to build a model able to predict how much time a post-editor will need to fix a machine translated sentence, you may consider time as a good label. In order to have such labels, you will need to get humans to post-edit your training set and record the time that they needed to perform the task for each sentence.
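As a rough illustration of turning human post-editing data into a label file, the sketch below builds two kinds of labels from hypothetical annotation records: the recorded post-editing time, and an approximate edit rate between the machine translation and its post-edited version. The record layout, the field names, and the one-label-per-line output are assumptions. The edit rate here is a plain word-level Levenshtein distance divided by the post-edited length; real HTER is computed with TER, which also allows block shifts, so treat this only as an approximation.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # delete hyp_word
                            curr[j - 1] + 1,      # insert ref_word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def approx_edit_rate(mt, post_edited):
    """Edits needed to turn the MT output into the post-edit, per reference word.

    Approximation only: no block shifts, unlike real TER/HTER.
    """
    hyp, ref = mt.split(), post_edited.split()
    return word_edit_distance(hyp, ref) / max(len(ref), 1)


# Hypothetical annotation records collected from human post-editors.
records = [
    {"mt": "the cat sat on mat",
     "post_edited": "the cat sat on the mat",
     "seconds": 12.4},
    {"mt": "good morning",
     "post_edited": "good morning",
     "seconds": 3.0},
]

# One label per training sentence, in the same order as the feature rows.
time_labels = [record["seconds"] for record in records]
edit_rate_labels = [approx_edit_rate(record["mt"], record["post_edited"])
                    for record in records]

# Text of a label file with one value per line (assumed format).
label_file_text = "\n".join(str(label) for label in time_labels) + "\n"
```

Whichever label you choose, the records have to stay in the same order as the sentences used for feature extraction, since the two files are matched row by row.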

I believe you may be interested in reading more about the topic before proceeding:

1-) About machine learning and labels:

2-) About QE: