ghpaetzold / questplusplus

Pipelined quality estimation.

Oracle word level labels #28

Open znavoyan opened 8 years ago

znavoyan commented 8 years ago

Hello, I'm not sure if I'm posting in the right place. I have read the article "Multi-level Translation Quality Prediction with QuEst++" by Lucia Specia et al. There, the authors report an MAE of 0.159 on the WMT15 English-Spanish task, and I have been able to reproduce this result. They also report an MAE of 0.07 when using oracle word-level labels in addition to the baseline features. Could you please explain what oracle word-level labels are, or point me to a link where I can read more about them?

Thanks, Zaven Navoyan.

ghpaetzold commented 8 years ago

Hi Zaven, The oracle word-level labels are simply the gold-standard labels for the word-level task. :)

Regards, Gustavo Henrique Paetzold, Ph.D. Candidate in Computer Science, University of Sheffield


bittlingmayer commented 8 years ago

Just to be clear, those labels are used only in training, not in actual test, correct?

So the idea is, to reproduce that generally, one would need to have human scores or labels of some sort on the word level in the training data, but nothing additional in the live system?

ghpaetzold commented 8 years ago

The labels are also used for the test set, given that we need the same set of features during training and testing. The goal of using them was to find evidence that the quality of words provides valuable insight into the quality of the sentence. To do this realistically, yes, you would have to find a way to produce extremely reliable word-level quality estimates. :)


Gustavo Henrique Paetzold, Ph.D. Candidate in Computer Science, University of Sheffield
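
As a rough illustration of the oracle setup described above, the sketch below appends features derived from gold word-level labels to a sentence's baseline feature vector. The helper name `add_oracle_features`, the `OK`/`BAD` label strings, and the two summary features (proportion and count of `BAD` words) are assumptions made for this example; the thread does not say which oracle features the paper or QuEst++ actually used.

```python
from typing import List, Sequence


def add_oracle_features(baseline: Sequence[float], word_labels: Sequence[str]) -> List[float]:
    """Append summary features computed from gold word-level labels.

    Hypothetical helper: the choice of summary features is an assumption.
    """
    n_words = len(word_labels)
    n_bad = sum(1 for label in word_labels if label == "BAD")
    bad_ratio = n_bad / n_words if n_words else 0.0
    return list(baseline) + [bad_ratio, float(n_bad)]


# The same function is applied to training and test sentences so that both
# share one feature set, which is why gold labels are needed at test time too.
train_row = add_oracle_features([0.3, 12.0], ["OK", "BAD", "OK", "OK"])
test_row = add_oracle_features([0.5, 7.0], ["BAD", "BAD", "OK"])
```

Because the test rows also depend on gold labels, a system built this way serves as a performance reference and cannot run as a live predictor.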


bittlingmayer commented 8 years ago

Aha. I would have been surprised if it didn't help, but always worth confirming assumptions.

So one can understand 0.07 MAE as a goal / lower bound?

ghpaetzold commented 8 years ago

I would not say that it is equivalent to a lower bound estimated on human judgments, but yes, you can still use it as a "near perfect" performance reference. :)


Gustavo Henrique Paetzold, Ph.D. Candidate in Computer Science, University of Sheffield


znavoyan commented 8 years ago

Hello Gustavo, thank you for the response. I didn't try to include word-level features for sentence quality estimation. As I said earlier, I achieved an MAE of 0.15, as described in your article, using the 17 baseline features. However, when I drew histograms of the actual and predicted HTER scores, the distributions were very different from each other (please find them below). The correlation coefficient is also low, about 0.2. Have you calculated the correlation coefficient and histograms for the case where you use oracle word-level labels (where you got 0.07 MAE)? Are they better than in this case? Or maybe I did something wrong? [image: hist]

ghpaetzold commented 8 years ago

Hello. The difference in the HTER distributions is quite noticeable, and very strange. Have you checked whether there is a difference in the distributions of the training/dev/test sets? And no, unfortunately we did not go as far as studying the predicted HTER distributions of our oracle system.


Gustavo Henrique Paetzold, Ph.D. Candidate in Computer Science, University of Sheffield
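
One simple way to run the suggested check is to bin the scores of each dataset and compare the resulting histograms. The sketch below is only an illustration: the function names, the choice of 10 bins, the assumption that scores lie in [0, 1], and the sample scores are all made up for this example and are not part of QuEst++.

```python
def normalized_histogram(scores, n_bins=10, low=0.0, high=1.0):
    """Return the fraction of scores falling into each of n_bins equal-width bins."""
    counts = [0] * n_bins
    width = (high - low) / n_bins
    for s in scores:
        clipped = min(max(s, low), high)  # assumption: scores are capped to [low, high]
        idx = min(int((clipped - low) / width), n_bins - 1)
        counts[idx] += 1
    total = len(scores)
    return [c / total for c in counts]


def histogram_distance(a, b):
    """Total variation distance between two normalized histograms (0 = identical)."""
    return 0.5 * sum(abs(p - q) for p, q in zip(a, b))


# Made-up scores for illustration only.
train_scores = [0.0, 0.1, 0.12, 0.2, 0.25, 0.4, 0.8]
dev_scores = [0.05, 0.1, 0.15, 0.22, 0.3, 0.45, 0.7]

train_hist = normalized_histogram(train_scores)
dev_hist = normalized_histogram(dev_scores)
distance = histogram_distance(train_hist, dev_hist)
```

A distance near 0 suggests the two sets have similar score distributions; a distance near 1 suggests they barely overlap.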


znavoyan commented 8 years ago

Below please find histograms for the training and dev datasets. They are quite similar. The test dataset does not contain scores. I have tried training a model using SVM, random forest and even deep learning. In all cases the MAE is approximately the same (the histograms of the SVM and random forest predictions are very similar to each other), which probably means that the problem is not with the machine learning part but with the features used. I think the model just learned the median value, which minimizes the mean absolute error. The distribution of predicted values looks very much like a normal distribution. What do you think? [image: hist1] https://cloud.githubusercontent.com/assets/5467926/14742672/7ac7cd1a-08ae-11e6-941e-7f14e2b16918.png
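
The point about the median can be checked numerically. The sketch below uses made-up HTER-like scores (not WMT15 data): it scans constant predictions between 0 and 1 and compares the best one with the median and the mean of the labels.

```python
import statistics


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up, skewed HTER-like labels for illustration only.
labels = [0.0, 0.0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.4, 0.8]

# Try every constant prediction on a 0.01 grid and keep the one with the lowest MAE.
candidates = [i / 100 for i in range(101)]
best_constant = min(candidates, key=lambda c: mae(labels, [c] * len(labels)))

median_label = statistics.median(labels)
mean_label = statistics.mean(labels)
mae_at_median = mae(labels, [median_label] * len(labels))
mae_at_mean = mae(labels, [mean_label] * len(labels))
```

With these labels the best constant coincides with the median, and predicting the mean gives a higher MAE. A model whose features carry little signal can therefore reach a competitive MAE just by staying near the median.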

kashifshah commented 8 years ago

Hello,

Just for information, MAE and RMSE have recently been shown NOT to be very good metrics for assessing QE systems (especially with HTER labels/predictions). Instead, Pearson's correlation has been shown to work much better and to be more suitable for evaluating your system. There is an ACL paper, which won a best paper award in 2015, that discusses the issues in evaluating machine translation quality estimation. Here is the link:

https://www.computing.dcu.ie/~ygraham/graham-acl15.pdf

Also, in this year's WMT16 QE shared task, the primary metric used to rank systems is Pearson's correlation instead of MAE.

Best, Kashif
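
To make the contrast between MAE and Pearson's correlation concrete, here is a small sketch with made-up numbers (not taken from the paper or from WMT data). It compares a predictor that stays close to the median with one that tracks the labels but is shifted by a constant offset.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(x, y):
    """Pearson's correlation coefficient, computed from its definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Made-up labels for illustration only.
labels = [0.0, 0.05, 0.1, 0.15, 0.2, 0.3, 0.5]

# Predictor A: hovers around the median with small variation that does not track the labels.
near_median = [0.15, 0.16, 0.14, 0.15, 0.16, 0.14, 0.15]
# Predictor B: ranks sentences perfectly but overshoots every label by 0.15.
shifted = [y + 0.15 for y in labels]

mae_a, r_a = mae(labels, near_median), pearson(labels, near_median)
mae_b, r_b = mae(labels, shifted), pearson(labels, shifted)
```

Under these made-up numbers, predictor A gets the lower MAE even though its correlation with the labels is weak, while predictor B has a correlation close to 1 but a higher MAE. MAE alone would rank the less informative predictor first.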


znavoyan commented 8 years ago

Dear Kashif, thank you for the article; it explains MAE and the correlation coefficient thoroughly. Moreover, their distributions are very similar to the ones I got. The maximal correlation coefficient that I noticed is 0.595. Do you know what the best correlation coefficient ever reported is? What ways of improving the results is the community currently working on? I guess collecting larger amounts of data, narrowing the domain, and developing new kinds of features?

Regards, Zaven.

kashifshah commented 8 years ago

Hi Zaven,

> The maximal correlation coefficient that I noticed is 0.595. Do you know what is the best correlation coefficient ever reported?

That figure seems pretty good, but without knowing your test set size and the labels you are predicting, it cannot be said conclusively. We have reported up to 0.64 for the WMT12 shared task in one of our papers.

> What are the ways of improvement of the results that community currently works on? I guess collecting larger amount of data, narrowing domain, new kind feature development?

You have covered more or less all areas of improvement. :-) I will add some bits:

  • larger amount of "good quality" data
  • good and in-domain resources to extract features
  • feature selection
  • better learning algorithm

To get a better idea, please have a look at the previous WMT QE shared task papers.

Best, Kashif

znavoyan commented 8 years ago

Hi Kashif,

Thank you for the clarification.

Regards, Zaven