Closed martinpopel closed 9 years ago
Which smoothing should be used? Smoothing will probably be needed for sentence-level metrics.
The advantage of the arithmetic mean (over the geometric mean) is that it is non-zero if at least one unigram matches.
Of course, one could still use some smoothing, but my guess is that DFKI uses no smoothing for F-score (even when used as a sentence-level metric).
Martin
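To illustrate the point about zero scores, here is a minimal sketch comparing the two means. The per-order scores are made-up numbers for a short sentence where unigrams and bigrams match but no 3-grams or 4-grams do; the function names are hypothetical and not taken from MT-ComparEval.

```python
from math import prod


def arithmetic_mean(scores):
    """Plain average; stays positive as long as any single score is positive."""
    return sum(scores) / len(scores)


def geometric_mean(scores):
    """n-th root of the product; collapses to zero if any single score is zero."""
    return prod(scores) ** (1.0 / len(scores))


# Hypothetical scores for 1-grams to 4-grams of one short sentence.
scores = [0.5, 0.25, 0.0, 0.0]

print(arithmetic_mean(scores))  # 0.1875
print(geometric_mean(scores))   # 0.0
```

This is why a geometric-mean metric usually needs smoothing at the sentence level, while an arithmetic-mean metric can do without it.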
----- Original Message -----
From: "choko" notifications@github.com To: "choko/MT-ComparEval" MT-ComparEval@noreply.github.com Cc: "Martin Popel" popel@ufal.mff.cuni.cz Sent: Friday, June 19, 2015 11:11:40 AM Subject: Re: [MT-ComparEval] Arithmetic-mean F-score (#4)
Great, less work for me ;-) Thanks.
there is no smoothing at all in rgbF
and no need (with arithmetic mean)
btw AM showed better correlations with human rankings than GM
Dr.-Ing. Maja Popović DFKI GmbH, Alt-Moabit 91c, 10559 Berlin Tel. (+49) 30 3949 1841
Done in 171daa6dad536a8dbbc37fdde47e8f398313d016. Please, can you check that the implementation is correct? After that, I will merge into master and add some infrastructure amendments.
I thought that @lefterav would check the scores, but he is probably too busy now and we need to release this fix. In addition to 171daa6dad536a8dbbc37fdde47e8f398313d016, I suggest also reimplementing Precision and Recall to use the arithmetic mean. So there will be 4 metrics: BLEU, Precision, Recall, F-Measure=wordF. (We probably don't need to use the name wordF unless we also implement the character-based charF.) The motivation behind this is to comply with the description in our paper.
Ok, I can change the implementation to use arithmetic mean. Unfortunately, Bleu depends on Precision so we will need two separate implementations of Precision. But that's not a problem.
Well, Precision.php is used in Bleu.php, where we need to keep the geometric mean. So we will need GPrecision.php and APrecision.php for geometric and arithmetic mean.
Same idea in the same second:-). An alternative would be to use parameters, but I am not sure if this is possible in the current design.
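For what the parameter-based alternative might look like, here is a rough sketch in Python rather than the project's PHP. It is not the MT-ComparEval code: the function names, the `mean` parameter, and the choice to score an n-gram order as 0 when the hypothesis is too short are all assumptions made for illustration.

```python
from collections import Counter
from math import prod


def ngrams(tokens, n):
    """Count the n-grams of a token list (empty if the list is shorter than n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(hyp, ref, n):
    """Clipped n-gram precision of a hypothesis against one reference."""
    hyp_counts = ngrams(hyp, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0  # assumption: no n-grams of this order counts as 0
    matched = sum((hyp_counts & ngrams(ref, n)).values())
    return matched / total


def precision(hyp, ref, mean="arithmetic", max_n=4):
    """One implementation; the averaging strategy is selected by a parameter."""
    per_order = [ngram_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if mean == "arithmetic":
        return sum(per_order) / max_n
    if mean == "geometric":
        return prod(per_order) ** (1.0 / max_n)
    raise ValueError(f"unknown mean: {mean!r}")


hyp = "the cat sat".split()
ref = "the cat sat down".split()
print(precision(hyp, ref, mean="arithmetic"))  # 0.75
print(precision(hyp, ref, mean="geometric"))   # 0.0
```

With this shape, BLEU would request the geometric variant and the standalone Precision metric the arithmetic one. Two separate classes, as proposed above, achieve the same thing without touching the existing design.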
Ok, I will do it later in the evening. Should I also reimport WMT tasks with new metrics?
Yes, reimport, but I have another issue regarding BLEU in the queue, so wait please.
OK, now that #25 is fixed, you can reimport WMT (I hope there will be no more problems, but I will check it on the WMT datasets).
Done in c755ca4b750e3ad14d8fe5e86e9ab504af8195a0 and f68db55d126485eb0f0968de8126d110a25ba94c.
I will reimport WMT tomorrow, because I can't connect to the server right now.
The current "F-measure" metric seems to use the geometric mean over 1-grams to 4-grams. However, we need a metric (called "wordF", as there can/will also be "charF" based on character n-grams) that is the arithmetic mean of the F-scores of 1-grams to 4-grams. As usual, the F-score is the harmonic mean of precision and recall.
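A minimal sentence-level sketch of the requested metric, written in Python for brevity. It follows only the definition given in this issue (arithmetic mean over n = 1 to 4 of the harmonic mean of n-gram precision and recall). The function names are hypothetical, a single reference is assumed, and an order with no matching n-grams is assumed to score 0; the actual MT-ComparEval and rgbF implementations may differ in these details.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list (empty if the list is shorter than n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_f_score(hyp, ref, n):
    """Harmonic mean of clipped n-gram precision and recall for one order n."""
    hyp_counts = ngram_counts(hyp, n)
    ref_counts = ngram_counts(ref, n)
    matched = sum((hyp_counts & ref_counts).values())
    if matched == 0:
        return 0.0  # assumption: no matches (or no n-grams) scores 0
    precision = matched / sum(hyp_counts.values())
    recall = matched / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def word_f(hyp, ref, max_n=4):
    """Arithmetic mean of the n-gram F-scores for n = 1..max_n, no smoothing."""
    return sum(ngram_f_score(hyp, ref, n) for n in range(1, max_n + 1)) / max_n


hyp = "the cat sat on mat".split()
ref = "the cat sat on the mat".split()
print(word_f(hyp, ref))

# A single matching unigram is enough for a non-zero score.
print(word_f("cat".split(), "the cat sat".split()) > 0)  # True
```

Swapping the word tokens for characters in the same functions would give a character-based variant along the lines of the charF mentioned above.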