Arithmetic-mean F-score

martinpopel commented 9 years ago

The current "F-measure" metric seems to use the geometric mean over 1-grams to 4-grams. However, we need a metric (called "wordF", since there can/will also be a "charF" based on character n-grams) that is the arithmetic mean of the F-scores for 1-grams to 4-grams. As usual, the F-score is the harmonic mean of precision and recall.
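For concreteness, the definition above can be sketched as follows. This is a minimal sentence-level sketch in Python, not the project's PHP code; the function names (`ngrams`, `f_score`, `word_f`) are made up for illustration, and the use of clipped n-gram matches against a single reference is an assumption.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f_score(hyp, ref, n):
    """Harmonic mean of n-gram precision and recall (no smoothing)."""
    hyp_counts = ngrams(hyp, n)
    ref_counts = ngrams(ref, n)
    # Clipped matches: each n-gram counts at most as often as in the reference.
    matches = sum((hyp_counts & ref_counts).values())
    if matches == 0:
        # Also covers sentences that are too short to contain any n-gram.
        return 0.0
    precision = matches / sum(hyp_counts.values())
    recall = matches / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def word_f(hyp, ref, max_n=4):
    """Arithmetic mean of the F-scores for 1-grams up to max_n-grams."""
    return sum(f_score(hyp, ref, n) for n in range(1, max_n + 1)) / max_n
```

For example, `word_f("the cat sat".split(), "the dog sat".split())` has a unigram F-score of 2/3 and zero for all higher orders, so the arithmetic mean is 1/6.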

ondrejklejch commented 9 years ago

Which smoothing should be used? Smoothing will probably be needed for sentence-level metrics.

martinpopel commented 9 years ago

The advantage of the arithmetic mean (over the geometric mean) is that it is non-zero as long as at least one unigram matches.

Of course, one could still use some smoothing, but my guess is that DFKI uses no smoothing for F-score (even when used as a sentence-level metric).

Martin
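A small numeric illustration of this point, as a sketch in Python: the per-order scores below are invented, and `geometric_mean` here is a plain unsmoothed helper written for this example, not the project's code.

```python
import math


def arithmetic_mean(scores):
    return sum(scores) / len(scores)


def geometric_mean(scores):
    # Without smoothing, a single zero makes the whole product zero.
    if any(s == 0 for s in scores):
        return 0.0
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


# Hypothetical F-scores for 1-grams to 4-grams of one short sentence:
# some unigrams and bigrams match, but no trigram or 4-gram does.
scores = [0.5, 0.2, 0.0, 0.0]

am = arithmetic_mean(scores)  # stays positive
gm = geometric_mean(scores)   # collapses to zero
```

With these values the arithmetic mean is 0.175, while the geometric mean is 0, which is why the geometric variant needs smoothing at the sentence level.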


ondrejklejch commented 9 years ago

Great, less work for me ;-) Thanks.

martinpopel commented 9 years ago

There is no smoothing at all in rgbF, and no need for it (with the arithmetic mean).

Btw, AM (arithmetic mean) showed better correlations with human rankings than GM (geometric mean).

Dr.-Ing. Maja Popović DFKI GmbH, Alt-Moabit 91c, 10559 Berlin Tel. (+49) 30 3949 1841


ondrejklejch commented 9 years ago

Done in 171daa6dad536a8dbbc37fdde47e8f398313d016. Could you please check that the implementation is correct? After that, I will merge it into master and add some infrastructure amendments.

martinpopel commented 9 years ago

I thought that @lefterav would check the scores, but he is probably too busy now and we need to release this fix. In addition to 171daa6dad536a8dbbc37fdde47e8f398313d016, I suggest also reimplementing Precision and Recall to use the arithmetic mean. So there will be 4 metrics: BLEU, Precision, Recall, F-Measure=wordF. (We probably don't need to use the name wordF unless we also implement the character-based charF.) The motivation is to comply with the description in our paper.

ondrejklejch commented 9 years ago

Ok, I can change the implementation to use the arithmetic mean. Unfortunately, Bleu depends on Precision, so we will need two separate implementations of Precision. But that's not a problem.

martinpopel commented 9 years ago

Well, Precision.php is used in Bleu.php, where we need to keep the geometric mean. So we will need GPrecision.php and APrecision.php for the geometric and the arithmetic mean, respectively.

martinpopel commented 9 years ago

Same idea in the same second :-). An alternative would be to use parameters, but I am not sure whether this is possible in the current design.
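To make the "parameters" alternative concrete, here is a rough sketch in Python rather than the project's PHP. The class `NGramPrecision` and its constructor arguments are hypothetical and say nothing about whether the current design allows this; the idea is only that one precision implementation takes the averaging function as a parameter, so BLEU could keep the geometric mean while the standalone metric uses the arithmetic one.

```python
import math
from collections import Counter


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    # Unsmoothed: any zero component yields zero.
    if any(v == 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


class NGramPrecision:
    """Hypothetical precision metric parameterized by its averaging function."""

    def __init__(self, mean, max_n=4):
        self.mean = mean
        self.max_n = max_n

    def score(self, hyp, ref):
        per_order = []
        for n in range(1, self.max_n + 1):
            hyp_counts = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
            ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
            total = sum(hyp_counts.values())
            # Clipped matches, as in BLEU's modified precision.
            matches = sum((hyp_counts & ref_counts).values())
            per_order.append(matches / total if total else 0.0)
        return self.mean(per_order)


a_precision = NGramPrecision(arithmetic_mean)  # for the standalone Precision metric
g_precision = NGramPrecision(geometric_mean)   # for use inside BLEU
```

Two separate classes (GPrecision / APrecision) give the same results; the parameterized form only avoids duplicating the n-gram counting.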

ondrejklejch commented 9 years ago

Ok, I will do it later in the evening. Should I also reimport the WMT tasks with the new metrics?

martinpopel commented 9 years ago

Yes, reimport, but I have another issue regarding BLEU in the queue, so please wait.

martinpopel commented 9 years ago

OK, now that #25 is fixed, you can reimport WMT (I hope there will be no more problems, but I will check it on the WMT datasets).

ondrejklejch commented 9 years ago

Done in c755ca4b750e3ad14d8fe5e86e9ab504af8195a0 and f68db55d126485eb0f0968de8126d110a25ba94c.

I will reimport WMT tomorrow, because I can't connect to the server right now.

ondrejklejch / MT-ComparEval

Arithmetic-mean F-score #4