mjpost / sacrebleu

Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons
Apache License 2.0

[Feature Request] HTER Implementation #248

Open shivanraptor opened 9 months ago

shivanraptor commented 9 months ago

Is there an implementation of the Human-mediated Translation Edit Rate (HTER) algorithm?

Related paper: https://aclanthology.org/2006.amta-papers.25/

martinpopel commented 9 months ago

SacreBLEU includes an implementation of TER; use `-m ter`. The implementation of HTER is exactly the same: you just need to use "targeted" references for the MT system you plan to evaluate (i.e. human post-edits of the MT output, possibly produced with the help of existing untargeted references). If you need to strictly follow the original HTER paper, you should also have a set of untargeted references, multiply the final score by the average length of the targeted references, and divide by the average length of the untargeted references.

Note that HTER computation is very costly because you need to create a new targeted reference for (each version of) each MT system you plan to evaluate. If you want to compare several MT systems fairly, you should create their targeted references at the same time with the same pool of annotators and make sure the assignment of annotators is random.

Note also that HTER was invented before the introduction of modern NMT systems, so we don't know how well it correlates with human judgements. It is also well known that some systems produce worse translations yet need fewer post-editing edits than others, so HTER would favour such systems and be biased against the rest (similarly to BLEU).
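For concreteness, here is a minimal sketch of that recipe using sacrebleu's Python API. The file names `hyps.txt`, `targeted.txt` and `untargeted.txt` are hypothetical placeholders; the length adjustment is the optional scaling from the HTER paper described above. On the command line, the unadjusted score is just the usual `sacrebleu targeted.txt -i hyps.txt -m ter`.

```python
# Sketch of HTER via sacrebleu's TER metric (file names are hypothetical).
# hyps.txt       -> MT system output, one segment per line
# targeted.txt   -> human post-edits of that output ("targeted" references)
# untargeted.txt -> independent references ("untargeted"), only needed for the
#                   length adjustment from the original HTER paper
from sacrebleu.metrics import TER


def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


hyps = read_lines("hyps.txt")
targeted = read_lines("targeted.txt")
untargeted = read_lines("untargeted.txt")

ter = TER()

# HTER = TER of the MT output against its targeted (post-edited) references.
hter = ter.corpus_score(hyps, [targeted])
print(hter)  # e.g. "TER = 23.45"

# Optional length adjustment: scale by avg targeted length / avg untargeted length.
avg_len = lambda lines: sum(len(s.split()) for s in lines) / len(lines)
adjusted = hter.score * avg_len(targeted) / avg_len(untargeted)
print(f"length-adjusted HTER = {adjusted:.2f}")
```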