k4black / codebleu

Pip compatible CodeBLEU metric implementation available for linux/macos/win
https://pypi.org/project/codebleu/
MIT License

ngram match vs weighted ngram match #52

Open Fritz-D opened 5 months ago

Fritz-D commented 5 months ago

Expected behaviour for these two metrics would be that the weighted ngram match score with a weight of 1 for all ngrams is equivalent to the unweighted ngram match score. This is not the case, for the following reasons:

  1. Brevity Penalty: Your 'references' object in the weighted ngram match score includes a list of references and a list of weights. Because you pass this wholesale into the closest_ref_length function, which then treats the list of refs as one reference and the list of weights as another, the brevity penalty degenerates.
  2. Denominator: In the unweighted score calculation, the denominator is derived from 'counts', i.e. the length of the hypothesis; in the weighted calculation it is derived from reference_counts, i.e. the length of the reference. This leads to a difference in scores.
  3. Counts: In the unweighted score calculation, you use a max-counts object to compare the hypothesis against the full set of references at once. In the weighted score calculation, you compare the hypothesis against each individual reference and average afterwards.

Consider merging the two calculations into one fixed version, wherein the default value for weights is simply 1 across the board.
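To illustrate what such a merged version could look like, here is a minimal sketch of a modified n-gram precision that takes an optional per-ngram weight map and reduces to the plain unweighted calculation when every weight is 1. The function names (`ngrams`, `unified_ngram_precision`) and the `weights` dict interface are hypothetical, not the library's actual API; the clipping against a max over all references and the hypothesis-length denominator follow the unweighted behaviour described above:

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unified_ngram_precision(references, hypothesis, n, weights=None):
    """Modified n-gram precision with optional per-ngram weights.

    `weights` maps an n-gram tuple to a float; with weights=None
    (weight 1 everywhere) this is the plain unweighted calculation.
    Hypothetical sketch, not the codebleu implementation.
    """
    weights = weights or {}
    counts = Counter(ngrams(hypothesis, n))

    # Clip each hypothesis n-gram count against its max count
    # over ALL references (not per-reference, then averaged).
    max_ref_counts = Counter()
    for ref in references:
        for ng, c in Counter(ngrams(ref, n)).items():
            max_ref_counts[ng] = max(max_ref_counts[ng], c)
    clipped = counts & max_ref_counts  # Counter intersection = min per key

    numerator = sum(weights.get(ng, 1.0) * c for ng, c in clipped.items())
    # Denominator from the hypothesis counts, as in the unweighted score.
    denominator = max(1.0, sum(weights.get(ng, 1.0) * c for ng, c in counts.items()))
    return numerator / denominator
```

With this shape, passing an explicit weight map of all 1s gives exactly the same score as passing no weights, which is the equivalence the issue asks for.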

Fritz-D commented 5 months ago

While you're at it, you might as well use the Counter intersection operator (&) to calculate clipped counts rather than doing it manually.
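For reference, `collections.Counter` intersection takes the minimum count per key, which is exactly the "clipped count" used in modified n-gram precision (the example counts below are made up):

```python
from collections import Counter

# Hypothetical n-gram counts for a hypothesis and one reference.
hyp_counts = Counter({("the",): 3, ("cat",): 1})
ref_counts = Counter({("the",): 1, ("cat",): 2, ("mat",): 1})

# Counter & Counter keeps the minimum count for each shared key,
# so each hypothesis count is clipped at the reference count.
clipped = hyp_counts & ref_counts
# clipped == Counter({("the",): 1, ("cat",): 1})
```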