The expected behaviour for these two metrics is that the weighted ngram match score with a weight of 1 for all ngrams is equivalent to the unweighted ngram match score. This is not the case.
There are three issues:
Brevity Penalty: The 'references' object in the weighted ngram match score contains both a list of references and a list of weights. Because this is passed wholesale into the closest_ref_length function, which then treats the list of refs as one reference and the list of weights as another, the brevity penalty degenerates.
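To illustrate, here is a self-contained stand-in for an NLTK-style closest_ref_length (the variable names and example data below are hypothetical): passing the combined (references, weights) object makes the function see two "references" whose lengths are just the number of refs and the number of weight lists.

```python
def closest_ref_length(references, hyp_len):
    """Length of the reference closest to the hypothesis length (ties favour the shorter)."""
    return min((len(ref) for ref in references),
               key=lambda ref_len: (abs(ref_len - hyp_len), ref_len))

refs = [["a", "b", "c"], ["a", "b", "c", "d"]]   # two tokenised references
weights = [[1, 1, 1], [1, 1, 1, 1]]              # per-token weights, same shape
hyp_len = 3

correct = closest_ref_length(refs, hyp_len)      # 3: the closest reference has 3 tokens

# Passing the combined (refs, weights) object instead: the function now sees
# two "references" of length 2 (the list of refs and the list of weights),
# so the closest reference length collapses to 2 and the brevity penalty
# is computed against the wrong length.
buggy = closest_ref_length([refs, weights], hyp_len)  # 2
```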
Denominator: In the unweighted score calculation, the denominator is based on 'counts', i.e. the length of the hypothesis; in the weighted calculation it is based on 'reference_counts', i.e. the length of the reference. This leads to a difference in scores.
Counts: The unweighted score calculation uses a max-counts object to compare the hypothesis against the whole set of references, whereas the weighted calculation compares the hypothesis with each individual reference and averages afterwards.
Consider merging the two calculations into one fixed version, wherein the default value for weights is simply 1 across the board.