huggingface / evaluate

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
https://huggingface.co/docs/evaluate
Apache License 2.0

Evaluation of form feed symbol with BLEU results in error #601

Open lowlypalace opened 3 months ago

lowlypalace commented 3 months ago

Hi, I'm generating LLM sequences with some of the HF models, such as pythia-1.4b. Some of my generations result in a sequence consisting only of the form feed character, which is ASCII character 12.

from evaluate import load

bleu = load("bleu")

prediction = "hello"
reference = chr(12)  # "\x0c", the form feed character (ASCII 12)

bleu_score = bleu.compute(
    predictions=[prediction], references=[[reference]]
)["bleu"]

Running this code results in the following error:

ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-1-8625f8bf1df7> in <cell line: 8>()
      6 reference = chr(12)
      7 
----> 8 bleu_score = bleu.compute(
      9     predictions=[prediction], references=[[reference]]
     10 )["bleu"]

2 frames
/usr/local/lib/python3.10/dist-packages/evaluate/module.py in compute(self, predictions, references, **kwargs)
    465             inputs = {input_name: self.data[input_name] for input_name in self._feature_names()}
    466             with temp_seed(self.seed):
--> 467                 output = self._compute(**inputs, **compute_kwargs)
    468 
    469             if self.buf_writer is not None:

~/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--bleu/9e0985c1200e367cce45605ce0ecb5ede079894e0f24f54613fca08eeb8aff76/bleu.py in _compute(self, predictions, references, tokenizer, max_order, smooth)
    120         references = [[tokenizer(r) for r in ref] for ref in references]
    121         predictions = [tokenizer(p) for p in predictions]
--> 122         score = compute_bleu(
    123             reference_corpus=references, translation_corpus=predictions, max_order=max_order, smooth=smooth
    124         )

~/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--bleu/9e0985c1200e367cce45605ce0ecb5ede079894e0f24f54613fca08eeb8aff76/nmt_bleu.py in compute_bleu(reference_corpus, translation_corpus, max_order, smooth)
    101     geo_mean = 0
    102 
--> 103   ratio = float(translation_length) / reference_length
    104 
    105   if ratio > 1.0:

ZeroDivisionError: float division by zero
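For context on where the zero comes from: the metric tokenizes the reference before computing the brevity-penalty ratio, and a string consisting only of form feed appears to tokenize to nothing, leaving reference_length at 0. Here is a minimal sketch of that mechanism, using plain str.split() as a stand-in for the metric's actual tokenizer (an assumption on my part, the real tokenizer may differ in detail):

# Sketch of the failure mode; str.split() stands in for the tokenizer.
reference = chr(12)             # "\x0c", form feed
tokens = reference.split()      # Python treats "\x0c" as whitespace -> []
print(tokens)                   # []

reference_length = len(tokens)  # 0
translation_length = 1          # "hello" tokenizes to one token
# ratio = float(translation_length) / reference_length  # ZeroDivisionError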

The expected behaviour would be that the score is still computed for this character, even though it is non-printable. I believe the same error will occur with other non-printable characters. Is this intended behaviour?
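In the meantime, a possible workaround (just a sketch under the assumption above, not an official fix) is to drop prediction/reference pairs whose references tokenize to nothing before calling compute():

# Hypothetical guard: keep only pairs where every reference yields at
# least one whitespace-delimited token, so compute() never receives an
# effectively empty reference.
pairs = [("hello", [chr(12)]), ("world", ["world"])]
usable = [(p, refs) for p, refs in pairs if all(r.split() for r in refs)]

if usable:
    preds, refs = zip(*usable)
    bleu_score = bleu.compute(predictions=list(preds), references=list(refs))["bleu"]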