mjpost / sacrebleu

Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons
Apache License 2.0

Error when creating BLEU object with flores200 tokenizer: HTTP 403 Forbidden #274

Open williammtan opened 1 week ago

williammtan commented 1 week ago

Description:
I encountered an HTTP 403 error when attempting to create a BLEU object using the flores200 tokenizer in SacreBLEU.

Steps to Reproduce:

  1. Create a BLEU object with tokenize="flores200".
  2. Run the script.

Error Message:

File "/Users/williamtan/Projects/indonesiaku-benchmarking/benchmark.py", line 33, in __init__
    "bleu": BLEU(tokenize="flores200"),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  ...
  File "/Users/williamtan/miniconda3/envs/ai_scientist/lib/python3.11/site-packages/sacrebleu/utils.py", line 430, in download_file
    with urllib.request.urlopen(source_path) as f, open(dest_path, 'wb') as out:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib.error.HTTPError: HTTP Error 403: Forbidden

Cause:
The error appears to occur because the flores200 tokenizer URL is a tinyurl link, and the request that urllib.request.urlopen sends to it is rejected with HTTP 403 Forbidden.

Proposed Solution:
To resolve this issue, update the flores200 URL in tokenizers/tokenizer_spm.py so that it points directly at the model file:

"flores200": {
    "url": "https://dl.fbaipublicfiles.com/large_objects/nllb/models/spm_200/flores200_sacrebleu_tokenizer_spm.model",
    "signature": "flores200",
}

Thank you for your help!

martinpopel commented 1 week ago

I checked it now (sacrebleu -tok flores200 example.ref < example.trans) and everything works fine for me: SacreBLEU downloads the spm model automatically via the tinyurl link. In stderr logs, I see

sacreBLEU: Downloading https://tinyurl.com/flores200sacrebleuspm to /home/martin/.sacrebleu/models/flores200sacrebleuspm

I even tried

import urllib.request
with urllib.request.urlopen("https://tinyurl.com/flores200sacrebleuspm") as f, open("flores200sacrebleuspm", 'wb') as out:
    out.write(f.read())

and it works fine.

I confirm it works (and downloads the same file) even if I substitute https://tinyurl.com/flores200sacrebleuspm with https://dl.fbaipublicfiles.com/large_objects/nllb/models/spm_200/flores200_sacrebleu_tokenizer_spm.model.

The question is which URL is more stable. The official Flores200 README uses https://tinyurl.com/flores200sacrebleuspm, so I guess that is meant as the permanent link, while its target may change (in which case the fbaipublicfiles.com link may stop working in the future).

I am thus a bit reluctant to change the URL in tokenizers/tokenizer_spm.py. That said, if either

  1. more users report that they cannot use the tinyurl link, or
  2. you provide some evidence that the fbaipublicfiles.com link can be considered permanent,

then please make a PR and I will accept it. (In case 1, we would have to update the URL each time the tinyurl alias changes its target.)

martinpopel commented 1 week ago

@williammtan Can you try downloading https://tinyurl.com/flores200sacrebleuspm once again (both with urllib.request.urlopen and with another method, e.g. wget or curl)? Can you also try another tinyurl link? Maybe you are behind a firewall that blocks all of tinyurl.com.
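
The urllib half of that check can be scripted. The following is a minimal sketch, not part of sacrebleu: probe and its opener parameter are hypothetical names introduced here, and the opener argument exists only so the function can be exercised without network access. It reports the HTTP status or error for a URL without saving the body.

```python
import urllib.error
import urllib.request


def probe(url, opener=urllib.request.urlopen, timeout=30):
    """Return a short status string for url (hypothetical helper, not sacrebleu API)."""
    try:
        with opener(url, timeout=timeout) as response:
            return f"OK {response.status}"
    except urllib.error.HTTPError as err:
        # HTTPError is a subclass of URLError, so it has to be caught first.
        return f"HTTP {err.code} {err.reason}"
    except urllib.error.URLError as err:
        # No HTTP response at all (DNS failure, refused connection, ...).
        return f"URL error: {err.reason}"
```

Calling probe on both https://tinyurl.com/flores200sacrebleuspm and the fbaipublicfiles.com URL from the same machine should show whether only the tinyurl link is refused. If a second tinyurl link also fails while the direct link works, that points to tinyurl.com being blocked for that network or client rather than to a problem with this particular alias.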

Yet another alternative would be to catch the exception when the download fails, use e.g. https://unshorten.it/ to get the target URL, and try to download that instead. But I don't much like that solution, as it adds code unrelated to sacrebleu.
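
For illustration, a simpler variant of the catch-and-retry idea avoids the unshortening service and walks a fixed list of candidate URLs instead. This is only a sketch under assumptions: download_with_fallback and its opener parameter are hypothetical names, not sacrebleu functions, and it swaps the unshorten.it lookup for a hard-coded second URL, so it would share the drawback that the direct link may go stale.

```python
import urllib.error
import urllib.request


def download_with_fallback(urls, dest_path, opener=urllib.request.urlopen):
    """Try each URL in order and save the first successful response.

    Hypothetical helper, not sacrebleu API. Returns the URL that worked.
    """
    last_error = None
    for url in urls:
        try:
            with opener(url) as response:
                data = response.read()
        except urllib.error.URLError as err:
            # Covers HTTPError (e.g. 403) as well as connection failures.
            last_error = err
            continue
        # Write only after a complete read, so a failed attempt leaves no file.
        with open(dest_path, "wb") as out:
            out.write(data)
        return url
    raise RuntimeError(f"all download URLs failed: {urls}") from last_error
```

With the tinyurl link first and the fbaipublicfiles.com link second, the alias stays the preferred source and the direct link is used only when the alias is refused.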