Open williammtan opened 1 week ago
I checked it now (sacrebleu -tok flores200 example.ref < example.trans) and everything works fine for me: SacreBLEU downloads the spm model automatically via the tinyurl link. In stderr logs, I see
sacreBLEU: Downloading https://tinyurl.com/flores200sacrebleuspm to /home/martin/.sacrebleu/models/flores200sacrebleuspm
I even tried
import urllib.request
with urllib.request.urlopen("https://tinyurl.com/flores200sacrebleuspm") as f, open("flores200sacrebleuspm", 'wb') as out:
    out.write(f.read())
and it works fine.
I confirm it works (and downloads the same file) even if I substitute https://tinyurl.com/flores200sacrebleuspm with https://dl.fbaipublicfiles.com/large_objects/nllb/models/spm_200/flores200_sacrebleu_tokenizer_spm.model.
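One way to check that two URLs serve the same file is to compare checksums of the downloaded bodies. The following is a minimal sketch, not part of sacrebleu: the helper names `sha256_of_url` and `same_payload` and the `opener` parameter are hypothetical and introduced only for illustration.

```python
import hashlib
import urllib.request


def sha256_of_url(url, opener=urllib.request.urlopen, chunk_size=1 << 16):
    """Stream the body of `url` and return its hex SHA-256 digest.

    `opener` is a hypothetical hook (defaults to urllib.request.urlopen)
    so the function can be exercised without network access.
    """
    digest = hashlib.sha256()
    with opener(url) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:  # empty bytes means end of stream
                break
            digest.update(chunk)
    return digest.hexdigest()


def same_payload(url_a, url_b, opener=urllib.request.urlopen):
    """Return True if both URLs deliver byte-identical content."""
    return sha256_of_url(url_a, opener) == sha256_of_url(url_b, opener)


# Example call (performs two real downloads, so it is left commented out):
# same_payload(
#     "https://tinyurl.com/flores200sacrebleuspm",
#     "https://dl.fbaipublicfiles.com/large_objects/nllb/models/spm_200/"
#     "flores200_sacrebleu_tokenizer_spm.model",
# )
```

Hashing in chunks avoids holding the whole model in memory, although for a file of this size reading it in one call would work as well.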
The question is which URL is more stable. The official Flores200 README uses https://tinyurl.com/flores200sacrebleuspm, so I guess that is meant as the permanent link, while its target may change (and the fbaipublicfiles.com link may not work in the future).
I am thus a bit reluctant to change the URL in tokenizers/tokenizer_spm.py. That said, if either
then please make a PR and I will accept it. (In the case "1", we will have to update the url each time the tinyurl alias changes its target).
@williammtan Can you try downloading https://tinyurl.com/flores200sacrebleuspm once again (both with urllib.request.urlopen and another method e.g. wget/curl)?
Can you try that with another tinyurl link?
Maybe you are behind a firewall which blocks the whole tinyurl.com.
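To tell a blocked host apart from an HTTP-level refusal, a small probe along these lines may help. It is only a sketch: `probe` and its `opener` parameter are hypothetical names, and the strings it returns are an arbitrary choice for illustration.

```python
import urllib.error
import urllib.request


def probe(url, opener=urllib.request.urlopen):
    """Return a short string describing how a request to `url` ends."""
    try:
        with opener(url) as response:
            return f"OK {response.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 403.
        return f"HTTP {exc.code}"
    except urllib.error.URLError as exc:
        # No HTTP answer at all (DNS failure, refused connection, ...).
        return f"unreachable: {exc.reason}"


# Example calls (real network requests, so they are left commented out):
# probe("https://tinyurl.com/flores200sacrebleuspm")
# probe("https://tinyurl.com/")
```

An "HTTP 403" result for the tinyurl alias together with a successful fetch of the direct link would suggest the problem is specific to tinyurl.com rather than general connectivity.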
Yet another alternative would be to catch the exception when the download fails, use e.g. https://unshorten.it/ to get the target URL, and try to download that instead, but I don't like such a solution much as it adds code not related to sacrebleu.
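A simpler variant of the catch-and-retry idea, which avoids any unshortening service, is to try a list of known URLs in order. The sketch below is not sacrebleu's actual download code: `download_with_fallback` and its `opener` parameter are hypothetical, and the choice of which URL comes first is an assumption.

```python
import urllib.error
import urllib.request


def download_with_fallback(urls, dest_path, opener=urllib.request.urlopen):
    """Try each URL in order and save the first successful body.

    Returns the URL that worked; raises RuntimeError if none did.
    `opener` is a hypothetical hook that defaults to urlopen.
    """
    errors = []
    for url in urls:
        try:
            with opener(url) as response:
                data = response.read()
        except urllib.error.URLError as exc:  # HTTPError is a subclass
            errors.append(f"{url}: {exc}")
            continue
        # Write only after a complete, successful read.
        with open(dest_path, "wb") as out:
            out.write(data)
        return url
    raise RuntimeError("all downloads failed:\n" + "\n".join(errors))


# Example call (real network requests, so it is left commented out):
# download_with_fallback(
#     [
#         "https://tinyurl.com/flores200sacrebleuspm",
#         "https://dl.fbaipublicfiles.com/large_objects/nllb/models/spm_200/"
#         "flores200_sacrebleu_tokenizer_spm.model",
#     ],
#     "flores200sacrebleuspm",
# )
```

Keeping the tinyurl alias first preserves the link the Flores200 README treats as permanent, while the direct link only serves as a backup; the backup would still need updating if the alias ever changes its target.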
Description:
I encountered an HTTP 403 error when attempting to create a BLEU object using the flores200 tokenizer in SacreBLEU.
Steps to Reproduce:
tokenize="flores200".
Error Message:
Cause:
The error occurs because the flores200 tokenizer URL uses a tinyurl link, which is not accessible via urllib.request.urlopen due to HTTP 403 restrictions.
Proposed Solution:
To resolve this issue, update the flores200 URL in /tokenizers/tokenizer_spm.py:
Thank you for your help!