huggingface / transformers

πŸ€— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Windows: Can't find vocabulary file for MarianTokenizer #4491

Closed pgfeldman closed 4 years ago

pgfeldman commented 4 years ago

πŸ› Bug MarianTokenizer.from_pretrained() fails in Python 3.6.4 in Windows 10

Information

Occurs with using the example here: https://huggingface.co/transformers/model_doc/marian.html?highlight=marianmtmodel#transformers.MarianMTModel

Model I am using (Bert, XLNet ...): MarianMTModel

Language I am using the model on (English, Chinese ...): English

The problem arises when using:

The tasks I am working on is:

To reproduce

Paste code from example and run:

from transformers import MarianTokenizer, MarianMTModel
from typing import List
src = 'fr'  # source language
trg = 'en'  # target language
sample_text = "oΓΉ est l'arrΓͺt de bus ?"
mname = f'Helsinki-NLP/opus-mt-{src}-{trg}'

model = MarianMTModel.from_pretrained(mname)
tok = MarianTokenizer.from_pretrained(mname)
batch = tok.prepare_translation_batch(src_texts=[sample_text])  # don't need tgt_text for inference
gen = model.generate(**batch)  # for forward pass: model(**batch)
words: List[str] = tok.batch_decode(gen, skip_special_tokens=True)  # returns "Where is the the bus stop ?"
print(words)

Steps to reproduce the behavior:

  1. Run the example
  2. Program terminates on tok = MarianTokenizer.from_pretrained(mname)
stdbuf was not found; communication with perl may hang due to stdio buffering.
Traceback (most recent call last):
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils.py", line 1055, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 89, in __init__
    self._setup_normalizer()
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 95, in _setup_normalizer
    self.punc_normalizer = MosesPunctuationNormalizer(self.source_lang)
  File "C:\Program Files\Python\lib\site-packages\mosestokenizer\punctnormalizer.py", line 47, in __init__
    super().__init__(argv)
  File "C:\Program Files\Python\lib\site-packages\toolwrapper.py", line 64, in __init__
    self.start()
  File "C:\Program Files\Python\lib\site-packages\toolwrapper.py", line 108, in start
    env=env,
  File "C:\Program Files\Python\lib\subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "C:\Program Files\Python\lib\subprocess.py", line 997, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Development/Research/COVID-19-Misinfo2/src/translate_test_2.py", line 9, in <module>
    tok = MarianTokenizer.from_pretrained(mname)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils.py", line 902, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils.py", line 1058, in _from_pretrained
    "Unable to load vocabulary from file. "
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.
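The inner FileNotFoundError is raised by `subprocess` when mosestokenizer tries to spawn an external perl process that isn't on the Windows PATH. A minimal sketch of that failure mode (the binary name here is made up):

```python
import subprocess

# Popen raises FileNotFoundError when the requested executable (for
# mosestokenizer, perl) cannot be found on PATH -- the same WinError 2
# shown in the traceback above on Windows.
try:
    subprocess.Popen(["no-such-binary-on-this-machine"])
except FileNotFoundError as exc:
    print("caught:", type(exc).__name__)
```

Installing perl (and making sure it is on PATH), or avoiding mosestokenizer entirely, are the two obvious ways around this.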

Expected behavior

prints ["Where is the the bus stop ?"]

Environment info

BramVanroy commented 4 years ago

I cannot reproduce this. This works for me (same environment, except Python 3.8, which should not make a difference). Can you try again, but force a re-download to overwrite potentially corrupt files?

tok = MarianTokenizer.from_pretrained(mname, force_download=True)
pgfeldman commented 4 years ago

Hi,

I rebased the transformers project just before running this and updated with "pip install --upgrade ." in the root transformers directory.

Here is the code as run:

from transformers import MarianTokenizer, MarianMTModel
from typing import List
src = 'fr'  # source language
trg = 'en'  # target language
sample_text = "où est l'arrêt de bus ?"
mname = f'Helsinki-NLP/opus-mt-{src}-{trg}'

model = MarianMTModel.from_pretrained(mname, force_download=True)
tok = MarianTokenizer.from_pretrained(mname, force_download=True)
batch = tok.prepare_translation_batch(src_texts=[sample_text])  # don't need tgt_text for inference
gen = model.generate(**batch)  # for forward pass: model(**batch)
words: List[str] = tok.batch_decode(gen, skip_special_tokens=True)  # returns "Where is the the bus stop ?"

Here is the terminal output:

2020-05-22 05:45:15.204824: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
Downloading: 100%|██████████| 1.13k/1.13k [00:00<00:00, 568kB/s]
Downloading: 100%|██████████| 301M/301M [00:32<00:00, 9.34MB/s]
Downloading: 100%|██████████| 802k/802k [00:00<00:00, 5.85MB/s]
Downloading: 100%|██████████| 778k/778k [00:00<00:00, 5.71MB/s]
Downloading: 100%|██████████| 1.34M/1.34M [00:00<00:00, 6.69MB/s]
Downloading: 100%|██████████| 42.0/42.0 [00:00<00:00, 13.8kB/s]
stdbuf was not found; communication with perl may hang due to stdio buffering.
Traceback (most recent call last):
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils.py", line 1055, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 89, in __init__
    self._setup_normalizer()
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 95, in _setup_normalizer
    self.punc_normalizer = MosesPunctuationNormalizer(self.source_lang)
  File "C:\Program Files\Python\lib\site-packages\mosestokenizer\punctnormalizer.py", line 47, in __init__
    super().__init__(argv)
  File "C:\Program Files\Python\lib\site-packages\toolwrapper.py", line 64, in __init__
    self.start()
  File "C:\Program Files\Python\lib\site-packages\toolwrapper.py", line 108, in start
    env=env,
  File "C:\Program Files\Python\lib\subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "C:\Program Files\Python\lib\subprocess.py", line 997, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Development/Research/COVID-19-Misinfo2/src/translate_test_2.py", line 9, in <module>
    tok = MarianTokenizer.from_pretrained(mname, force_download=True)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils.py", line 902, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils.py", line 1058, in _from_pretrained
    "Unable to load vocabulary from file. "
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.

Process finished with exit code 1

I also tried this with 'Helsinki-NLP/opus-mt-ROMANCE-en' and had the same results. I also stepped through the code in the debugger, manually downloaded the files using my browser, and pointed the *.from_pretrained() methods at that directory. Here is the relevant code:

model_name = 'Helsinki-NLP/opus-mt-ROMANCE-en'  # see tokenizer.supported_language_codes for choices

model = MarianMTModel.from_pretrained("./models/opus-mt-ROMANCE-en/model")
model.save_pretrained("./models/opus-mt-ROMANCE-en/model")
tokenizer = MarianTokenizer.from_pretrained("./models/opus-mt-ROMANCE-en/model")
tokenizer.save_pretrained("./models/opus-mt-ROMANCE-en/tokenizer")

And here is the directory list. I've also attached all of these files except pytorch_model.bin. If there is a problem with these files, please send me the correct ones and I can try this locally.

Directory:

C:\Development\Research\COVID-19-Misinfo2\src\models\opus-mt-ROMANCE-en\model

Mode          LastWriteTime        Length     Name
----          -------------        ------     ----
-a----        5/20/2020  5:52 PM   1163       config.json
-a----        5/20/2020  5:52 PM   312086495  pytorch_model.bin
-a----        5/20/2020  6:05 PM   800087     source.spm
-a----        5/20/2020  6:08 PM   265        tokenizer_config.json
-a----        5/20/2020  6:07 PM   1460304    vocab.json

This had the same effect as the remote download

2020-05-22 05:58:34.251856: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
dir = C:\Development\Research\COVID-19-Misinfo2\src
Traceback (most recent call last):
  File "C:/Development/Research/COVID-19-Misinfo2/src/translate_test_1.py", line 15, in <module>
    tokenizer = MarianTokenizer.from_pretrained("./models/opus-mt-ROMANCE-en/model")
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils.py", line 902, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils.py", line 1055, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 84, in __init__
    self.spm_target = load_spm(target_spm)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 236, in load_spm
    spm.Load(path)
  File "C:\Program Files\Python\lib\site-packages\sentencepiece.py", line 118, in Load
    return _sentencepiece.SentencePieceProcessor_Load(self, filename)
TypeError: not a string

Process finished with exit code 1
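One observation: the directory listing above contains no target.spm, which would be consistent with this traceback — tokenization_marian.py calls load_spm(target_spm), and sentencepiece rejects a None path with "not a string". A quick sanity check along these lines (the expected-file list is my assumption about what a Marian checkpoint needs, and the path comes from the snippet above):

```python
import os

# Files a local Helsinki-NLP/Marian checkpoint is assumed to need; target.spm
# is notably absent from the directory listing above.
expected = ["config.json", "pytorch_model.bin", "source.spm",
            "target.spm", "vocab.json", "tokenizer_config.json"]
model_dir = "./models/opus-mt-ROMANCE-en/model"
missing = [name for name in expected
           if not os.path.isfile(os.path.join(model_dir, name))]
print("missing files:", missing)
```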

I have downloaded and used the GPT-2 model without these problems using very similar code

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

Hope this helps,

Phil Feldman



aswin-giridhar commented 4 years ago

Hi @pgfeldman, I initially faced the same error but was able to resolve it by downloading the model to a specified location using the code below:

cache_dir = "/home/transformers_files/"
cache_dir_models = cache_dir + "default_models/"
cache_dir_tokenizers = cache_dir + "tokenizers/"
model_name = 'Helsinki-NLP/opus-mt-ROMANCE-en'
tokenizer = MarianTokenizer.from_pretrained(model_name, cache_dir=cache_dir_tokenizers)
model = MarianMTModel.from_pretrained(model_name, cache_dir=cache_dir_models)
jpcorb20 commented 4 years ago

Hi! I had the same issue after installing the mosestokenizer (as recommended) on Windows with Python 3.6. After I uninstalled it, it seemed to work fine! I think more investigation is needed there.
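A quick way to see which code path the tokenizer will take is to check whether mosestokenizer is importable at all — the Marian tokenizer of that era only spawned the perl-based normalizer when the package was present (a sketch, not the library's own API):

```python
import importlib.util

# True when `import mosestokenizer` would succeed; the old
# tokenization_marian.py fell back to a no-op normalizer otherwise.
installed = importlib.util.find_spec("mosestokenizer") is not None
print("mosestokenizer installed:", installed)
```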

sshleifer commented 4 years ago

@BramVanroy did it work for you on windows? I also can't reproduce.

BramVanroy commented 4 years ago

@BramVanroy did it work for you on windows? I also can't reproduce.

I still cannot reproduce this. I tried uninstall/reinstalling mosestokenizer and it works in both cases.

For everyone having problems, can you run the following and post its output here so that we can find similarities? @jpcorb20 @SAswinGiridhar @pgfeldman

This requires you to be on the latest master branch (on Windows at least) so install from source!

transformers-cli env
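If transformers-cli is not on PATH (a common Windows situation), a rough hand-rolled equivalent of the fields it reports can be printed like this — a sketch with field names of my own choosing, not the CLI's exact output:

```python
import platform
import sys

# Collect roughly the same environment facts `transformers-cli env` reports.
env_info = {
    "platform": platform.platform(),
    "python": sys.version.split()[0],
}
for module_name in ("transformers", "torch", "tensorflow"):
    try:
        module = __import__(module_name)
        env_info[module_name] = getattr(module, "__version__", "unknown")
    except Exception:  # ImportError, or a broken install failing mid-import
        env_info[module_name] = "not installed"

for key, value in env_info.items():
    print(f"{key}: {value}")
```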
pgfeldman commented 4 years ago

I deleted and re-installed transformers and installed from source

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

I'm also attaching my package list [deleted by moderator for length]

jpcorb20 commented 4 years ago

Hello, here's mine:

sshleifer commented 4 years ago

Does

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
tokenizer.batch_encode_plus(['stuff'])

work?

pgfeldman commented 4 years ago

Yes!

Here's the code as run:

from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
tokenizer.batch_encode_plus(['stuff'])
print("done")

Here's the output

"C:\Program Files\Python\python.exe" C:/Users/Phil/AppData/Roaming/JetBrains/IntelliJIdea2020.1/scratches/transformers_error_2.py

2020-06-08 17:44:17.768004: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
Downloading: 100%|██████████| 5.07M/5.07M [00:00<00:00, 9.57MB/s]
done

Process finished with exit code 0

Hope this helps,

Phil



jpcorb20 commented 4 years ago

Working for me too

erikchwang commented 4 years ago

Can anyone help with this issue: #5040 ?

BramVanroy commented 4 years ago

Can anyone help with this issue: #5040 ?

Please don't spam other topics like this in the future. We do our best to help where and when we can. Posting duplicate comments on different topics adds more noise than it is helpful.

sshleifer commented 4 years ago

I think this bug may be fixed on master, but I can't verify because I don't have Windows. Could one person check and post their results? Remember to be up to date with master; your git log should contain 3d495c61e Sam Shleifer: Fix marian tokenizer save pretrained (#5043)

jpcorb20 commented 4 years ago

Doesn't work on my PC, but I swapped the library used for the Moses punctuation normalizer in _setup_normalizer and it works:

def _setup_normalizer(self):
    try:
        from sacremoses import MosesPunctNormalizer
        self.punc_normalizer = MosesPunctNormalizer(lang=self.source_lang).normalize
    except ImportError:
        warnings.warn("Recommended: pip install sacremoses")
        self.punc_normalizer = lambda x: x
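The patched normalizer can be exercised on its own; this sketch mirrors the try/except above, falling back to an identity function when sacremoses isn't installed:

```python
# Pure-Python punctuation normalization (no perl subprocess), mirroring the
# patched _setup_normalizer above; degrades to a no-op without sacremoses.
try:
    from sacremoses import MosesPunctNormalizer
    punc_normalizer = MosesPunctNormalizer(lang="fr").normalize
except ImportError:
    punc_normalizer = lambda text: text  # graceful fallback

print(punc_normalizer("où est l'arrêt de bus ?"))
```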
pgfeldman commented 4 years ago

Hi Sam,

I just rebased, verified the gitlog, and installed using "pip install --upgrade ." I'm attaching the console record of the install.

I still get the same error(s)

2020-06-17 05:40:43.980254: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
stdbuf was not found; communication with perl may hang due to stdio buffering.
Traceback (most recent call last):
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils_base.py", line 1161, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 81, in __init__
    self._setup_normalizer()
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 87, in _setup_normalizer
    self.punc_normalizer = MosesPunctuationNormalizer(self.source_lang)
  File "C:\Program Files\Python\lib\site-packages\mosestokenizer\punctnormalizer.py", line 47, in __init__
    super().__init__(argv)
  File "C:\Program Files\Python\lib\site-packages\toolwrapper.py", line 64, in __init__
    self.start()
  File "C:\Program Files\Python\lib\site-packages\toolwrapper.py", line 108, in start
    env=env,
  File "C:\Program Files\Python\lib\subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "C:\Program Files\Python\lib\subprocess.py", line 997, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/Phil/AppData/Roaming/JetBrains/IntelliJIdea2020.1/scratches/transformers_error.py", line 9, in <module>
    tok = MarianTokenizer.from_pretrained(mname)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils_base.py", line 1008, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils_base.py", line 1164, in _from_pretrained
    "Unable to load vocabulary from file. "
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.

Process finished with exit code 1

Hope this helps

Phil



pgfeldman commented 4 years ago

Just upgraded to version 3.0, and everything is working!

IgorBar82 commented 2 months ago


Hi @pgfeldman, I initially faced the same error but was able to resolve it by downloading the model to a specified location using the code below:

cache_dir = "/home/transformers_files/"
cache_dir_models = cache_dir + "default_models/"
cache_dir_tokenizers = cache_dir + "tokenizers/"
model_name = 'Helsinki-NLP/opus-mt-ROMANCE-en'
tokenizer = MarianTokenizer.from_pretrained(model_name, cache_dir=cache_dir_tokenizers)
model = MarianMTModel.from_pretrained(model_name, cache_dir=cache_dir_models)

This works! Thanks.