Closed pgfeldman closed 4 years ago
I cannot reproduce this. This works for me (same environment except Python 3.8, which should not make a difference). Can you try again, but force a re-download to overwrite potentially corrupt files?
tok = MarianTokenizer.from_pretrained(mname, force_download=True)
Hi,
I rebased the transformers project just before running this and updated with "pip install --upgrade ." in the root transformers directory.
Here is the code as run:
from transformers import MarianTokenizer, MarianMTModel
from typing import List
src = 'fr'  # source language
trg = 'en'  # target language
sample_text = "où est l'arrêt de bus ?"
mname = f'Helsinki-NLP/opus-mt-{src}-{trg}'
model = MarianMTModel.from_pretrained(mname, force_download=True)
tok = MarianTokenizer.from_pretrained(mname, force_download=True)
# don't need tgt_text for inference
# returns "Where is the the bus stop ?"
Here is the terminal output:
2020-05-22 05:45:15.204824: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
Downloading: 100%|██████████| 1.13k/1.13k [00:00<00:00, 568kB/s]
Downloading: 100%|██████████| 301M/301M [00:32<00:00, 9.34MB/s]
Downloading: 100%|██████████| 802k/802k [00:00<00:00, 5.85MB/s]
Downloading: 100%|██████████| 778k/778k [00:00<00:00, 5.71MB/s]
Downloading: 100%|██████████| 1.34M/1.34M [00:00<00:00, 6.69MB/s]
Downloading: 100%|██████████| 42.0/42.0 [00:00<00:00, 13.8kB/s]
stdbuf was not found; communication with perl may hang due to stdio buffering.
Traceback (most recent call last):
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils.py", line 1055, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 89, in __init__
    self._setup_normalizer()
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 95, in _setup_normalizer
    self.punc_normalizer = MosesPunctuationNormalizer(self.source_lang)
  File "C:\Program Files\Python\lib\site-packages\mosestokenizer\punctnormalizer.py", line 47, in __init__
    super().__init__(argv)
  File "C:\Program Files\Python\lib\site-packages\toolwrapper.py", line 64, in __init__
    self.start()
  File "C:\Program Files\Python\lib\site-packages\toolwrapper.py", line 108, in start
    env=env,
  File "C:\Program Files\Python\lib\subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "C:\Program Files\Python\lib\subprocess.py", line 997, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:/Development/Research/COVID-19-Misinfo2/src/translate_test_2.py", line 9, in
Process finished with exit code 1
I also tried this with 'Helsinki-NLP/opus-mt-ROMANCE-en' and had the same results. I also stepped through the code in the debugger, manually downloaded the files using my browser, and pointed the *.from_pretrained() methods to that directory. Here is the relevant code:
model_name = 'Helsinki-NLP/opus-mt-ROMANCE-en'
model = MarianMTModel.from_pretrained("./models/opus-mt-ROMANCE-en/model")
tokenizer = MarianTokenizer.from_pretrained("./models/opus-mt-ROMANCE-en/model")
And here is the directory listing. I've also attached all these files except pytorch_model.bin. If there is a problem with these files, please send me the correct ones and I can try this locally.
Directory: C:\Development\Research\COVID-19-Misinfo2\src\models\opus-mt-ROMANCE-en\model

Mode    LastWriteTime        Length     Name
-a----  5/20/2020  5:52 PM   1163       config.json
-a----  5/20/2020  5:52 PM   312086495  pytorch_model.bin
-a----  5/20/2020  6:05 PM   800087     source.spm
-a----  5/20/2020  6:08 PM   265        tokenizer_config.json
-a----  5/20/2020  6:07 PM   1460304    vocab.json
This had the same effect as the remote download:
2020-05-22 05:58:34.251856: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
dir = C:\Development\Research\COVID-19-Misinfo2\src
Traceback (most recent call last):
  File "C:/Development/Research/COVID-19-Misinfo2/src/translate_test_1.py", line 15, in
Process finished with exit code 1
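One way to rule out an incomplete manual download before calling from_pretrained() on a local directory is to check that the expected files are present. The following is only a minimal sketch: the helper name missing_files is hypothetical, and the default file names are simply those shown in the directory listing above, so the list may not cover everything the tokenizer looks for.

```python
from pathlib import Path

# File names copied from the directory listing in this thread (an assumption:
# the tokenizer may look for additional files that are not shown there).
LISTED_FILES = (
    "config.json",
    "pytorch_model.bin",
    "source.spm",
    "tokenizer_config.json",
    "vocab.json",
)


def missing_files(model_dir, expected=LISTED_FILES):
    """Return the expected file names that are absent from model_dir."""
    directory = Path(model_dir)
    return [name for name in expected if not (directory / name).is_file()]


if __name__ == "__main__":
    # Path taken from the code earlier in this comment.
    print(missing_files("./models/opus-mt-ROMANCE-en/model"))
```

An empty list only means the listed files exist; it says nothing about whether their contents are valid.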
I have downloaded and used the GPT-2 model without these problems using very similar code:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
Hope this helps,
Phil Feldman
Hi @pgfeldman, I initially faced the same error but was able to resolve it by downloading the model to a specified location using the steps below:
cache_dir = "/home/transformers_files/"
cache_dir_models = cache_dir + "default_models/"
cache_dir_tokenizers = cache_dir + "tokenizers/"
model_name = 'Helsinki-NLP/opus-mt-ROMANCE-en'
tokenizer = MarianTokenizer.from_pretrained(model_name, cache_dir=cache_dir_tokenizers)
model = MarianMTModel.from_pretrained(model_name, cache_dir=cache_dir_models)
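For readers on Windows, where a hard-coded "/home/..." path may not exist, the same idea can be sketched with pathlib so the cache folders are created up front. This is an illustration under assumptions, not a confirmed fix: the helper names make_cache_dirs and load_marian are hypothetical, and only the cache_dir argument already used above is relied on.

```python
from pathlib import Path


def make_cache_dirs(base_dir):
    """Create separate cache folders for model weights and tokenizer files."""
    base = Path(base_dir)
    dirs = {
        "models": base / "default_models",
        "tokenizers": base / "tokenizers",
    }
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)
    return dirs


def load_marian(model_name, base_dir):
    """Download (or reuse) a Marian tokenizer and model under base_dir."""
    # Imported lazily so make_cache_dirs works without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    dirs = make_cache_dirs(base_dir)
    tokenizer = MarianTokenizer.from_pretrained(
        model_name, cache_dir=str(dirs["tokenizers"])
    )
    model = MarianMTModel.from_pretrained(
        model_name, cache_dir=str(dirs["models"])
    )
    return tokenizer, model
```

A call such as load_marian('Helsinki-NLP/opus-mt-ROMANCE-en', 'transformers_files') would then download into folders relative to the working directory.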
Hi! I had the same issue after installing the mosestokenizer (as recommended) on Windows with Python 3.6. After I uninstalled it, it seemed to work fine! I think more investigation is needed there.
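The "stdbuf was not found; communication with perl may hang" message and the FileNotFoundError raised from subprocess in the traceback above suggest that mosestokenizer tries to launch an external program that Windows cannot find. A quick, hedged way to compare setups is to report which of the relevant packages are importable and whether perl is on the PATH. The function name diagnose_moses_setup is hypothetical, and whether a missing perl is the actual cause is an assumption, not something confirmed in this thread.

```python
import importlib.util
import shutil


def diagnose_moses_setup():
    """Report which Moses-related packages and external tools are available."""
    return {
        # Package that appears in the failing traceback.
        "mosestokenizer_installed": importlib.util.find_spec("mosestokenizer") is not None,
        # Pure-Python alternative mentioned later in this thread.
        "sacremoses_installed": importlib.util.find_spec("sacremoses") is not None,
        # Assumption: the external program being launched is perl.
        "perl_on_path": shutil.which("perl") is not None,
    }


if __name__ == "__main__":
    for name, value in diagnose_moses_setup().items():
        print(f"{name}: {value}")
```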
@BramVanroy did it work for you on Windows? I also can't reproduce.
I still cannot reproduce this. I tried uninstalling and reinstalling mosestokenizer, and it works in both cases.
For everyone having problems, can you run the following and post its output here so that we can find similarities? @jpcorb20 @SAswinGiridhar @pgfeldman
This requires you to be on the latest master branch (on Windows at least) so install from source!
transformers-cli env
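If the CLI is not available (for example on an older install), roughly comparable information can be gathered by hand. The sketch below is not what transformers-cli env does internally; it is an assumption-laden stand-in that needs Python 3.8 or newer for importlib.metadata, and the package list is just the names that come up in this thread.

```python
import platform
import sys
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or 'not installed'."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def collect_env(packages=("transformers", "torch", "tensorflow", "mosestokenizer", "sacremoses")):
    """Collect platform, Python, and package versions into a dict."""
    info = {
        "platform": platform.platform(),
        "python": sys.version.split()[0],
    }
    for name in packages:
        info[name] = package_version(name)
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```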
I deleted transformers and re-installed it from source.
Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.
transformers version: 2.11.0
I'm also attaching my package list [deleted by moderator for length]
Hello, here's mine :
transformers version: 2.11.0
Does
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
tokenizer.batch_encode_plus(['stuff'])
work?
Yes!
Here's the code as run:
from transformers import XLMRobertaTokenizer
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
tokenizer.batch_encode_plus(['stuff'])
print("done")
Here's the output
"C:\Program Files\Python\python.exe" C:/Users/Phil/AppData/Roaming/JetBrains/IntelliJIdea2020.1/scratches/transformers_error_2.py
2020-06-08 17:44:17.768004: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
Downloading: 100%|██████████| 5.07M/5.07M [00:00<00:00, 9.57MB/s]
done
Process finished with exit code 0
Hope this helps,
Phil
Working for me too
Can anyone help with this issue: #5040 ?
Please don't spam other topics like this in the future. We do our best to help where and when we can. Posting duplicate comments on different topics adds more noise than help.
I think this bug may be fixed on master, but I can't verify because I don't have Windows. Could one person check and post their results? Remember to be up to date with master; your git log should contain:
3d495c61e Sam Shleifer: Fix marian tokenizer save pretrained (#5043)
It doesn't work on my PC, but I swapped the Moses tokenizer library used in _setup_normalizer for sacremoses, and now it works:
def _setup_normalizer(self):
    try:
        from sacremoses import MosesPunctNormalizer
        self.punc_normalizer = MosesPunctNormalizer(lang=self.source_lang).normalize
    except ImportError:
        warnings.warn("Recommended: pip install sacremoses")
        self.punc_normalizer = lambda x: x
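The same fallback pattern can be tried outside the tokenizer class. The standalone sketch below mirrors the patch above; the function name build_punct_normalizer is hypothetical, and it assumes nothing beyond sacremoses's MosesPunctNormalizer and its normalize method, falling back to an identity function when the package is missing.

```python
import warnings


def build_punct_normalizer(lang):
    """Return a punctuation-normalizing callable, or an identity fallback."""
    try:
        # sacremoses is pure Python, so no external process is started.
        from sacremoses import MosesPunctNormalizer

        return MosesPunctNormalizer(lang=lang).normalize
    except ImportError:
        warnings.warn("Recommended: pip install sacremoses")
        # Identity fallback: text passes through unchanged.
        return lambda text: text


if __name__ == "__main__":
    normalize = build_punct_normalizer("fr")
    print(normalize("où est l'arrêt de bus ?"))
```

Because the fallback returns the input unchanged, the tokenizer keeps working without sacremoses, at the cost of skipping punctuation normalization.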
Hi Sam,
I just rebased, verified the git log, and installed using "pip install --upgrade ." I'm attaching the console record of the install.
I still get the same error(s):
2020-06-17 05:40:43.980254: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
stdbuf was not found; communication with perl may hang due to stdio buffering.
Traceback (most recent call last):
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils_base.py", line 1161, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 81, in __init__
    self._setup_normalizer()
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 87, in _setup_normalizer
    self.punc_normalizer = MosesPunctuationNormalizer(self.source_lang)
  File "C:\Program Files\Python\lib\site-packages\mosestokenizer\punctnormalizer.py", line 47, in __init__
    super().__init__(argv)
  File "C:\Program Files\Python\lib\site-packages\toolwrapper.py", line 64, in __init__
    self.start()
  File "C:\Program Files\Python\lib\site-packages\toolwrapper.py", line 108, in start
    env=env,
  File "C:\Program Files\Python\lib\subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "C:\Program Files\Python\lib\subprocess.py", line 997, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:/Users/Phil/AppData/Roaming/JetBrains/IntelliJIdea2020.1/scratches/transformers_error.py", line 9, in
Process finished with exit code 1
Hope this helps
Phil
Just upgraded to version 3.0, and everything is working!
cache_dir=cache_dir_models
This works
this works! Thanks.
🐛 Bug
MarianTokenizer.from_pretrained() fails in Python 3.6.4 in Windows 10
Information
Occurs with using the example here: https://huggingface.co/transformers/model_doc/marian.html?highlight=marianmtmodel#transformers.MarianMTModel
Model I am using (Bert, XLNet ...): MarianMTModel
Language I am using the model on (English, Chinese ...): English
The problem arises when using:
The tasks I am working on is:
To reproduce
Paste code from example and run:
Steps to reproduce the behavior:
tok = MarianTokenizer.from_pretrained(mname)
Expected behavior
prints ["Where is the the bus stop ?"]
Environment info
transformers version: 2.9.1