I seem to have fixed it by following https://github.com/huggingface/transformers/issues/9687 and using transformers 4.5.1 instead.
Same problem as #12536. @LysandreJik
I got the same error when loading the model "bert-base-uncased".
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Is this still a problem here? I can load the tokenizer, save it, and then load it again without an internet connection.
Neither linked issue was ever fixed, so I would say so.
A simple workaround would be to just do:
```python
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2", cache_dir="<some_directory>")
tok.save_pretrained("<some_directory>")
```
and loading it from there without internet, but I guess it would indeed be more user-friendly to allow this automatically once the tokenizer has been downloaded once.
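For completeness, the offline half of that workaround might look like this (a sketch, assuming <some_directory> has been copied to the machine without internet access):

```python
from transformers import GPT2Tokenizer

# Load directly from the saved directory rather than the hub name,
# so no network request is attempted.
tok = GPT2Tokenizer.from_pretrained("<some_directory>")
```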
I dug a bit more into it in the linked issue #12536 (now stale). The problem was that nonexistent files (such as the added-tokens JSON in some of the tokenizers) caused a "breaking" exception offline but only a simple warning online, or when the local-files-only flag was set to true. As you said, the workaround is super simple (even just setting local_files_only to true fixes it), but it's just UX.
In the other issue, I proposed a simple (very naive) fix as a PR that circumvented this behavior, but I suspect it might break things elsewhere (and it would require changing a pipeline test).
Hi everybody, I am getting the same error. After digging a bit deeper, I believe that in recent versions (e.g., 4.8.x and 4.9.2) the caching mechanism crucially depends on having an Internet connection. I blame the function get_from_cache, which IMHO cannot work properly unless you always have Internet access. Some details are below.
Simple code to reproduce the effect:
```python
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained('roberta-base', unk_token='<unk>')
```
First, specifying the caching directory doesn't help, because get_from_cache computes the caching path using the so-called etag:

```python
filename = url_to_filename(url, etag)
```
I added code to print the filename, the URL, and the etag. When the Internet is available, we get:
```
### url: https://huggingface.co/roberta-base/resolve/main/config.json etag: "8db5e7ac5bfc9ec8b613b776009300fe3685d957" filename: 733bade19e5f0ce98e6531021dd5180994bb2f7b8bd7e80c7968805834ba351e.35205c6cfc956461d8515139f0f8dd5d207a2f336c0c3a83b4bc8dca3518e37b
### url: https://huggingface.co/roberta-base/resolve/main/vocab.json etag: "5606f48548d99a9829d10a96cd364b816b02cd21" filename: d3ccdbfeb9aaa747ef20432d4976c32ee3fa69663b379deb253ccfce2bb1fdc5.d67d6b367eb24ab43b08ad55e014cf254076934f71d832bbab9ad35644a375ab
### url: https://huggingface.co/roberta-base/resolve/main/merges.txt etag: "226b0752cac7789c48f0cb3ec53eda48b7be36cc" filename: cafdecc90fcab17011e12ac813dd574b4b3fea39da6dd817813efa010262ff3f.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
### url: https://huggingface.co/roberta-base/resolve/main/tokenizer.json etag: "ad0bcbeb288f0d1373d88e0762e66357f55b8311" filename: d53fc0fa09b8342651efd4073d75e19617b3e51287c2a535becda5808a8db287.fc9576039592f026ad76a1c231b89aee8668488c671dfbe6616bab2ed298d730
### url: https://huggingface.co/roberta-base/resolve/main/config.json etag: "8db5e7ac5bfc9ec8b613b776009300fe3685d957" filename: 733bade19e5f0ce98e6531021dd5180994bb2f7b8bd7e80c7968805834ba351e.35205c6cfc956461d8515139f0f8dd5d207a2f336c0c3a83b4bc8dca3518e37b
```
Then I disconnect the Internet. The files are now cached and should be accessible just fine. So we retry creating the tokenizer, but it fails, because without the etag we generate a very different filename:
```
### url: https://huggingface.co/roberta-base/resolve/main/tokenizer_config.json etag: None filename: dfe8f1ad04cb25b61a647e3d13620f9bf0a0f51d277897b232a5735297134132
```
The function get_from_cache has the parameter local_files_only. When it's true, the etag is not computed. However, it is not clear how to use this to enable offline creation of resources after they have been downloaded once.
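For reference, url_to_filename in transformers' file_utils.py looks roughly like this (an approximate reconstruction, not the exact source; the real function also special-cases .h5 URLs). It shows why the offline lookup computes a different name than the one the file was stored under:

```python
from hashlib import sha256

def url_to_filename(url, etag=None):
    # The cache filename always starts with a hash of the URL.
    filename = sha256(url.encode("utf-8")).hexdigest()
    if etag:
        # Online: the etag hash is appended -> "<url_hash>.<etag_hash>".
        filename += "." + sha256(etag.encode("utf-8")).hexdigest()
    # Offline: etag is None, so only "<url_hash>" is returned -- a name
    # under which nothing was ever stored.
    return filename
```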
Thank you!
@searchivarius local_files_only should indeed work. You can add it to your from_pretrained calls, e.g.:

```python
tok = AutoTokenizer.from_pretrained('roberta-base', unk_token='<unk>', local_files_only=True)
```
That's the very hands-on, manual way to do this for each of your model, config, and tokenizer inits. You can also set this globally; see https://github.com/huggingface/transformers/blob/master/docs/source/installation.md#offline-mode
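For instance, a minimal sketch of the global variant (per the offline-mode docs linked above; the variable has to be set before transformers is imported, since it is read at import time):

```python
import os

# Tell transformers to rely only on locally cached files.
os.environ["TRANSFORMERS_OFFLINE"] = "1"

from transformers import AutoTokenizer

# Works without a connection, provided the files were cached beforehand.
tok = AutoTokenizer.from_pretrained("roberta-base", unk_token="<unk>")
```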
Hi @BramVanroy, thanks a lot! TRANSFORMERS_OFFLINE indeed resolves the issue!
It seems very strange to me that local_files_only=True still doesn't work for me, even though it works for BertConfig.from_pretrained.
I have to follow what this comment does: https://github.com/huggingface/transformers/issues/12571#issuecomment-901280736
I am trying to first download and cache the GPT2 tokenizer, to use it on an instance that does not have an internet connection. I am able to download the tokenizer on my EC2 instance that does have an internet connection, but when I copy over the directory to my instance that does not have one, it gives a connection error.
The issue seems to affect only the tokenizer, not the model.
Environment info
transformers version: 4.8.1

Who can help
Models:

Information
Tokenizer/model I am using (GPT2, microsoft/DialogRPT-updown):
The problem arises when using:
The tasks I am working on are:
To reproduce
Steps to reproduce the behavior:
On my EC2 instance that has an internet connection, I run the command sketched below.
On my EC2 instance which does not have an internet connection, I run the same command.
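Presumably the calls looked something like this (a hypothetical sketch; the exact command and the cache path are assumptions, not quoted from the report):

```python
from transformers import GPT2Tokenizer

# Connected instance: downloads the tokenizer files into the default
# cache (~/.cache/huggingface/transformers for this version).
tok = GPT2Tokenizer.from_pretrained("gpt2")

# Offline instance: after copying the cache directory over, the same
# call fails with the traceback below.
tok = GPT2Tokenizer.from_pretrained("gpt2")
```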
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1680, in from_pretrained
    user_agent=user_agent,
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/file_utils.py", line 1337, in cached_path
    local_files_only=local_files_only,
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/file_utils.py", line 1553, in get_from_cache
    "Connection error, and we cannot find the requested files in the cached path."
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.
```
It also does not work with AutoTokenizer.
Expected behavior
After doing some digging, it turns out it is looking for the added_tokens_file, which does not exist. The vocab_file does exist.
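A hypothetical, simplified sketch of the failure mode being described (for illustration only, not the actual transformers code): an optional file such as added_tokens.json is tolerated when it is missing online, but the offline code path cannot tell 'missing on the Hub' apart from 'not yet cached', so the same absence becomes fatal:

```python
import os

def resolve_optional_file(filename, cache_dir, online):
    # Hypothetical helper, for illustration only.
    path = os.path.join(cache_dir, filename)
    if os.path.exists(path):
        return path    # a cached copy exists: fine either way
    if online:
        return None    # online: a 404 is downgraded to a warning
    # offline: the absence of the optional file raises instead
    raise ValueError(
        "Connection error, and we cannot find the requested files "
        "in the cached path."
    )
```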