huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

AutoTokenizer not loading gpt2 model on instance without internet connection even after caching model #12571

Closed bhedayat closed 3 years ago

bhedayat commented 3 years ago

I am trying to download and cache the GPT2 tokenizer so I can use it on an instance that does not have an internet connection. I can download the tokenizer on my EC2 instance that does have internet access, but when I copy the cache directory over to the instance without a connection, loading it raises a connection error.

The issue seems to affect only the tokenizer, not the model.

Environment info

Who can help

Models:

Information

Tokenizer/Model I am using (GPT2, microsoft/DialogRPT-updown):

The problem arises when using:

The tasks I am working on is:

To reproduce

Steps to reproduce the behavior:

  1. On my EC2 instance that has an internet connection I run

    from transformers import GPT2Tokenizer
    tok = GPT2Tokenizer.from_pretrained("gpt2", cache_dir="<some_directory>")
  2. On my EC2 instance which does not have an internet connection I run the same command

    from transformers import GPT2Tokenizer
    tok = GPT2Tokenizer.from_pretrained("gpt2", cache_dir="<some_directory>")

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1680, in from_pretrained
        user_agent=user_agent,
      File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/file_utils.py", line 1337, in cached_path
        local_files_only=local_files_only,
      File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/file_utils.py", line 1553, in get_from_cache
        "Connection error, and we cannot find the requested files in the cached path."
    ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

It also does not work with AutoTokenizer.

Expected behavior

After doing some digging, the loader is looking for the added_tokens_file, which does not exist in the cache. The vocab_file does exist.

bhedayat commented 3 years ago

Seem to have fixed it by following https://github.com/huggingface/transformers/issues/9687 and using transformers 4.5.1 instead.

ManuelFay commented 3 years ago

Same problem as #12536. @LysandreJik

yipenglinoe commented 3 years ago

I got the same error when loading the model "bert-base-uncased".

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

patrickvonplaten commented 3 years ago

Is this still a problem here? I can load the tokenizer, save it, and then load it again without an internet connection.

ManuelFay commented 3 years ago

Both linked issues were never fixed so I would say so


patrickvonplaten commented 3 years ago

A simple workaround would be to just do:

from transformers import GPT2Tokenizer
tok = GPT2Tokenizer.from_pretrained("gpt2", cache_dir="<some_directory>")
tok.save_pretrained("<some_directory>")
# later, offline: load directly from the saved directory
tok = GPT2Tokenizer.from_pretrained("<some_directory>")

and loading it from there without internet access. But I agree it would indeed be more user-friendly to allow this automatically once the tokenizer has been downloaded.

ManuelFay commented 3 years ago

I dug a bit more into it in the linked issue #12536 (now stale). The problem was that nonexistent files (such as added_tokens.json in some of the tokenizers) raised a breaking exception offline, but only produced a simple warning online or when the local_files_only flag was set to True. As you said, the workaround is super simple (even just setting local_files_only to True fixes it), but it's a UX issue.
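The distinction described above can be sketched with a hypothetical helper (this is an illustration of the idea, not the actual transformers code): optional files should resolve to None when absent, both online and offline, while only genuinely required files raise.

```python
import os

def resolve_tokenizer_file(path, required=True):
    # Optional tokenizer files (e.g. added_tokens.json) may legitimately be
    # absent from the cache; only required files (e.g. vocab.json) should
    # raise when missing. Offline and online then behave the same way.
    if os.path.isfile(path):
        return path
    if required:
        raise EnvironmentError(f"Missing required file: {path}")
    return None
```

The bug amounted to the offline code path treating every file as required.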

ManuelFay commented 3 years ago

In the other issue I proposed a simple (very naive) fix as a PR that circumvented this behavior, but I suspect it might break things elsewhere (and it would require changing a pipeline test).

searchivarius commented 3 years ago

Hi everybody, I am getting the same error. After digging a bit deeper, I believe the current caching mechanism crucially depends on an internet connection in recent versions, e.g., 4.8.x and 4.9.2. I blame the function get_from_cache, which IMHO cannot work properly unless you always have internet access. Some details are below.

Simple code to reproduce the effect:

from transformers import AutoTokenizer, AutoModel
tok = AutoTokenizer.from_pretrained('roberta-base', unk_token='<unk>')

First, specifying the cache directory doesn't help, because the function get_from_cache computes the cache path using the so-called etag:

filename = url_to_filename(url, etag)

I added a code to print the filename, the url, and the etag. When Internet is there, we get:

### url: https://huggingface.co/roberta-base/resolve/main/config.json etag: "8db5e7ac5bfc9ec8b613b776009300fe3685d957" filename: 733bade19e5f0ce98e6531021dd5180994bb2f7b8bd7e80c7968805834ba351e.35205c6cfc956461d8515139f0f8dd5d207a2f336c0c3a83b4bc8dca3518e37b
### url: https://huggingface.co/roberta-base/resolve/main/vocab.json etag: "5606f48548d99a9829d10a96cd364b816b02cd21" filename: d3ccdbfeb9aaa747ef20432d4976c32ee3fa69663b379deb253ccfce2bb1fdc5.d67d6b367eb24ab43b08ad55e014cf254076934f71d832bbab9ad35644a375ab
### url: https://huggingface.co/roberta-base/resolve/main/merges.txt etag: "226b0752cac7789c48f0cb3ec53eda48b7be36cc" filename: cafdecc90fcab17011e12ac813dd574b4b3fea39da6dd817813efa010262ff3f.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
### url: https://huggingface.co/roberta-base/resolve/main/tokenizer.json etag: "ad0bcbeb288f0d1373d88e0762e66357f55b8311" filename: d53fc0fa09b8342651efd4073d75e19617b3e51287c2a535becda5808a8db287.fc9576039592f026ad76a1c231b89aee8668488c671dfbe6616bab2ed298d730
### url: https://huggingface.co/roberta-base/resolve/main/config.json etag: "8db5e7ac5bfc9ec8b613b776009300fe3685d957" filename: 733bade19e5f0ce98e6531021dd5180994bb2f7b8bd7e80c7968805834ba351e.35205c6cfc956461d8515139f0f8dd5d207a2f336c0c3a83b4bc8dca3518e37b

Then I disconnect the internet. The files are now cached and should be accessible just fine.

So we retry creating the tokenizer, but it fails: without the etag, we generate a very different filename:

### url: https://huggingface.co/roberta-base/resolve/main/tokenizer_config.json etag: None filename: dfe8f1ad04cb25b61a647e3d13620f9bf0a0f51d277897b232a5735297134132
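The mismatch above can be reproduced with a short sketch of the hashing scheme (modeled on `url_to_filename` in `file_utils.py`; simplified, so treat it as an illustration rather than the exact library code):

```python
import hashlib

def url_to_filename(url, etag=None):
    # Cache filename = sha256(url), plus ".sha256(etag)" when an etag was
    # obtained from the server. Offline, no HEAD request is made, etag stays
    # None, and the computed name no longer matches the file cached online.
    filename = hashlib.sha256(url.encode("utf-8")).hexdigest()
    if etag is not None:
        filename += "." + hashlib.sha256(etag.encode("utf-8")).hexdigest()
    return filename

url = "https://huggingface.co/roberta-base/resolve/main/config.json"
online = url_to_filename(url, etag='"8db5e7ac5bfc9ec8b613b776009300fe3685d957"')
offline = url_to_filename(url)
# `online` is two sha256 hex digests joined by "."; `offline` is a single
# digest, so the lookup misses the cached file.
```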

The function get_from_cache has the parameter local_files_only. When it's True, the etag is not computed. However, it is not clear how to use this to enable offline creation of resources after they have been downloaded once.

Thank you!

BramVanroy commented 3 years ago

@searchivarius local_files_only should indeed work. You can add it to your from_pretrained calls, e.g.

tok = AutoTokenizer.from_pretrained('roberta-base', unk_token='<unk>', local_files_only=True)

That's the very hands-on, manual way to do this for each of your model, config, and tokenizer inits. You can also set this globally; see https://github.com/huggingface/transformers/blob/master/docs/source/installation.md#offline-mode
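Per the offline-mode docs linked above, the global switch is an environment variable; it has to be in place before transformers is first imported, e.g.:

```python
import os

# Equivalent to passing local_files_only=True to every from_pretrained call;
# must be set before the first `import transformers`.
os.environ["TRANSFORMERS_OFFLINE"] = "1"

# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("roberta-base")  # served from cache
```

Exporting `TRANSFORMERS_OFFLINE=1` in the shell before launching Python works the same way.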

searchivarius commented 3 years ago

Hi @BramVanroy thanks a lot, TRANSFORMERS_OFFLINE, indeed, resolves the issue!

Z-MU-Z commented 2 years ago

It seems very strange to me that local_files_only=True still doesn't work for me, even though it works for BertConfig.from_pretrained.

I had to follow what https://github.com/huggingface/transformers/issues/12571#issuecomment-901280736 does.