huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

modeling utils don't respect `subfolder` setting when looking up model #13895

Closed stas00 closed 3 years ago

stas00 commented 3 years ago

Not sure yet if it affects all 3 lookups, but we have a repo where the model is in a sub-folder and this fails:

$ python -c "from transformers import AutoTokenizer, AutoModel; mname = 'bigscience/tr3d-1B3-oscar-checkpoints'; ckpt='global_step117000'; AutoModel.from_pretrained(mname, revision=ckpt, subfolder=ckpt)"
2021-10-05 17:02:26.222006: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
404 Client Error: Not Found for url: https://huggingface.co/bigscience/tr3d-1B3-oscar-checkpoints/resolve/global_step117000/config.json
Traceback (most recent call last):
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/configuration_utils.py", line 546, in get_config_dict
    resolved_config_file = cached_path(
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/file_utils.py", line 1402, in cached_path
    output_path = get_from_cache(
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/file_utils.py", line 1574, in get_from_cache
    r.raise_for_status()
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/requests/models.py", line 943, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/bigscience/tr3d-1B3-oscar-checkpoints/resolve/global_step117000/config.json

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/models/auto/auto_factory.py", line 396, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/models/auto/configuration_auto.py", line 527, in from_pretrained
    config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/configuration_utils.py", line 570, in get_config_dict
    raise EnvironmentError(msg)
OSError: Can't load config for 'bigscience/tr3d-1B3-oscar-checkpoints'. Make sure that:

- 'bigscience/tr3d-1B3-oscar-checkpoints' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'bigscience/tr3d-1B3-oscar-checkpoints' is the correct path to a directory containing a config.json file

- or 'global_step117000' is a valid git identifier (branch name, a tag name, or a commit id) that exists for this model name as listed on its model page on 'https://huggingface.co/models'

It's trying to fetch: https://huggingface.co/bigscience/tr3d-1B3-oscar-checkpoints/resolve/global_step117000/config.json but this path is incorrect - it misses the subfolder. The correct path is: https://huggingface.co/bigscience/tr3d-1B3-oscar-checkpoints/resolve/global_step117000/global_step117000/config.json
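To make the path issue concrete, here is a minimal sketch of how the hub "resolve" URL gets built and where the subfolder would need to be spliced in. `hf_hub_url` below is an illustration, not the actual transformers implementation:

```python
# Hypothetical sketch of hub URL construction; not the real transformers code.
HUB = "https://huggingface.co"

def hf_hub_url(repo_id, filename, revision="main", subfolder=None):
    # If a subfolder is given, it must be prefixed to the filename before
    # building the URL - this is the step the current lookup skips.
    if subfolder:
        filename = f"{subfolder}/{filename}"
    return f"{HUB}/{repo_id}/resolve/{revision}/{filename}"

# What from_pretrained currently requests (404 - subfolder ignored):
print(hf_hub_url("bigscience/tr3d-1B3-oscar-checkpoints", "config.json",
                 revision="global_step117000"))
# What it would need to request to respect the subfolder:
print(hf_hub_url("bigscience/tr3d-1B3-oscar-checkpoints", "config.json",
                 revision="global_step117000", subfolder="global_step117000"))
```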

@LysandreJik, @sgugger

sgugger commented 3 years ago

This is not a bug, the from_pretrained method does not inspect subfolders of repos to create a model. In the branch "global_step100500", the model file and its config should be at the root of the repo, not in a global_step100500 subfolder.

Why use branch plus subfolders? It's as if we had a folder for each version of Transformers on top of using git branches.

stas00 commented 3 years ago

Because the model generated checkpoints as:

checkpoints/global_step100500
checkpoints/global_step101500
[...]

and we remapped each sub-folder to a branch.

All the checkpoints were auto-converted from Megatron to HF in one swoop - 200 checkpoints - and then uploaded to hub.


the from_pretrained method does not inspect subfolders of repos to create a model

we aren't creating a model, but trying to load it.

what is the point of the subfolder arg then?

any objection to making it support subfolders on the hub? I can easily see a use-case where one would use multiple sub-folders in a single branch rather than branches.

sgugger commented 3 years ago

I am confused, which subfolder argument are you talking about? This is not an argument accepted by the from_pretrained method.

stas00 commented 3 years ago

oh, do you mean that AutoModel.from_pretrained(mname, revision=ckpt, subfolder=ckpt) just ignores the subfolder kwarg?

I had someone report this to me as an Issue; I assumed there was a subfolder arg since they were trying to use it.

sgugger commented 3 years ago

Yes, there is no subfolder argument. I am not super enthusiastic about the idea of adding it, since we already have branches to deal with versions, and one can always clone the repo and then use a local path to access subfolders.

In your use case, the layout of your repo is clearly overkill, with branches and subfolders doing the same thing.

sgugger commented 3 years ago

(cc @julien-c @LysandreJik @patrickvonplaten if we want to open the discussion on that feature)

stas00 commented 3 years ago

OK, so there is subfolder in tokenizer https://huggingface.co/transformers/model_doc/auto.html#autotokenizer

I had no idea we had it. So it means that the feature is either inconsistent with the other 2 or confusing to the user, or both. As I said I'm just posting an Issue on behalf of someone reporting this problem.

The relevant doc entry at that URL is:

subfolder (str, optional): In case the relevant files are located inside a subfolder of the model repo on huggingface.co (e.g. for facebook/rag-token-base), specify it here.

but I think it may mean something else. I'm not sure how to decipher the example - subfolder where? Is rag-token-base the implied subfolder of facebook? Probably not, since that's the model_or_path arg already.

Should the example say:

In case the relevant files are located inside a subfolder of the model repo on huggingface.co (e.g. for facebook/rag-token-base/special/), specify it here.

If it is an actual sub-folder then config's and model's from_pretrained should mirror it, since all model files reside together.

stas00 commented 3 years ago

And also, didn't we change the code a while back to assert if a kwarg is passed but not fully consumed, so that all keys are accounted for?

If not, then we have a problem: if a user makes a typo in the arg name and there is no assert, they could be under the false impression that the argument is taking effect when in fact it isn't.

I ran into this with the DeepSpeed config file, which isn't validated; when a typo happens it's usually quite painful, since the problem doesn't get noticed early enough and DeepSpeed moves on with defaults.

update: indeed:

AutoModel.from_pretrained(mname, reviison=ckpt) # dyslexic typist

Throws no exception and it definitely should. Should I file a separate Issue about it?
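The fix being asked for here could look like the following minimal sketch (the function and the set of known kwargs are illustrative, not the actual transformers code): validate the kwargs so a typo such as `reviison` fails loudly instead of being silently ignored.

```python
# Illustrative sketch of kwargs validation; names are assumptions,
# not the real transformers implementation.
_KNOWN_KWARGS = {"revision", "subfolder", "cache_dir", "force_download"}

def check_from_pretrained_kwargs(**kwargs):
    """Raise on any keyword argument that from_pretrained would not consume."""
    unexpected = set(kwargs) - _KNOWN_KWARGS
    if unexpected:
        raise TypeError(
            f"from_pretrained() got unexpected keyword argument(s): "
            f"{sorted(unexpected)}"
        )

check_from_pretrained_kwargs(revision="global_step117000")  # ok, no error
try:
    check_from_pretrained_kwargs(reviison="global_step117000")  # the typo
except TypeError as e:
    print(e)
```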

patrickvonplaten commented 3 years ago

I think the subfolder for tokenizers was added for multi-tokenizers models such as RAG where two tokenizers are required and it's impossible to have both tokenizers at the root since they would overwrite each other.
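The collision Patrick describes can be sketched as follows. The layout is inferred from the docs' facebook/rag-token-base example, so treat the exact subfolder and file names as assumptions:

```python
# Assumed layout of a RAG-style repo: two tokenizers whose files would
# clash at the repo root, kept apart by subfolders.
RAG_LAYOUT = {
    "question_encoder_tokenizer": ["tokenizer_config.json", "vocab.txt"],
    "generator_tokenizer": ["tokenizer_config.json", "merges.txt"],
}

def resolved_files(subfolder):
    """Repo-relative paths a subfolder-aware lookup would request."""
    return [f"{subfolder}/{name}" for name in RAG_LAYOUT[subfolder]]

# Both tokenizers ship a tokenizer_config.json; only the subfolder prefix
# keeps them from overwriting each other:
print(resolved_files("question_encoder_tokenizer"))
print(resolved_files("generator_tokenizer"))

# The tokenizer-level usage would then be (requires network, not run here):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("facebook/rag-token-base",
#                                     subfolder="question_encoder_tokenizer")
```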

LysandreJik commented 3 years ago

I think this subfolder argument made sense for complex models such as RAG (and it could also make sense for complex encoder-decoders which don't use the same tokenizer for the encoder and decoder), but I don't see any need for that feature regarding models.

@stas00 your point regarding the exception that isn't raised with a wrong argument is correct. This should ideally raise an error.

stas00 commented 3 years ago

Aha! Thank you for the explanation, Patrick!

So now a user starting with a tokenizer discovers subfolder and thinks the hub will support it for a model and config - but they don't, and they fail silently, emitting a misleading error message.

So I propose that we either:

  1. support subfolder for config/model from_pretrained so that the 3 are in sync.
  2. fix from_pretrained to validate its kwargs and assert on unexpected ones

I think we have to do the latter sooner or later, regardless of the decision wrt subfolder.

It's totally fine if we don't support subfolder for BigScience - we will just have to re-work how we do things.

patrickvonplaten commented 3 years ago

Most people aren't aware of it and it's really only for special cases - we maybe shouldn't even have added it.

Overall I'm also not super excited about a subfolder argument for models. Wouldn't it be enough to use the branching system for that, like we do for GPT-J: https://huggingface.co/EleutherAI/gpt-j-6B

Is the problem here that switching between branches takes a ridiculous amount of time?

Overall, I think it's quite important that we make the hub as user-friendly as possible not just for after the model has been trained, but also for during training - so happy to discuss this a bit more!

patrickvonplaten commented 3 years ago

Aha! Thank you for the explanation, Patrick!

So now a user starting with a tokenizer discovers subfolder and thinks the hub will support it for a model and config - but they don't, and they fail silently, emitting a misleading error message.

So I propose that we either:

  1. support subfolder for config/model from_pretrained so that the 3 are in sync.
  2. fix from_pretrained to validate its kwargs and assert on unexpected ones

I think we have to do the latter sooner or later, regardless of the decision wrt subfolder.

It's totally fine if we don't support subfolder for BigScience - we will just have to re-work how we do things.

2) We should definitely do this! Regarding 1), maybe we can also deprecate subfolder or make it private...

stas00 commented 3 years ago

OK, I split off a dedicated issue to validate kwargs https://github.com/huggingface/transformers/issues/13912

We can close this one - unless you want to discuss removing support to subfolder for the tokenizer here.

Thank you for the commentary.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.