Closed: stas00 closed this issue 3 years ago
This is not a bug: the `from_pretrained` method does not inspect subfolders of repos to create a model. In the branch `global_step100500`, the model file and its config should be at the root of the repo, not in a `global_step100500` subfolder.
Why use branches plus subfolders? It's as if we had a folder for each version of Transformers on top of using git branches.
Because the model generated checkpoints as:

```
checkpoints/global_step100500
checkpoints/global_step101500
[...]
```

and we remapped each sub-folder to a branch. All the checkpoints were auto-converted from Megatron to HF in one swoop - 200 checkpoints - and then uploaded to the hub.
> the from_pretrained method does not inspect subfolders of repos to create a model

We aren't creating a model, but trying to load it.
What is the point of the `subfolder` arg then?

Any objection to making it support subfolders on the hub? Surely, I can easily see a use case where one would use multiple sub-folders in a single branch and not necessarily use branches.
I am confused, which `subfolder` argument are you talking about? This is not an argument accepted by the `from_pretrained` method.
Oh, do you mean that `AutoModel.from_pretrained(mname, revision=ckpt, subfolder=ckpt)` just ignores the `subfolder` kwarg?
I had someone report this to me as an Issue; I assumed there was a `subfolder` arg since they were trying to use it.
Yes, there is no `subfolder` argument. I am not super enthusiastic about the idea of adding it, since we already have branches to deal with versions, and one can always clone the repo and then use a local path to access subfolders.

In your use case, it is clearly overkill to create your repo with branches and subfolders doing the same thing.
(cc @julien-c @LysandreJik @patrickvonplaten if we want to open the discussion on that feature)
OK, so there is a `subfolder` argument in the tokenizer: https://huggingface.co/transformers/model_doc/auto.html#autotokenizer

I had no idea we had it. So it means that the feature is either inconsistent with the other 2, or confusing to the user, or both. As I said, I'm just posting an Issue on behalf of someone reporting this problem.
The relevant doc entry at that URL is:

> subfolder (str, optional): In case the relevant files are located inside a subfolder of the model repo on huggingface.co (e.g. for facebook/rag-token-base), specify it here.
but I think it may mean something else. I'm not sure how to decipher the example - `subfolder` where? Is `rag-token-base` the implied subfolder of `facebook`? Probably not, since that's the `model_or_path` arg already.
Should the example say:

> In case the relevant files are located inside a subfolder of the model repo on huggingface.co (e.g. for facebook/rag-token-base/special/), specify it here.
If it is an actual sub-folder, then the config's and model's `from_pretrained` should mirror it, since all model files reside together.
And also, didn't we change the code a while back to assert if a kwarg is passed but not fully consumed, so that all keys are accounted for? If not, then we have a problem: if a user makes a typo in the arg name and there is no assert, they could be under the false impression that they are doing what they want when in fact they aren't.

I ran into this with the DeepSpeed config file, where they don't validate it; when a typo happens it's usually quite painful, since the problem doesn't get noticed early enough as DeepSpeed moves on with defaults.
update: indeed:

```python
AutoModel.from_pretrained(mname, reviison=ckpt)  # dyslexic typist
```

throws no exception, and it definitely should. Should I file a separate Issue about it?
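A minimal sketch of the validation being asked for here - this is a hypothetical wrapper, not the actual `from_pretrained` signature - where any unrecognized keyword raises immediately instead of being silently dropped:

```python
def from_pretrained_strict(model_name, *, revision=None, subfolder=None, **kwargs):
    # Hypothetical: reject anything that is not a known parameter, so a typo
    # like `reviison=` fails fast instead of being silently ignored.
    if kwargs:
        raise TypeError(
            f"from_pretrained() got unexpected keyword arguments: {sorted(kwargs)}"
        )
    return {"model_name": model_name, "revision": revision, "subfolder": subfolder}
```

With this in place, `from_pretrained_strict(mname, reviison=ckpt)` raises a `TypeError` naming the bad argument, while the correctly spelled `revision=` goes through.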
I think the `subfolder` for tokenizers was added for multi-tokenizer models such as RAG, where two tokenizers are required and it's impossible to have both tokenizers at the root since they would overwrite each other.

I think this `subfolder` argument made sense for complex models such as RAG (and it could also make sense for complex encoder-decoders which don't use the same tokenizer for the encoder and decoder), but I don't see any need for that feature regarding models.
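The overwrite problem described above can be sketched like this - the subfolder names below are illustrative of a RAG-style repo layout, not taken from the source; the point is that both tokenizers ship files with the same basename, so they could not coexist at the repo root:

```python
from collections import Counter
from pathlib import PurePosixPath

def colliding_basenames(repo_files):
    # Basenames that appear more than once: these files could not coexist
    # at the repo root if the subfolders were flattened away.
    names = [PurePosixPath(f).name for f in repo_files]
    return sorted(name for name, count in Counter(names).items() if count > 1)

repo_files = [
    "question_encoder_tokenizer/tokenizer_config.json",  # illustrative names
    "generator_tokenizer/tokenizer_config.json",
    "config.json",
]
# colliding_basenames(repo_files) -> ["tokenizer_config.json"]
```

Keeping each tokenizer in its own subfolder makes the full paths distinct, which is exactly what the tokenizer's `subfolder` argument relies on.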
@stas00 your point regarding the exception that isn't raised with a wrong argument is correct. This should ideally raise an error.
Aha! Thank you for the explanation, Patrick!

So now a user starting with a tokenizer discovers `subfolder` and thinks the hub will support it for a model and config - and it doesn't, and it fails in a silent way, emitting a misleading error message.

So I propose that we either:

- support `subfolder` for config/model `from_pretrained` so that the 3 are in sync, or
- fix `from_pretrained` to validate its kwargs and assert on unexpected ones.

I think we have to do the latter sooner or later regardless of the decision wrt `subfolder`.

It's totally fine if we don't support `subfolder` for BigScience - we will just have to re-work how we do things.
Most people aren't aware of it and it's really only for special cases - maybe we shouldn't even have added it.

Overall I'm also not super excited about a `subfolder` argument for models - wouldn't it be enough to use the branching system for that, like we do for GPT-J: https://huggingface.co/EleutherAI/gpt-j-6B? Is the problem here that switching between branches takes a ridiculous amount of time?

Overall, I think it's quite important that we make the hub as user-friendly as possible, not just after the model has been trained but also during training - so happy to discuss this a bit more!
> So I propose that we either:
> 1. support `subfolder` for config/model `from_pretrained` so that the 3 are in sync, or
> 2. fix `from_pretrained` to validate its kwargs and assert on unexpected ones.
2) We should definitely do this! Regarding 1) - maybe we can also deprecate `subfolder` or make it private...
OK, I split off a dedicated issue to validate kwargs: https://github.com/huggingface/transformers/issues/13912

We can close this one - unless you want to discuss removing support for `subfolder` for the tokenizer here.
Thank you for the commentary.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Not sure yet if it's all 3 lookups, but we have a repo where the model is in a sub-folder, and this fails:

It's trying to check https://huggingface.co/bigscience/tr3d-1B3-oscar-checkpoints/resolve/global_step117000/config.json but this is an incorrect path - it misses the subfolder. The correct path is: https://huggingface.co/bigscience/tr3d-1B3-oscar-checkpoints/resolve/global_step117000/global_step117000/config.json

@LysandreJik, @sgugger
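A sketch of how the resolve URL would need to be built for this to work - `hub_resolve_url` is a hypothetical helper, and the `subfolder` parameter is the feature under discussion (assumed to simply prefix the filename within the chosen revision):

```python
def hub_resolve_url(repo_id, filename, revision="main", subfolder=None):
    # The hub serves files at /<repo_id>/resolve/<revision>/<path>; a
    # subfolder (hypothetical for models here) nests the path one level.
    path = f"{subfolder}/{filename}" if subfolder else filename
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{path}"

# Without the subfolder, the lookup misses:
#   .../tr3d-1B3-oscar-checkpoints/resolve/global_step117000/config.json
# With it, the path matches where the files actually live:
#   .../tr3d-1B3-oscar-checkpoints/resolve/global_step117000/global_step117000/config.json
```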