Closed: hynky1999 closed this 2 months ago
Thanks for the PR, looks great :)
The input parsing for models is really scary to me; it's very easy to make a mistake imo. I think we do have some shared interface (batch_size, add_bos_token, tokenizer, ...) for most of the inference models. It would be nice to refactor in the future and have something like TokensLightevalModel and TextLightevalModel (if we allow closed models in the future, we won't be the ones doing the batching/tokenization). Won't be doing that in this PR for sure.
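Roughly what I have in mind, as a sketch only (the class and method names here are made up, not actual lighteval APIs):

```python
from abc import ABC, abstractmethod


class LightevalModel(ABC):
    """Interface every model implements, regardless of who tokenizes."""

    @abstractmethod
    def greedy_until(self, requests: list) -> list:
        ...


class TokensLightevalModel(LightevalModel):
    """Open models: we own the tokenizer, batching, BOS handling, etc."""

    def __init__(self, tokenizer, batch_size: int = 1, add_bos_token: bool = False):
        self.tokenizer = tokenizer
        self.batch_size = batch_size
        self.add_bos_token = add_bos_token

    def greedy_until(self, requests: list) -> list:
        raise NotImplementedError("tokenize, batch, and run the model here")


class TextLightevalModel(LightevalModel):
    """Closed/API models: we send raw text, the provider batches/tokenizes."""

    def greedy_until(self, requests: list) -> list:
        raise NotImplementedError("call the provider's API here")
```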
What do you mean by input parsing interface?
Pretty much this function: https://github.com/huggingface/lighteval/blob/config_templates/src/lighteval/models/model_config.py#L288
I find it scary because when someone adds a new arg or a new model config there is no reference config; pretty much everything is accessed through untyped dict access.
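To illustrate the failure mode (a hypothetical snippet, not the actual code behind that link):

```python
from dataclasses import dataclass


# Untyped dict access: mistakes only surface at runtime.
def create_model(args: dict):
    batch_size = args["batch_size"]      # KeyError if the key is absent
    add_bos = args.get("add_bos_tokne")  # typo: silently returns None
    return batch_size, add_bos


# A typed config is a single reference for every field.
@dataclass
class BaseModelConfig:
    pretrained: str
    batch_size: int = 1
    add_bos_token: bool = False


def create_model_typed(cfg: BaseModelConfig):
    return cfg.batch_size, cfg.add_bos_token  # typos caught by linters/IDEs
```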
And we actually do have such configs: all those XConfig classes. But we can't directly parse into them because they are flattened; I think the CLI interface came first and the .yaml configs were added later.
Having a hierarchical config as the reference would be really nice: it gives new args a single typed home and lets us re-use shared fields.
Now, the reason we use flattened configs is likely the CLI interface, right? I think it would make sense to either make the CLI interface minimal and parse it into the hierarchical config, or change the flattened config to hierarchical.
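As a sketch of the flattened vs. hierarchical shapes (the field names are illustrative, not the real ones):

```python
from dataclasses import dataclass, field


# Flattened: every model type's args share one namespace,
# disambiguated only by naming conventions.
@dataclass
class FlatModelConfig:
    pretrained: str = ""
    batch_size: int = 1
    add_bos_token: bool = False
    vllm_gpu_memory_utilization: float = 0.9  # vllm-only, prefixed by hand


# Hierarchical: shared pieces are factored out and re-used per model type.
@dataclass
class TokenizationConfig:
    batch_size: int = 1
    add_bos_token: bool = False


@dataclass
class VLLMModelConfig:
    pretrained: str
    tokenization: TokenizationConfig = field(default_factory=TokenizationConfig)
    gpu_memory_utilization: float = 0.9
```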
It's not a feature, just something that would make maintenance much easier imo, so it doesn't have much priority.
I see! I agree with you. I recently added vllm models and they have different configs than the base models; our current system is hard to document and clunky for users. I like the idea of having `model=pretrained, model.name=llama` in the CLI, but it would make the CLI command much longer to type.
What do you mean by "Make the cli interface minimal and parse it into the hierarchical config"?
- We would select a minimal usable interface for each model (e.g. for pretrained it could be just the name) and only parse that from the CLI; this makes it a bit easier to maintain (sketched below)
- We would change the flattened config to hierarchical, so that we can re-use stuff
I like the first method more tho
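Concretely, the first option could look something like this (a rough sketch; the dotted-key syntax and the helper are hypothetical, not an existing lighteval CLI):

```python
def parse_model_args(spec: str) -> dict:
    """Turn 'model=pretrained, model.name=llama' into a nested dict."""
    out: dict = {}
    for pair in spec.split(","):
        key, value = pair.strip().split("=", 1)
        node = out
        *parents, leaf = key.split(".")
        for part in parents:
            child = node.get(part)
            if not isinstance(child, dict):
                # 'model=pretrained' followed by 'model.name=llama':
                # keep the earlier scalar under a reserved key.
                child = {} if child is None else {"_type": child}
                node[part] = child
            node = child
        node[leaf] = value
    return out


print(parse_model_args("model=pretrained, model.name=llama"))
# {'model': {'_type': 'pretrained', 'name': 'llama'}}
```

From there the nested dict could be validated into the hierarchical XConfig classes instead of being consumed as an untyped dict.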
What does this implement/fix? Explain your changes.
This PR adds a new way to do tokenization.
I only added this new param to BaseModel and nanotron, since for the others I haven't noticed us using the tokenization params. To me it's a weird API; I wouldn't deal with it before the above happens.
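For context, the pattern is roughly this (sketch only; I'm assuming an `add_special_tokens`-style flag, which may not be the exact name the PR uses):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")


def tok_encode(text: str, add_special_tokens: bool = True) -> list[int]:
    # The flag is threaded from the model config into every tokenizer
    # call, instead of each model class hard-coding its own behavior.
    return tokenizer(text, add_special_tokens=add_special_tokens)["input_ids"]


print(tok_encode("hello world", add_special_tokens=False))
```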