huggingface / lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
MIT License

Tokenization-wise encoding #287

Closed hynky1999 closed 2 months ago

hynky1999 commented 2 months ago

What does this implement/fix? Explain your changes.

This PR adds a new way to do tokenization.

NathanHB commented 2 months ago

Thanks for the PR, looks great :)

> The input parsing for models is really scary to me, it's very easy to make a mistake imo. I think we do have some shared interface (batch_size, add_bos_token, tokenizer...) for most of the inference models. Would be nice to refactor in the future and have something like TokensLightevalModel and TextLightevalModel (if we allow closed models in the future, we won't be the ones doing the batching/tokenization). Won't be doing that in this PR for sure.

What do you mean by input parsing interface?

hynky1999 commented 2 months ago

Pretty much this function: https://github.com/huggingface/lighteval/blob/config_templates/src/lighteval/models/model_config.py#L288

I find it scary because when someone adds new args or a new model config there is no reference config; pretty much everything is accessed through untyped dict lookups. And we actually do have such configs, it's all those XConfig classes, but we can't directly parse into them because the args are flattened. I think that's because the CLI interface was introduced first and the .yaml configs were added later.
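
To illustrate (a hypothetical sketch, not lighteval's actual code): with one flat untyped dict, defaults get duplicated at every call site and typos only surface at runtime:

```python
# Hypothetical sketch of the flat-dict pattern, not lighteval's actual code.
args_dict = {"pretrained": "llama", "batch_size": 8, "add_bos_token": True}

batch_size = args_dict.get("batch_size", 1)     # default duplicated at every call site
add_bos = args_dict.get("add_bos_tokn", False)  # silent typo: always falls back to False
# args_dict["dtype"]                            # a missing key only fails at runtime
```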

Having a hierarchical config as the reference is really nice (rough sketch after the list below) because:

  1. It's super easy to document (you literally just reference the dataclass and don't need to update the docs manually)
  2. You get nice error messages when you use an incorrect config (because whatever library you use does that for you)
  3. Enforcing required keys and defaults is much easier.
  4. We could actually merge the nanotron config and the accelerate workflow together, as nanotron would be yet another model config
  5. You can share some common stuff between the configs (generation args, etc.)
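
Roughly something like this (class and field names are made up, not the actual lighteval API):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GenerationArgs:
    # shared between all model configs (point 5)
    temperature: float = 1.0
    max_new_tokens: Optional[int] = None


@dataclass
class PretrainedModelConfig:
    name: str  # required key, enforced by the dataclass itself (point 3)
    batch_size: int = 1
    add_bos_token: bool = False
    generation: GenerationArgs = field(default_factory=GenerationArgs)


@dataclass
class NanotronModelConfig:
    checkpoint_path: str  # point 4: nanotron becomes just another model config
    generation: GenerationArgs = field(default_factory=GenerationArgs)
```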

Now, the reason why we use flattened configs is likely the CLI interface, right? I think it would make sense to either:

  1. If one uses the CLI, they can pass hierarchical args using a `.` syntax, e.g. `model=pretrained, model.name=llama` (see the sketch after this list)
  2. Make the CLI interface minimal and parse it into the hierarchical config
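
A naive sketch of how option 1 could fold the dotted syntax into a nested dict (libraries like OmegaConf already offer this kind of dotlist parsing; a real version would also need to handle group selection like `model=pretrained` and type coercion):

```python
# Naive sketch: fold dotted CLI args into a nested dict that a config
# library could then parse into the typed dataclasses sketched above.
def parse_dotted_args(pairs: list[str]) -> dict:
    config: dict = {}
    for pair in pairs:
        key, value = pair.split("=", 1)
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value  # values stay strings here; coercion is left out
    return config


# -> {'model': {'name': 'llama', 'batch_size': '8'}}
print(parse_dotted_args(["model.name=llama", "model.batch_size=8"]))
```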

It's not a feature, but something that would make maintenance much easier imo, so it doesn't have much priority.

NathanHB commented 2 months ago

I see! I agree with you. I recently added vllm models and they have different configs than the base models; our current system is hard to document and clunky for users. I like the idea of having model=pretrained, model.name=llama in the CLI, but it would make the CLI command much longer to type.

What do you mean by "Make the CLI interface minimal and parse it into the hierarchical config"?

hynky1999 commented 2 months ago

> I see! I agree with you. I recently added vllm models and they have different configs than the base models; our current system is hard to document and clunky for users. I like the idea of having model=pretrained, model.name=llama in the CLI, but it would make the CLI command much longer to type.
>
> What do you mean by "Make the CLI interface minimal and parse it into the hierarchical config"?

  1. We would pick a minimal usable interface for each model (e.g. for pretrained it could be just the name) and only parse that from the CLI; this makes it a bit easier to maintain
  2. We change the flattened config to a hierarchical one, so that we can re-use stuff

I like the first method more tho. Roughly, it could look like this (names are hypothetical):
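
```python
import argparse
from dataclasses import dataclass


# Hypothetical minimal config, same idea as the dataclasses sketched above.
@dataclass
class PretrainedModelConfig:
    name: str
    batch_size: int = 1  # deliberately not settable from the CLI in this sketch


def build_config(argv: list[str] | None = None) -> PretrainedModelConfig:
    # only the essentials are exposed on the command line
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-name", required=True)
    args = parser.parse_args(argv)
    # everything else falls back to the defaults defined on the dataclass
    return PretrainedModelConfig(name=args.model_name)


config = build_config(["--model-name", "llama"])
```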