huggingface / lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
MIT License

Tokenization-wise encoding #287

Closed hynky1999 closed 2 months ago

hynky1999 commented 2 months ago

What does this implement/fix? Explain your changes.

This PR adds a new way to do tokenization.

NathanHB commented 2 months ago

Thanks for the PR, looks great :)

> The input parsing for models is really scary to me, it's very easy to make a mistake imo. I think we do have some shared interface (batch_size, add_bos_token, tokenizer...) for most of the inference models. Would be nice to refactor in the future and have something like TokensLightevalModel and TextLightevalModel (if we allow closed models in the future, we won't be the ones doing the batching/tokenization). Won't be doing that in this PR for sure.

What do you mean by input parsing interface?

hynky1999 commented 2 months ago

Pretty much this function: https://github.com/huggingface/lighteval/blob/config_templates/src/lighteval/models/model_config.py#L288

I find it scary because when someone adds new args or a new model config there is no reference config; pretty much everything is accessed through untyped dict lookups. And we actually do have such configs, it's all those XConfig classes, but we can't directly parse into them because the args are flattened. I think that's because the CLI interface was introduced first and the .yaml configs were added later.
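
To illustrate (a hypothetical sketch, not lighteval's actual code): with one flat untyped dict, defaults get duplicated at every call site and typos only surface at runtime:

```python
# Hypothetical sketch of the flat-dict pattern, not lighteval's actual code.
args_dict = {"pretrained": "llama", "batch_size": 8, "add_bos_token": True}

batch_size = args_dict.get("batch_size", 1)     # default duplicated at every call site
add_bos = args_dict.get("add_bos_tokn", False)  # silent typo: always falls back to False
# args_dict["dtype"]                            # a missing key only fails at runtime
```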

Having a hierarchical config as the reference is really nice (rough sketch after the list below) because:

  1. It's super easy to document (you literally just reference the dataclass and don't need to update the docs manually)
  2. You get nice error messages when you use an incorrect config (because whatever library you use does that for you)
  3. Enforcing required keys and defaults is much easier.
  4. We could actually merge the nanotron config and the accelerate workflow together, as nanotron would be yet another model config
  5. You can share some common stuff between the configs (generation args, etc.)
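
Roughly something like this (class and field names are made up, not the actual lighteval API):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GenerationArgs:
    # shared between all model configs (point 5)
    temperature: float = 1.0
    max_new_tokens: Optional[int] = None


@dataclass
class PretrainedModelConfig:
    name: str  # required key, enforced by the dataclass itself (point 3)
    batch_size: int = 1
    add_bos_token: bool = False
    generation: GenerationArgs = field(default_factory=GenerationArgs)


@dataclass
class NanotronModelConfig:
    checkpoint_path: str  # point 4: nanotron becomes just another model config
    generation: GenerationArgs = field(default_factory=GenerationArgs)
```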

Now, the reason why we use flattened configs is likely the CLI interface, right? I think it would make sense to either:

  1. If one uses the CLI, they can pass hierarchical args using a `.` syntax, e.g. `model=pretrained, model.name=llama` (see the sketch after this list)
  2. Make the CLI interface minimal and parse it into the hierarchical config
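
A naive sketch of how option 1 could fold the dotted syntax into a nested dict (libraries like OmegaConf already offer this kind of dotlist parsing; a real version would also need to handle group selection like `model=pretrained` and type coercion):

```python
# Naive sketch: fold dotted CLI args into a nested dict that a config
# library could then parse into the typed dataclasses sketched above.
def parse_dotted_args(pairs: list[str]) -> dict:
    config: dict = {}
    for pair in pairs:
        key, value = pair.split("=", 1)
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value  # values stay strings here; coercion is left out
    return config


# -> {'model': {'name': 'llama', 'batch_size': '8'}}
print(parse_dotted_args(["model.name=llama", "model.batch_size=8"]))
```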

It's not a feature, but something that would make maintenance much easier imo, so it doesn't have much priority.

NathanHB commented 2 months ago

I see! I agree with you. I recently added vllm models and they have different configs than the base models; our current system is hard to document and clunky for users. I like the idea of having model=pretrained, model.name=llama in the CLI, but it would make the CLI command much longer to type.

What do you mean by "Make the CLI interface minimal and parse it into the hierarchical config"?

hynky1999 commented 2 months ago

> I see! I agree with you. I recently added vllm models and they have different configs than the base models; our current system is hard to document and clunky for users. I like the idea of having model=pretrained, model.name=llama in the CLI, but it would make the CLI command much longer to type.
>
> What do you mean by "Make the CLI interface minimal and parse it into the hierarchical config"?

  1. We would pick a minimal usable interface for each model (e.g. for pretrained it could be just the name) and only parse that from the CLI; this makes it a bit easier to maintain
  2. We change the flattened config to a hierarchical one, so that we can re-use stuff

I like the first method more tho. Roughly, it could look like this (names are hypothetical):
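
```python
import argparse
from dataclasses import dataclass


# Hypothetical minimal config, same idea as the dataclasses sketched above.
@dataclass
class PretrainedModelConfig:
    name: str
    batch_size: int = 1  # deliberately not settable from the CLI in this sketch


def build_config(argv: list[str] | None = None) -> PretrainedModelConfig:
    # only the essentials are exposed on the command line
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-name", required=True)
    args = parser.parse_args(argv)
    # everything else falls back to the defaults defined on the dataclass
    return PretrainedModelConfig(name=args.model_name)


config = build_config(["--model-name", "llama"])
```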