noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model
MIT License

relationship between underlying model quality and time to complete generation #20

Closed mcapizzi-cohere closed 10 months ago

mcapizzi-cohere commented 11 months ago

This is both (1) a very naive question and (2) probably best suited for another forum, but I'll ask it anyway.

Is there a relationship between (1) the underlying quality of the LLM used and (2) the time it takes to complete the prompt completion? Take two models for example:

  1. model one -> a very good model that, during greedy decoding, could complete the desired output correctly without any constraints
  2. model two -> a very "poor" model that can only complete the desired output with constraints

Will one of those models complete the generation faster?

This question reveals my lack of detailed understanding of both (1) greedy decoding in general and (2) this implementation, but I'd appreciate some more intuition on it, as it will help us decide whether it's "worth" using a stronger (most likely larger, w.r.t. parameter count) model in our application.
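For intuition on the timing side of the question: under greedy decoding, generation cost is essentially one forward pass per emitted token, regardless of how "good" the model's choices are. A minimal toy sketch (this is illustrative pseudocode with a fake model, not lm-format-enforcer code; `greedy_decode`, `fake_logits`, and the tiny vocabulary are invented for the example):

```python
# Toy greedy decoding loop: at each step, run the model once and take the
# single highest-scoring token. Cost scales with the number of tokens
# generated, not with model "quality".

def greedy_decode(logits_fn, max_steps, eos_token):
    tokens = []
    for _ in range(max_steps):
        logits = logits_fn(tokens)  # one model forward pass per step
        next_token = max(range(len(logits)), key=logits.__getitem__)
        if next_token == eos_token:
            break
        tokens.append(next_token)
    return tokens

# Fake "model" over vocabulary {0: <eos>, 1: 'a', 2: 'b'}: prefers 'a'
# until three tokens exist, then prefers <eos>.
def fake_logits(tokens):
    if len(tokens) >= 3:
        return [5.0, 1.0, 0.0]
    return [0.0, 5.0, 1.0]

print(greedy_decode(fake_logits, max_steps=10, eos_token=0))  # [1, 1, 1]
```

The practical consequence: where the two models can differ in wall-clock time is in how many tokens they emit before reaching a valid completion, and in their per-pass cost (a larger model's forward pass is slower).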

noamgat commented 10 months ago

The second model will likely perform faster. However, I have seen that the more interference the LM Format Enforcer has to apply, the more likely you are to get low-quality answers. You will have to judge on your specific use case. The LM Format Enforcer's performance footprint is the same regardless of how much it has to change the output. Also, it supports LLM features such as beam search, so you are not forced into greedy decoding with it.
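A toy sketch of why the enforcer's footprint doesn't depend on how often it overrides the model (this is an illustrative stand-in, not the actual lm-format-enforcer API; `constrained_greedy_step` and the tiny vocabulary are invented for the example): at every step the set of format-legal tokens is computed and all other logits are masked before the token is picked, so the same masking work happens whether or not the mask ends up changing the model's choice.

```python
import math

# Mask every token not allowed by the format to -inf, then pick greedily.
# The masking cost is the same whether the model's top token was legal
# (no interference) or illegal (the enforcer redirects the choice).
def constrained_greedy_step(logits, allowed_token_ids):
    masked = [l if i in allowed_token_ids else -math.inf
              for i, l in enumerate(logits)]
    return max(range(len(masked)), key=masked.__getitem__)

# The model strongly prefers token 2, but the format only allows {0, 1},
# so the best *allowed* token (1) is chosen instead.
logits = [0.5, 1.0, 9.0]
print(constrained_greedy_step(logits, {0, 1}))  # 1
print(constrained_greedy_step(logits, {0, 1, 2}))  # 2
```

This also illustrates the quality caveat above: the more often the mask forces the model away from its preferred token, the further the output drifts from what the model would naturally say.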