noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model

Probable case of EOS being ignored #115

Closed CortexPE closed 5 days ago

CortexPE commented 1 week ago

I'm currently trying to integrate lm-format-enforcer into exllamav2, but generation never seems to stop until the max token limit is exhausted. This might be the same issue as #110.

For this test, I used the schemas and prompts from the README and from colab_exllamav2_integration.ipynb.

I'm not sure whether this issue only occurs with llama3 models, or whether it also affects mistral-based or llama2 models.
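For reference, here is roughly the setup I used, adapted from the README's AnswerFormat example and the integration notebook (model/tokenizer loading omitted; the build_token_enforcer_tokenizer_data helper name is taken from the integration notebook):

```python
from pydantic import BaseModel
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.exllamav2 import (
    ExLlamaV2TokenEnforcerFilter,
    build_token_enforcer_tokenizer_data,
)

# The README's example schema - the same fields appear in the outputs below.
class AnswerFormat(BaseModel):
    first_name: str
    last_name: str
    year_of_birth: int
    num_seasons_in_nba: int

schema = AnswerFormat.schema()

# tokenizer_data is built once from the ExLlamaV2 tokenizer (assumed loaded earlier).
tokenizer_data = build_token_enforcer_tokenizer_data(tokenizer)
filters = [ExLlamaV2TokenEnforcerFilter(JsonSchemaParser(schema), tokenizer_data)]
```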

Example output:

Control:

Here is the information about Michael Jordan in the requested format:

{
  "first_name": "Michael",
  "last_name": "Jordan",
  "year_of_birth": 1963,
  "num_seasons_in_nba": 15
}

Note: The year of birth given is accurate to the best of my knowledge, but please note that it may vary slightly depending on the source.<|eot_id|>

Response generated in 4.10 seconds, 82 tokens, 20.01 tokens/second

ExLlamaV2TokenEnforcerFilter(JsonSchemaParser(None), tokenizer_data):

{"first_name": "Michael", "last_name": "Jordan", "year_of_birth": 1963, "num_seasons_in_nba": 15} 

(... truncated random characters separated by 4 newlines ...)

Response generated in 17.49 seconds, 256 tokens, 14.63 tokens/second

ExLlamaV2TokenEnforcerFilter(JsonSchemaParser(schema), tokenizer_data):

{"first_name": "Michael", "last_name": "Jordan", "year_of_birth": 1963, "num_seasons_in_nba": 15} 

(... truncated random characters separated by 4 newlines ...)

Response generated in 14.98 seconds, 256 tokens, 17.09 tokens/second

Colab Notebook:

https://colab.research.google.com/drive/1hOzIQFvW9xw9bEL86A39497cb3boP6Ou?usp=sharing

CortexPE commented 1 week ago

The problem also occurs with mistral models.

noamgat commented 6 days ago

Hi. I see you have fairly advanced inference code in your sample, so I'm not sure exactly what you're doing there. I updated the exllamav2 sample integration notebook for the recent exllamav2 changes (filters are now passed as a parameter to the generate function rather than set in the sampler settings), and it works. I also tried the model from your notebook (llama3 8b gptq) and it works as well. Can you reproduce the problem with the repository's notebook?
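Roughly, the change is that the filters now go to the generate call instead of the settings object — a minimal sketch, assuming the prompt, settings, and filters from the setup above (exact signatures may differ between exllamav2 versions):

```python
# Old style: filters attached to the sampler settings
# settings.filters = filters

# New style: filters passed directly to the generate call
output = generator.generate_simple(
    prompt,
    settings,
    256,              # max new tokens (illustrative value)
    filters=filters,  # ExLlamaV2TokenEnforcerFilter instance(s)
)
```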

CortexPE commented 5 days ago

I've found the problem: settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id]) is being honored a little too well, while we also run our own stopword check.

Removing it causes infinite generation with the streaming generator, but fixes the issue with the base generator... so I'll just have to work around this quirk somehow.
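A minimal sketch of what I mean (generator, tokenizer, prompt, and filters are assumed from the earlier setup; the token count is illustrative):

```python
from exllamav2.generator import ExLlamaV2Sampler

settings = ExLlamaV2Sampler.Settings()

# The culprit: with EOS disallowed, the sampler can never emit the EOS token,
# so generation only stops once the max token limit is reached.
# settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id])

# With the line above removed, the base generator stops correctly once the
# enforcer allows EOS; the streaming path still needs its own stop handling.
output = generator.generate_simple(prompt, settings, 256, filters=filters)
```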