Closed CortexPE closed 5 days ago
The problem also occurs with mistral models.
Hi. I see you have pretty advanced inference code in your sample; I'm not sure what you're doing there. I updated the exllamav2 sample integration notebook with the exllamav2 changes (filters are now passed as a parameter to the generate function rather than set on the sampler settings), and it works. I also tried it with the model from your notebook (llama3 8b GPTQ) and it works too. Can you reproduce the problem with the repository's notebook?
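To make the API change above concrete, here is a minimal illustration using hypothetical stand-in objects (the real exllamav2 classes and signatures may differ): the old integration stored filters on the sampler settings, while the updated one passes them directly to the generate call.

```python
# Illustration only: hypothetical stand-ins, not exllamav2's actual API.

class Settings:
    def __init__(self):
        self.filters = []  # old style: filters lived on the sampler settings


def generate_simple(prompt, settings, num_tokens, filters=None):
    # new style: filters arrive as an explicit argument to the generate call;
    # fall back to settings.filters only for old-style callers
    active = filters if filters is not None else settings.filters
    return f"generated with {len(active)} filter(s)"


settings = Settings()
token_filter = object()  # stands in for ExLlamaV2TokenEnforcerFilter

# Old integration code would have done: settings.filters = [token_filter]
# The updated integration passes the filter directly:
print(generate_simple("prompt", settings, 64, filters=[token_filter]))
```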
I've caught the problem: it's settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id])
being honored a little too well while we also run our own stopword check.
Removing it causes infinite generation with the streaming generator, but fixes the issue with the base generator... so I'll just have to work around this quirk somehow.
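The interaction can be shown with a toy simulation (not exllamav2's actual code; all names here are made up): when the sampler is forbidden from ever emitting EOS, a generator that stops only on EOS runs until the token limit, while a generator that also checks decoded stop strings can still terminate early.

```python
# Toy simulation of EOS-based vs. stopword-based stopping.
EOS = 0

def sample(step, disallowed):
    """Pretend sampler: wants to emit EOS at step 3, ordinary tokens otherwise."""
    tok = EOS if step == 3 else step + 1
    if tok in disallowed:
        tok = 99  # a real sampler would be forced to pick some other token
    return tok

def generate(max_new_tokens, disallowed, stop_strings=()):
    out = []
    for step in range(max_new_tokens):
        tok = sample(step, disallowed)
        if tok == EOS:                # EOS-based stopping
            break
        out.append(tok)
        text = " ".join(map(str, out))
        if any(s in text for s in stop_strings):  # stopword-based stopping
            break
    return out

print(len(generate(10, disallowed=set())))                        # 3: stops on EOS
print(len(generate(10, disallowed={EOS})))                        # 10: runs to the limit
print(len(generate(10, disallowed={EOS}, stop_strings=("99",))))  # 4: stopword catches it
```

This mirrors the symptom above: with EOS banned, only a generator that runs its own stopword check terminates before max_new_tokens.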
I'm currently looking to integrate lm-format-enforcer into exllamav2, but it seems that generation never stops until the max token limit is exhausted. This might be related to #110 .
For this test, I used the schemas and prompts mentioned in the readme and in colab_exllamav2_integration.ipynb
I'm unsure whether this issue occurs only with llama3 models, or also with mistral-based or llama2 models.
Example output:
Control:
ExLlamaV2TokenEnforcerFilter(JsonSchemaParser(None), tokenizer_data)
ExLlamaV2TokenEnforcerFilter(JsonSchemaParser(schema), tokenizer_data)
Colab Notebook:
https://colab.research.google.com/drive/1hOzIQFvW9xw9bEL86A39497cb3boP6Ou?usp=sharing