Closed Rybens92 closed 2 months ago
Hi @Rybens92, thank you so much for reporting this! You're exactly right -- a self-consistency check failure means we're re-tokenizing the input (and re-filling the KV cache) as we're templating text. Phi-3's tokenizer is incredibly odd in its handling of whitespace characters: re-tokenizing the same text (text -> tokens -> text -> tokens) produces different tokenizations each time. I thought we patched this, but it might not have caught all the edge cases.
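To illustrate the kind of round-trip instability described above, here is a minimal, self-contained sketch (a toy tokenizer, not guidance's or Phi-3's actual code) showing how a SentencePiece-style dummy whitespace prefix can make re-tokenization non-idempotent:

```python
# Toy illustration (NOT guidance's or Phi-3's real code) of why a
# text -> tokens -> text -> tokens round trip can produce different
# tokenizations when the tokenizer uses a dummy whitespace prefix.

class ToyTokenizer:
    def encode(self, text):
        # Dummy-prefix behavior: always prepend a space before tokenizing,
        # then mark each space-prefixed chunk with a "▁" piece marker.
        text = " " + text
        return ["▁" + chunk for chunk in text.split(" ")[1:]]

    def decode(self, tokens):
        # Decoding maps "▁" back to a literal space -- including the
        # dummy prefix, so the decoded text gains a leading space.
        return "".join(t.replace("▁", " ") for t in tokens)

def is_roundtrip_consistent(tokenizer, text):
    """Re-tokenize the decoded text and compare token sequences."""
    tokens = tokenizer.encode(text)
    retokens = tokenizer.encode(tokenizer.decode(tokens))
    return tokens == retokens

tok = ToyTokenizer()
print(tok.encode("hello world"))                           # ['▁hello', '▁world']
print(tok.encode(tok.decode(tok.encode("hello world"))))   # ['▁', '▁hello', '▁world']
print(is_roundtrip_consistent(tok, "hello world"))         # False
```

Each additional round trip adds another stray `▁` token, which is exactly the kind of drift a self-consistency check is meant to catch.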
Could I ask a few questions?
1) Does this still happen on the release candidate of guidance with our new parser? You can install it with:

```
pip install guidance --pre -U
```

and you should see guidance version 0.2.0rc1 being installed.
2) Where did you pull the .gguf files for Phi-3.5-mini and Gemma 2 2B-it? I'd like to make sure I'm using the same GGUFs when debugging on our side.
Thank you! After installing the pre-release version, generation time is now 332.99 seconds. That is much better, and the UserWarning no longer occurs!
As for your second question, I use quants from @bartowski, and for small models I go with Q8 or Q6.
Thanks again for your advice and for your help!
The bug
When using Phi-3.5 mini, eval and text generation take several times longer than with Gemma 2 2B-it, even though they are similarly small models. On top of that, I am getting UserWarning: Self-consistency check in _cleanup_tokens() failed. I don't know if this makes a difference.
To Reproduce
Use system(), user(), and assistant() blocks with Phi? Not sure.
Answering the same question takes:
- Gemma 2 2B-it: 162.68 seconds
- Phi-3.5 mini: 887.36 seconds
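A minimal wall-clock wrapper along these lines is enough to reproduce timings like the above (the workload here is a stand-in; substitute the actual guidance generation call):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; replace with the model call being benchmarked.
result, elapsed = timed(sum, range(1_000_000))
print(f"{elapsed:.2f} seconds")
```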
Is guidance re-processing all prompts from scratch when the self-consistency check warning occurs?
System info (please complete the following information):