Closed — hudson-ai closed this 1 month ago
Any chance you have an old version of phi-3-mini in your HF cache? I'm not able to reproduce this on my Mac
I'll take a look. My PR is running into this issue in CI too though (although the other newly opened PRs have no such problem...)
Nothing makes sense and I can't repro on my other machine...
I'll take a look in a minute also
Can't repro on my macbook with the following:

```python
import guidance
from guidance import gen, models

model = models.Transformers("microsoft/Phi-3-mini-4k-instruct", echo=False)

@guidance
def gen_test(lm):
    lm = lm + "Scott is a " + gen(name="test", max_tokens=40) + "-Hi there"
    return lm

model += gen_test()
print(str(model))
```

Output:

```
Scott is a doctor who sees 5 adult patients and 4 child patients every hour. Adult visits cost $50, while child visits cost $30. If Scott works for 8 hours-Hi there
```
Ok, thanks for the support y'all :) Clearly something funky going on on my side, maybe something stuck in a cache from my PR branch...
@nking-1 can repro, so at least I'm not crazy!
So weird...my best guess is that there may be old versions of phi-3 (which has had updates recently) in the HF cache on your local machines? I downloaded a fresh copy when I tried on my side.
Ooh very funky. @Harsha-Nori it looks like it comes down to the difference between the `use_fast=True` and `use_fast=False` paths taken when instantiating the `AutoTokenizer`...
I can repro by uninstalling `sentencepiece`, which causes the `use_fast=False` branch to fail (silently, I might add... that exception handler should probably be tightened). Can you confirm on your side?
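The failure mode can be sketched in isolation. This is a hypothetical stand-in, not the actual guidance source: a broad `except` around the slow-tokenizer construction swallows the `ImportError` from the missing dependency, so the caller silently lands on a different tokenizer path.

```python
# Hypothetical sketch of the silent-fallback pattern under discussion.
# build_slow_tokenizer stands in for AutoTokenizer.from_pretrained(...,
# use_fast=False), which raises ImportError when sentencepiece is absent.
def build_slow_tokenizer():
    raise ImportError("sentencepiece is required for the slow tokenizer")

def build_fast_tokenizer():
    return "fast-tokenizer"  # stand-in object

def get_tokenizer():
    try:
        return build_slow_tokenizer()
    except Exception:
        # Broad handler: the missing-dependency error is swallowed, and
        # the caller never learns it ended up on a different tokenizer.
        return build_fast_tokenizer()

print(get_tokenizer())  # prints "fast-tokenizer" with no warning at all
```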
I'm still totally unsure why this is causing the llguidance PR branch to fail CI... The workflow still appears to be installing `sentencepiece`...
Tricky tricky. The PR removed the protobuf dependency, causing a silent exception in the exact same place:

```
ImportError:
The new behaviour of LlamaTokenizer (with `self.legacy = False`) requires the protobuf library but it was not found in your environment. Checkout the instructions on the
installation page of its repo: https://github.com/protocolbuffers/protobuf/tree/master/python#installation and follow the ones
```
Sorry, only just saw this. I have been wondering if we should be tweaking the `TransformersTokenizer` slightly. Specifically, swapping these two blocks:
https://github.com/guidance-ai/guidance/blob/181ea1ec5875ca3cd81ea58717b37102f50ef28d/guidance/models/transformers/_transformers.py#L76
https://github.com/guidance-ai/guidance/blob/181ea1ec5875ca3cd81ea58717b37102f50ef28d/guidance/models/transformers/_transformers.py#L89
Basically, prioritise @ambisinister's method of constructing the byte decoder over the one from sentencepiece (if both are available).
@riedgar-ms it looks like the `get_vocab` approach fails when the `use_fast=False` path is taken (at least for `microsoft/Phi-3-mini-4k-instruct`), and that error was obfuscated by the fact that the `sentencepiece` path is prioritized. It may be worth looking into that issue before prioritizing the new implementation.
There's another issue lurking here though -- the silent fallback to a different tokenizer made this a really difficult bug to track down/repro. I'd advocate for separating the except block into a few cases:

1. `ImportError` -- transformers will raise this if the user needs to install an additional dependency for the `fast=False` path to work. I think we should at a minimum give a warning if we are going to fall back in this case, but I'd actually maybe argue for just raising the exception.
2. `AssertionError` -- the tokenizer is missing some bytes. Falling back is fine, but maybe it's worth a warning?

My biggest caution here is that I don't actually know why we're falling back here. Was the original intention to fall back in just cases 1 and 2? Just 2 and 3? etc.
I suspect that the original intention may have been "This new model is failing, what can we do to get it working?" A more systematic approach is probably in order.
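Concretely, separating the handler might look something like this. A sketch with hypothetical stand-in constructors, not the actual guidance code:

```python
import warnings

# Stand-in for the use_fast=False construction path; raises ImportError
# when a dependency (e.g. sentencepiece or protobuf) is missing.
def build_slow_tokenizer():
    raise ImportError("missing dependency")

def build_fast_tokenizer():
    return "fast-tokenizer"  # stand-in object

def get_tokenizer():
    try:
        return build_slow_tokenizer()
    except ImportError:
        # Case 1: a dependency is missing -- surface it to the user
        # instead of silently switching tokenizer behaviour.
        raise
    except AssertionError:
        # Case 2: the tokenizer is missing some bytes -- falling back is
        # acceptable, but the user should at least be told.
        warnings.warn("slow tokenizer is missing bytes; falling back to fast path")
        return build_fast_tokenizer()
```

With this split, the sentencepiece/protobuf failures above would have surfaced immediately instead of masquerading as a different tokenizer.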
I encountered the same error on Linux, and it turned out that `sentencepiece` wasn't installed. Installing `sentencepiece` resolved the problem.
The bug
The issue seems to be with this line. The `'<0x20>'` token is not surviving a tokenize/detokenize round-trip, and we get an `IndexError`.

To Reproduce
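A toy illustration of the failure mode (a hypothetical byte decoder, not guidance's actual implementation): if the decoder built from the vocabulary is missing the space byte `<0x20>`, any text containing a space fails to round-trip with a lookup error.

```python
# Toy byte decoder that maps byte-level tokens like '<0x48>' to raw
# bytes, but is (deliberately) missing the space byte <0x20>.
byte_decoder = {f"<0x{b:02X}>": b for b in range(256) if b != 0x20}

def decode_tokens(tokens):
    # Detokenize a sequence of byte-level tokens back into raw bytes.
    return bytes(byte_decoder[t] for t in tokens)

print(decode_tokens(["<0x48>", "<0x69>"]))  # b'Hi'
# decode_tokens(["<0x20>"]) raises a lookup error (KeyError here),
# analogous to the IndexError seen in the real tokenizer.
```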
System info (please complete the following information):
- `guidance.__version__`: current head

@riedgar-ms @ambisinister tag :)