guidance-ai / guidance

A guidance language for controlling large language models.

Cannot instantiate `Transformers("microsoft/Phi-3-mini-4k-instruct")` #958

Closed hudson-ai closed 1 month ago

hudson-ai commented 2 months ago

The bug

The issue seems to be with this line. The '<0x20>' token is not surviving the tokenize/detokenize round trip, and we get an IndexError.

To Reproduce

from guidance import models
model = models.Transformers("microsoft/Phi-3-mini-4k-instruct")

File ~/pkg/guidance-ai/guidance/models/transformers/_transformers.py:104, in TransformersTokenizer.__init__(self, model, transformers_tokenizer, chat_template, ignore_bos_token, **kwargs)
    102 if hasattr(transformers_tokenizer, "convert_tokens_to_string"):
    103     token_str = transformers_tokenizer.convert_tokens_to_string([token])
--> 104     roundtrip_id = transformers_tokenizer.encode(token_str)[0]
    105     if roundtrip_id == i:
    106         byte_coded = token_str.encode()

IndexError: list index out of range
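For reference, the failing round trip can be probed directly against the tokenizer (a minimal sketch for this thread, not guidance code; exact behavior depends on the tokenizer version and which use_fast path ends up being taken):

# Probe of the round trip the traceback points at (a sketch; results
# vary with tokenizer version and the use_fast setting).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
token_str = tok.convert_tokens_to_string(["<0x20>"])
print(repr(token_str))        # if this comes back empty...
print(tok.encode(token_str))  # ...encode can return [], and [0] then raises IndexError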


@riedgar-ms @ambisinister tag :)

Harsha-Nori commented 2 months ago

Any chance you have an old version of phi-3-mini in your HF cache? I'm not able to reproduce this on my Mac.

Harsha-Nori commented 2 months ago
[screenshot]
hudson-ai commented 2 months ago

I'll take a look. My PR is running into this issue in CI as well (although the other newly opened PRs have no such problem...)

hudson-ai commented 2 months ago

Nothing makes sense and I can't repro on my other machine...

ambisinister commented 2 months ago

I'll take a look in a minute also

ambisinister commented 2 months ago

Can't repro on my MacBook with the following:

import guidance
from guidance import gen, models

model = models.Transformers("microsoft/Phi-3-mini-4k-instruct", echo=False)

@guidance
def gen_test(lm):
    lm = lm + "Scott is a " + gen(name="test", max_tokens=40) + "-Hi there"
    return lm

model += gen_test()

print(str(model))

output

Scott is a doctor who sees 5 adult patients and 4 child patients every hour. Adult visits cost $50, while child visits cost $30. If Scott works for 8 hours-Hi there
hudson-ai commented 2 months ago

Ok, thanks for the support y'all :) Clearly something funky going on on my side, maybe something stuck in a cache from my PR branch...

hudson-ai commented 2 months ago

@nking-1 can repro, so at least I'm not crazy!

Harsha-Nori commented 2 months ago

So weird... my best guess is that there may be old versions of phi-3 (which has had updates recently) in the HF cache on your local machines? I downloaded a fresh copy when I tried on my side.

hudson-ai commented 2 months ago

Ooh very funky. @Harsha-Nori it looks like it comes down to the difference between the use_fast=True and use_fast=False paths taken when instantiating the AutoTokenizer...

I can repro by uninstalling sentencepiece, which causes the use_fast=False branch to fail (silently, I might add... that exception handler should probably be tightened...). Can you confirm on your side?
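Roughly, the shape of the problem is a classic silent fallback (a sketch of the control flow as I understand it, not the actual guidance source):

from transformers import AutoTokenizer

def load_tokenizer(model_name):
    # Sketch: the slow-tokenizer path is tried first; use_fast=False needs
    # sentencepiece for Phi-3-style models, so a missing dependency raises here.
    try:
        return AutoTokenizer.from_pretrained(model_name, use_fast=False)
    except Exception:
        # Silent fallback: the caller never learns the slow path failed.
        return AutoTokenizer.from_pretrained(model_name, use_fast=True)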

I'm still totally unsure why this is causing the llguidance PR branch to fail CI... The workflow still appears to be installing sentencepiece...

hudson-ai commented 2 months ago

Tricky tricky. The PR removed the protobuf dependency, causing a silent exception in the exact same place:

ImportError: 
The new behaviour of LlamaTokenizer (with `self.legacy = False`) requires the protobuf library but it was not found in your environment. Checkout the instructions on the
installation page of its repo: https://github.com/protocolbuffers/protobuf/tree/master/python#installation and follow the ones
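For anyone hitting this, here's a quick way to check whether the optional dependencies are present (a small helper written for this thread, not part of guidance):

import importlib.util

def has_module(name):
    # find_spec raises if a parent package (e.g. google) is missing entirely
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        return False

# protobuf installs under the import name google.protobuf
for name in ("sentencepiece", "google.protobuf"):
    print(name, "ok" if has_module(name) else "MISSING")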
riedgar-ms commented 2 months ago

Sorry, only just saw this. I have been wondering if we should be tweaking the TransformersTokenizer slightly. Specifically swapping these two blocks:

https://github.com/guidance-ai/guidance/blob/181ea1ec5875ca3cd81ea58717b37102f50ef28d/guidance/models/transformers/_transformers.py#L76
https://github.com/guidance-ai/guidance/blob/181ea1ec5875ca3cd81ea58717b37102f50ef28d/guidance/models/transformers/_transformers.py#L89

Basically, prioritise @ambisinister's method of constructing the byte decoder over the one from sentencepiece (if both are available).
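For context, the get_vocab-based construction boils down to inverting the GPT-2 byte-to-unicode table that byte-level tokenizers share (a rough sketch of the idea, assuming a GPT-2-style vocab; not the exact code in the tree):

def gpt2_bytes_to_unicode():
    # GPT-2's mapping: printable bytes map to themselves, the rest are
    # shifted into unused code points so every byte has a visible character.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\xa1"), ord("\xac") + 1))
          + list(range(ord("\xae"), ord("\xff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

# Inverting it gives a byte decoder: unicode character -> original byte.
byte_decoder = {c: b for b, c in gpt2_bytes_to_unicode().items()}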

hudson-ai commented 2 months ago

@riedgar-ms it looks like the get_vocab approach fails when the use_fast=False path is taken (at least for microsoft/Phi-3-mini-4k-instruct), and that error was obfuscated by the fact that the sentencepiece path is prioritized. It may be worth looking into that issue before prioritizing the new implementation.

There's another issue lurking here though -- the silent fallback to a different tokenizer made this a really difficult bug to track down and repro. I'd advocate for separating the except block into a few cases (sketched after this list):

  1. ImportError -- transformers will raise this if the user needs to install an additional dependency for the use_fast=False path to work. I think we should at a minimum give a warning if we are going to fall back in this case, but I'd actually maybe argue for just raising the exception.
  2. AssertionError -- the tokenizer is missing some bytes. Falling back is fine, but maybe it deserves a warning?
  3. Any other error -- maybe don't catch at all until people raise new issues...?
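In code, that separation might look something like this (a sketch only; build_byte_decoder_slow and build_byte_decoder_from_vocab are hypothetical names standing in for the two existing construction paths):

import warnings

try:
    # hypothetical helper: the sentencepiece-based construction
    byte_decoder = build_byte_decoder_slow(transformers_tokenizer)
except ImportError:
    # case 1: a missing optional dependency (sentencepiece, protobuf);
    # surfacing it beats silently switching byte decoders
    raise
except AssertionError as e:
    # case 2: the tokenizer is missing some bytes; falling back is
    # acceptable, but say so
    warnings.warn(f"falling back to vocab-based byte decoder: {e}")
    # hypothetical helper: the get_vocab-based construction
    byte_decoder = build_byte_decoder_from_vocab(transformers_tokenizer)
# case 3: anything else propagates, so new failure modes get reported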

My biggest caution here is that I don't actually know why we're falling back here. Was the original intention to fall back in just cases 1 and 2? Just 2 and 3? etc.

riedgar-ms commented 2 months ago

I suspect that the original intention may have been "This new model is failing, what can we do to get it working?" A more systematic approach is probably in order.

Saibo-creator commented 2 months ago

I encountered the same error on Linux, and it turned out that sentencepiece wasn't installed. Installing it (`pip install sentencepiece`) resolved the problem.