crclark opened this issue 7 months ago
Extremely hacky, but I managed to work around this by passing my own byte_decoder as part of the tokenizer.
e.g. for Llama 3, you can brute-force reverse-engineer the byte encodings like this. For completeness, you then need to add in the encodings for byte values which never appear in valid UTF-8, but it can be observed that Llama assigned those encodings alphabetically. Then just assign the byte_decoder to the tokenizer and you are good to go:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    padding_side="left",
    trust_remote_code=True,  # token=hf_token
)

byte_decoder = {}
alphabet = pre_tokenizers.ByteLevel(False, False).alphabet()
known_vals = set([])

for j in range(256):
    for k in range(256):
        for l in range(256):
            if len(byte_decoder.keys()) < 256:
                b = b""
                vals = [j, k, l]
                if not set(vals).issubset(known_vals):
                    for d in range(3):
                        b = b + int.to_bytes(vals[d])
                    try:
                        c = b.decode()
                        t = pre_tokenizers.ByteLevel(False, False).pre_tokenize_str(c)[0][0]
                        for m in range(3):
                            if t[m] not in byte_decoder.keys():
                                byte_decoder[t[m]] = vals[m]
                                known_vals.add(vals[m])
                    except UnicodeDecodeError:
                        pass

print(len(byte_decoder))

byte_decoder['À'] = 192
byte_decoder['Á'] = 193
byte_decoder['ð'] = 240
byte_decoder['ñ'] = 241
byte_decoder['ò'] = 242
byte_decoder['ó'] = 243
byte_decoder['ô'] = 244
byte_decoder['õ'] = 245
byte_decoder['ö'] = 246
byte_decoder['÷'] = 247
byte_decoder['ø'] = 248
byte_decoder['ù'] = 249
byte_decoder['ú'] = 250
byte_decoder['û'] = 251
byte_decoder['ü'] = 252
byte_decoder['ý'] = 253
byte_decoder['þ'] = 254
byte_decoder['ÿ'] = 255

from guidance import models

tokenizer.byte_decoder = byte_decoder
lm = models.Transformers("meta-llama/Meta-Llama-3-8B-Instruct", tokenizer=tokenizer)
I am unable to use guidance due to this bug. @Harsha-Nori, do you know if a fix is forthcoming? Thanks.
What's pre_tokenizers? How do I import it?
Ah sorry, it's from the Hugging Face tokenizers library.
from tokenizers import pre_tokenizers
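For anyone wondering what that pre-tokenizer is doing in the workaround, here is a tiny check (the output shown in the comment is approximate):

from tokenizers import pre_tokenizers

bl = pre_tokenizers.ByteLevel(False, False)  # same arguments as in the snippet above
# Bytes outside printable ASCII get remapped to printable stand-in characters;
# the leading space becomes "Ġ", so this prints roughly [('Ġhi', (0, 3))]
print(bl.pre_tokenize_str(" hi"))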
Hello, I found a bug in your code. The code
b = b + int.to_bytes(vals[d])
should be changed to
b = b + vals[d].to_bytes(1, 'little', signed=False)
Thanks! I swapped it out and found no difference on my local setup, but setting specific arguments should be more robust across installations.
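For context, a minimal illustration of why the explicit arguments are the portable choice: int.to_bytes only gained default values for length and byteorder in Python 3.11, so the bare call fails on older interpreters.

v = 196
# b = int.to_bytes(v)                       # works on Python 3.11+ only (length defaults to 1)
b = v.to_bytes(1, "little", signed=False)   # explicit and portable; byteorder is moot for one byte
assert b == bytes([v])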
Hey, @paulbkoch and I are looking into this, though we're balancing quite a few things on our plate at the moment. Would be really helpful to collect a list of models where this error occurs (beyond 'meta-llama/Meta-Llama-3-8B') if anyone else has run into this recently.
@Harsha-Nori I just encountered this issue with another model, core42/jais-13b. Hope this helps.
Gemma models also don't work. They directly encode the bytes as tokens (e.g. "ൠ" is tokenized as '<0xE0>' '<0xB5>' '<0xA0>'). This doesn't translate well into the byte_decoder paradigm as tested by guidance.
EDIT: I was using 0.1.13, but this is fixed in 0.1.14 by treating sentencepiece models separately. (Requires setting use_fast=False on the tokenizer)
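For anyone on an older guidance hitting this, a rough sketch of how those byte-fallback tokens can be mapped back to raw bytes; the '<0xNN>' format and the '▁' space marker are standard sentencepiece conventions, and this is only illustrative, not guidance's actual code path:

import re

def token_to_bytes(token: str) -> bytes:
    # sentencepiece byte-fallback tokens like "<0xE0>" each stand for one raw byte
    m = re.fullmatch(r"<0x([0-9A-Fa-f]{2})>", token)
    if m:
        return bytes([int(m.group(1), 16)])
    # ordinary tokens mark a leading space with "▁" (U+2581)
    return token.replace("\u2581", " ").encode("utf-8")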
Is there any up to date list of models that are supported? I just tried llama 3 8B instruct and mistral 7b instruct and neither works. I saw from the README that 'gpt2' was used with transformers and so I can use that, but would prefer to use newer models
I've had the following working in the last 24 hours:
- Llama 3 8/70B instruct
- Mixtral 8x7B
- Gemma 7B
- Llama 2 70B
For many of them you need to load the tokenizer with use_fast=False in order to have the sentencepiece model available to guidance. (Requires 0.1.14)
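As a concrete illustration of that pattern (Mixtral is just an example id; the other sentencepiece-based models in the list load the same way):

from transformers import AutoTokenizer
from guidance import models

# use_fast=False loads the slow, sentencepiece-backed tokenizer, which exposes sp_model
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1", use_fast=False)
lm = models.Transformers("mistralai/Mixtral-8x7B-Instruct-v0.1", tokenizer=tokenizer)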
Hey @am-bean, thanks for chasing this down! We should be loading with use_fast=False eventually -- do you know if this is happening automatically, or do you have to manually instantiate and pass in the tokenizer yourself? If so, I think there's a bug I'd like to chase down.
Glad to hear all these models work for you!
I tried updating to 0.1.14, instantiated a tokenizer with use_fast=False, but the model still fails with the same error
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", use_fast=False)
llama3 = models.Transformers("meta-llama/Meta-Llama-3-8B-Instruct", tokenizer=tokenizer)

still gives AssertionError: The passed tokenizer does not have a byte_decoder property and using a standard gpt2 byte_decoder fails!
@Harsha-Nori I needed to instantiate the tokenizers myself and pass them in, at least for Mixtral. I didn't test the others since it was easier just to have consistent code for all the models.
I arrived at loading them separately as a solution because the error was indicating that I had ended up trying to load in gpt2 as the tokenizer, which meant that hasattr(tokenizer, "sp_model") was False. Unfortunately, at a quick scan I agree with your expected behaviour and don't see the bug.
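A quick way to see which tokenizer variant you actually got, if it helps anyone debugging the same thing (model id illustrative):

from transformers import AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # illustrative
fast = AutoTokenizer.from_pretrained(model_id)                  # Rust-backed fast tokenizer
slow = AutoTokenizer.from_pretrained(model_id, use_fast=False)  # sentencepiece-backed slow tokenizer
print(hasattr(fast, "sp_model"), hasattr(slow, "sp_model"))     # expected: False True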
Llama 3 uses a different type of tokenizer than the rest in that list. You'll need to follow the instructions I posted way above about setting a custom byte_decoder.
Thanks @am-bean. Yes your fix works. I was hoping for a less (in your words) "hacky" solution.
Hi @Harsha-Nori, my current model uses BloomTokenizer, which has neither a byte_decoder nor an sp_model attribute, so according to the code it falls back to the gpt2 byte_decoder. Does this affect how my model ultimately behaves? Also, what roles do byte_decoder and sp_model actually play for a tokenizer? Thank you.
Hey @am-bean, I've run into the same problem, and your approach seems to work, but I don't fully understand it, so I'd like to ask for your guidance. Why would the byte_decoder differ between tokenizers? My (admittedly superficial) understanding is that in gpt2 the byte_decoder is built by picking 256 visible characters to serve as the byte-level base vocabulary, so that any UTF-8 string can be split into a concatenation of these bytes for subsequent merges. From that perspective, how the mapping is established shouldn't matter as long as some such mapping exists. Where does my understanding go wrong? And if I'm using BloomTokenizer (use_fast=True), how should I implement a byte_decoder? Thanks in advance.
Is anyone getting a KeyError: '▁'?
def add_byte_decoder():
    byte_decoder = {}
    alphabet = pre_tokenizers.ByteLevel(False, False).alphabet()
    known_vals = set([])
    for j in range(256):
        for k in range(256):
            for l in range(256):
                if len(byte_decoder.keys()) < 256:
                    b = b""
                    vals = [j, k, l]
                    if not set(vals).issubset(known_vals):
                        for d in range(3):
                            b = b + int.to_bytes(vals[d])
                        try:
                            c = b.decode()
                            t = pre_tokenizers.ByteLevel(False, False).pre_tokenize_str(c)[0][0]
                            for m in range(3):
                                if t[m] not in byte_decoder.keys():
                                    byte_decoder[t[m]] = vals[m]
                                    known_vals.add(vals[m])
                        except UnicodeDecodeError:
                            pass
    byte_decoder['À'] = 192
    byte_decoder['Á'] = 193
    byte_decoder['ð'] = 240
    byte_decoder['ñ'] = 241
    byte_decoder['ò'] = 242
    byte_decoder['ó'] = 243
    byte_decoder['ô'] = 244
    byte_decoder['õ'] = 245
    byte_decoder['ö'] = 246
    byte_decoder['÷'] = 247
    byte_decoder['ø'] = 248
    byte_decoder['ù'] = 249
    byte_decoder['ú'] = 250
    byte_decoder['û'] = 251
    byte_decoder['ü'] = 252
    byte_decoder['ý'] = 253
    byte_decoder['þ'] = 254
    byte_decoder['ÿ'] = 255
    return byte_decoder

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", padding_side="left", trust_remote_code=True)
tokenizer.byte_decoder = add_byte_decoder()
model = AutoModelForCausalLM.from_pretrained(name, quantization_config=nf4_config, device_map="auto")
lm = guidance.models.Transformers(model=model, tokenizer=tokenizer)
Loading checkpoint shards: 100%|██████████| 3/3 [00:06<00:00, 2.25s/it]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[33], line 4
2 tokenizer.byte_decoder = add_byte_decoder()
3 model = AutoModelForCausalLM.from_pretrained(name, quantization_config=nf4_config, device_map="auto")
----> 4 lm = guidance.models.Transformers(model=model, tokenizer=tokenizer)
File ~/miniconda3/envs/dafee/lib/python3.11/site-packages/guidance/models/transformers/_transformers.py:196, in Transformers.__init__(self, model, tokenizer, echo, compute_log_probs, **kwargs)
193 def __init__(self, model=None, tokenizer=None, echo=True, compute_log_probs=False, **kwargs):
194 '''Build a new Transformers model object that represents a model in a given state.'''
195 super().__init__(
--> 196 TransformersEngine(model, tokenizer, compute_log_probs, **kwargs),
197 echo=echo
198 )
File ~/miniconda3/envs/dafee/lib/python3.11/site-packages/guidance/models/transformers/_transformers.py:101, in TransformersEngine.__init__(self, model, tokenizer, compute_log_probs, **kwargs)
97 self._cached_logits = None
98 self._cached_token_ids = []
100 super().__init__(
--> 101 TransformersTokenizer(model, tokenizer),
102 compute_log_probs=compute_log_probs
103 )
File ~/miniconda3/envs/dafee/lib/python3.11/site-packages/guidance/models/transformers/_transformers.py:41, in TransformersTokenizer.__init__(self, model, tokenizer, ignore_bos_token)
39 byte_tokens = []
40 for i in range(len(tokenizer)):
---> 41 byte_coded = bytes([byte_decoder[c] for c in tokenizer.convert_ids_to_tokens(i)])
42 byte_tokens.append(byte_coded)
46 # the superclass does most of the work once we have the tokens
File ~/miniconda3/envs/dafee/lib/python3.11/site-packages/guidance/models/transformers/_transformers.py:41, in <listcomp>(.0)
39 byte_tokens = []
40 for i in range(len(tokenizer)):
---> 41 byte_coded = bytes([byte_decoder[c] for c in tokenizer.convert_ids_to_tokens(i)])
42 byte_tokens.append(byte_coded)
46 # the superclass does most of the work once we have the tokens
KeyError: '▁'
You'll need to adapt the code a bit to make it work for a tokenizer other than Llama 3. The hardcoding at the end is from me reverse engineering by hand and won't apply to other models.
(For what it's worth, the iteration over 256 × 256 × 256 is horribly inefficient, since valid UTF-8 sequences are sparse over most byte prefixes, but I didn't bother to fix it. If you write something more robust you might also want to replace that.)
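If anyone does want something more robust, one option is to rebuild the fixed byte-to-character table that byte-level BPE tokenizers use instead of brute-forcing it. This is a sketch that assumes the model follows the standard GPT-2-style byte-level alphabet, so sanity-check a few entries (e.g. 'À' -> 192, 'Ġ' -> 32) against your tokenizer before trusting it:

def build_byte_decoder():
    # Printable latin-1 bytes map to themselves; every other byte gets a
    # stand-in character shifted above U+00FF so all 256 bytes are visible.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    # byte_decoder maps the stand-in character back to the raw byte value
    return {chr(c): b for b, c in zip(bs, cs)}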
@am-bean - I confirm your fix works for Llama3!
Any pointers on where to look in the Guidance repo to understand how to adapt the hack you put together for Llama 3 to other LLMs, such as Mistral / Starling / et al.? I'm not sure how you put the pieces together, and I haven't crawled the repo myself for clues. Thanks for any insights!
@lashmore - Most of what I needed to work this out is actually from Hugging Face and not Guidance itself. The thing to look for is what tokenisers are being used and how they encode unicode values. The tokenizers package is the most relevant.
Seeing as this issue keeps attracting attention, maybe I can find time to write a better version of the workaround and add a PR...
@am-bean please let us know if you do end up doing that! I'm looking at the tokenizers package right now. I see Mistral, Starling and Llama 3 all use SentencePieceBPETokenizer tokenization, but I'll need to stare at it much longer to understand how to construct the add_byte_decoder(). Thanks for the pointer, and do let us know if you go ahead and work on a PR!
Currently I have tried 4 models that fit in my gpu:
Hey @arbitropy (and everyone reading), really appreciate you testing these out for us. We've been updating our support and infrastructure in our release candidate, and tried to improve the tokenizer loading logic there. Would you mind trying with the pre-release guidance (pip install guidance --pre) and seeing if that does any better?
If they still don't work on the release candidate, nailing down support for gemma/llama/command r is definitely something we need to investigate
@riedgar-ms @nking-1 @hudson-ai FYI
The bug

It seems that many models loaded with models.Transformers() error out with:

AssertionError: The passed tokenizer does have a byte_decoder property and using a standard gpt2 byte_decoder fails!

(side note: I think this error intends to say "does not have a byte_decoder property")

Is this expected behavior? If so, it needs more docs.

To Reproduce

The model id can be replaced with many others and I get the same error -- mistral, llama 2, etc.
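A minimal reproduction is along these lines (the model id is interchangeable):

from guidance import models

lm = models.Transformers("meta-llama/Meta-Llama-3-8B")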
System info (please complete the following information):
Guidance version (guidance.__version__): 0.1.13