riedgar-ms opened this issue 7 months ago (Open)
Note that AFAICT the actions on the PR are being OoM (or disk space) killed. However, that's another problem.
In my case, this is because the mistral tokenizer fell back to the fast tokenizer, which left `sp_model` missing. Installing `sentencepiece` solved it for me.
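For context, a quick way to see which backend you actually got (a minimal sketch; the model id is just an illustration):

```python
# Sketch: check whether the slow, sentencepiece-backed tokenizer loads.
# Without sentencepiece installed this call fails, and code that catches
# the failure and falls back to the fast tokenizer ends up with no sp_model.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", use_fast=False)
print(hasattr(tok, "sp_model"))  # True for the slow tokenizer
```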
But then I get an error during token cleanup, so I modified it like this:
```python
# ugly hack to deal with sentencepiece craziness of space hiding after special tokens
# TODO: figure out how to make this more robust
diff = token_byte_positions[-1] - last_pos
if diff > 0:
    for _ in range(diff):
        if self.tokenizer.tokens[token_ids[0]] == b'<s>' \
                and self.tokenizer.tokens[token_ids[1]][0:1] == b' ':
            for i in range(1, len(token_byte_positions)):
                token_byte_positions[i] -= 1
assert token_byte_positions[-1] == last_pos
```
Hmmm.... adding `sentencepiece` to my pip installs is at least allowing my tests to get further. However, things are running a bit slowly, and I don't know if they will succeed yet.
Forgive me if I'm wrong, but the problem occurs because the default gpt2 byte encoder doesn't contain all unicode characters.

This is the range from gpt2's `byte_encoder`, which is mapped by `bytes_to_unicode`:

```python
list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1))
```

From `GPT2Tokenizer`'s init function:

```python
self.byte_encoder = bytes_to_unicode()
self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
```

The string `’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨` fails since it contains characters that are not in the list.
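A minimal sketch of the failure mode (assuming a stock gpt2 tokenizer): the keys of `byte_decoder` are only the stand-in characters produced by `bytes_to_unicode`, so any other character has no entry.

```python
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
s = "’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨"
# Characters with no byte_decoder entry trigger a KeyError when mapped back to bytes.
missing = [c for c in s if c not in tok.byte_decoder]
print(missing)  # non-empty for this string
```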
So, the question is, is it necessary to check this string?
The `assert` on that string should definitely be moved to a separate test. That might let some things work, but the underlying problem would still remain: the model can't cope with some valid unicode strings.
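A separate test could look something like this (a hypothetical pytest sketch; the `tokenizer` fixture and the round-trip check are assumptions, not the project's actual test):

```python
import pytest

HARD_UNICODE = "’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨"

@pytest.mark.xfail(reason="some tokenizers lack a byte_decoder covering all of unicode")
def test_unicode_roundtrip(tokenizer):
    # `tokenizer` is a hypothetical fixture supplying the model's tokenizer.
    assert tokenizer.decode(tokenizer.encode(HARD_UNICODE)) == HARD_UNICODE
```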
I think this should just give a warning instead. I mean, the original issue with mistral can already be solved by installing `sentencepiece`. The gpt2 case is already the worst-case scenario, right? And realistically, it's not possible to support every model out there. Just give a warning that the model has no byte decoder, or some other message to inform the user.
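For illustration, the warning approach might look like this (a hypothetical helper, not the library's current code):

```python
import warnings

def check_byte_decoder(tokenizer):
    # Warn instead of raising when the tokenizer can't map every unicode
    # character back to bytes.
    if getattr(tokenizer, "byte_decoder", None) is None:
        warnings.warn(
            "Tokenizer has no byte_decoder; some unicode strings may not round-trip."
        )
```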
Hey @yonitjio, I don't understand why installing `sentencepiece` would solve this problem. According to the code, it seems like it would still go to the branch that uses gpt2?
If you don't install `sentencepiece`, the tokenizer will fall back to the fast tokenizer, which doesn't have `sp_model`. See here.
> I mean the original issue with mistral can already be solved by installing `sentencepiece`. If you don't install `sentencepiece`, the tokenizer will fall back to the fast tokenizer, which doesn't have `sp_model`. See here.
I understand what you mean, but I'm currently using `BloomTokenizer`, and with it I can only set `use_fast=True`, because there is only the file `tokenization_bloom_fast.py`. This means the tokenizer I get has neither of the two attributes `byte_decoder` and `sp_model`. My guess is that all fast tokenizers share the same byte-to-unicode mapping as gpt2, so gpt2's `byte_decoder` can be used as a substitute, as sketched below.
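For illustration, the substitution might look like this (a sketch under that assumption; the Bloom model id is just an example):

```python
from transformers import AutoTokenizer, GPT2Tokenizer

# Borrow gpt2's byte_decoder for a fast-only tokenizer, assuming the
# byte<->unicode mapping really is the same.
tok = AutoTokenizer.from_pretrained("bigscience/bloom-560m", use_fast=True)
if not hasattr(tok, "byte_decoder"):
    tok.byte_decoder = GPT2Tokenizer.from_pretrained("gpt2").byte_decoder
```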
I suppose so.
But as I said before, I don't think it's realistic to support every model out there (for now?).
The only other option I can think of, besides warning the user, is to allow a custom function for this, along the lines of the sketch below.
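For example, something along these lines (a hypothetical API sketch, not an existing guidance function):

```python
def resolve_byte_decoder(tokenizer, custom_byte_decoder=None):
    # Let the caller inject their own byte decoder instead of relying on
    # tokenizer internals; fall back to the tokenizer's attribute if present.
    if custom_byte_decoder is not None:
        return custom_byte_decoder
    return getattr(tokenizer, "byte_decoder", None)
```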
The bug
On a freshly created conda environment, attempting to load `mistral-7b` via Hugging Face fails.

To Reproduce
This is based on PR #741
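Roughly (a sketch of the repro; the exact loading code is in the PR, and the model id here is illustrative):

```python
from guidance import models

# Attempt to load the model through guidance's Transformers backend.
lm = models.Transformers("mistralai/Mistral-7B-v0.1")
```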
I wind up with errors:
System info (please complete the following information):
- Guidance Version (`guidance.__version__`): Synced fork