guidance-ai / guidance

A guidance language for controlling large language models.

LlamaCpp model crashes with multi-token characters #934

Open knilink opened 3 months ago

knilink commented 3 months ago

The bug A string containing certain Unicode characters causes an exception, likely because the character maps to multiple tokens for this tokenizer:

llama3.engine.tokenizer('歪'.encode('utf8')) -> [15722, 103]
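
For illustration, a minimal sketch in plain Python (no guidance involved): because '歪' spans two tokens, the partial byte sequence a single token may cover is not a complete UTF-8 sequence and cannot be decoded on its own.

```
# '歪' encodes to three UTF-8 bytes split across two tokens, so the partial
# byte sequence a single token may cover is not valid UTF-8 by itself.
char_bytes = '歪'.encode('utf8')   # b'\xe6\xad\xaa'
partial = char_bytes[:2]           # b'\xe6\xad' -- an incomplete code point
try:
    partial.decode('utf8')
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode bytes ...: unexpected end of data
```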

I also tested a Transformers model, which seems to work fine.

To Reproduce

from guidance import models, select
llama3 = models.LlamaCpp('./Meta-Llama-3-8B-Instruct.Q4_0.gguf')
llama3 + '歪' + select(['打正着','门邪道'])
terminate called after throwing an instance of 'std::invalid_argument'
  what():  invalid character
Aborted

System info: Ubuntu 22.04, Python 3.10.12

guidance==0.1.15 llama_cpp_python==0.2.79

Harsha-Nori commented 3 months ago

Hi @knilink, thanks for reporting this! Do you know if this happens when you generate with llama-cpp-python directly? Getting the full stack trace here would be very helpful!

@paulbkoch might have thoughts here too

knilink commented 2 months ago

Hi @Harsha-Nori, I did a bit more investigation and can confirm that the error is caused by sending incomplete UTF-8 byte sequences to the llama_cpp tokenizer:

$ printf '\xe6\xad' | ./llama-tokenize -m ./Meta-Llama-3-8B-Instruct.Q8_0.gguf --stdin
terminate called after throwing an instance of 'std::invalid_argument'
  what():  invalid character
Aborted
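
For completeness, the same failure can presumably be reproduced through llama-cpp-python directly (an untested sketch; the model path is assumed, and vocab_only=True is just a shortcut to skip loading the weights):

```
# Sketch: pass the same incomplete UTF-8 bytes to llama-cpp-python's tokenizer.
from llama_cpp import Llama

llm = Llama(model_path="./Meta-Llama-3-8B-Instruct.Q8_0.gguf", vocab_only=True)
llm.tokenize(b"\xe6\xad", add_bos=False, special=True)
# expected: terminate called after throwing an instance of 'std::invalid_argument'
```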

After adding a byte_string.decode('utf8') call before https://github.com/guidance-ai/guidance/blob/337738322f7d09f36613a4c40f86137c3a0a1553/guidance/models/llama_cpp/_llama_cpp.py#L78, I got the following stack trace:

```
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[21], line 4
      2 # guidance.models._model.ipython_is_imported = False
      3 llama3 = LlamaCpp('/home/jovyan/cache/Meta-Llama-3-8B-Instruct.Q8_0.gguf', file_name='',chat_template=chat.Llama3ChatTemplate, n_gpu_layers=-1,)
----> 4 llama3 + '歪' + select(['打正着','门邪道']) + gen(stop='。')

File /opt/conda/lib/python3.11/site-packages/guidance/models/_model.py:1159, in Model.__add__(self, value)
   1157 # run stateless functions (grammar nodes)
   1158 elif isinstance(value, GrammarFunction):
-> 1159     out = lm._run_stateless(value)
   1161 # run stateful functions
   1162 else:
   1163     out = value(lm)

File /opt/conda/lib/python3.11/site-packages/guidance/models/_model.py:1364, in Model._run_stateless(self, stateless_function, temperature, top_p, n)
   1362 delayed_bytes = b""
   1363 # last_is_generated = False
-> 1364 for chunk in gen_obj:
   1365
   1366     # we make everything full probability if we are not computing uncertainty
   1367     # if not self.engine.compute_log_probs:
   1368     #     chunk.new_bytes_prob = 1.0
   1369
   1370     # convert the bytes to a string (delaying if we don't yet have a valid unicode string)
   1371     lm.token_count += chunk.new_token_count
   1372     chunk.new_bytes = delayed_bytes + chunk.new_bytes

File /opt/conda/lib/python3.11/site-packages/guidance/models/_model.py:732, in Engine.__call__(self, parser, grammar, ensure_bos_token)
    717 def __call__(self, parser, grammar, ensure_bos_token=True):
    718     """Returns a new updated parser state executed through the grammar.
    719
    720     Parameters
    (...)
    729         This is the grammar we are extending the parser with.
    730     """
--> 732     self.start(parser, grammar, ensure_bos_token)
    734     logits = None
    735     while True:

File /opt/conda/lib/python3.11/site-packages/guidance/models/_model.py:264, in Engine.start(self, parser, grammar, ensure_bos_token)
    262 # run a simple tokenizer (that does not use a grammar) on the prefix for better performance
    263 self._token_ids, self._token_byte_positions = self._tokenize_prefix(prompt)
--> 264 self._token_ids, self._token_byte_positions = self._cleanup_tokens(
    265     self._token_ids, self._token_byte_positions
    266 )
    267 if len(self._token_byte_positions) > 0:
    268     self._pre_parser_bytes = self._token_byte_positions[-1]

File /opt/conda/lib/python3.11/site-packages/guidance/models/_model.py:808, in Engine._cleanup_tokens(self, token_ids, token_byte_positions)
    805 def _cleanup_tokens(self, token_ids, token_byte_positions):
    806
    807     # compute a joint tokenization
--> 808     joint_token_ids = self._joint_tokenize(token_ids)
    810     # see if we need to redo the tokenization
    811     redo = False

Cell In[20], line 151, in LlamaCppEngine._joint_tokenize(self, token_ids)
    149 """What a full joint tokenizer would give for a given byte string"""
    150 byte_string = b"".join([self.tokenizer.tokens[t] for t in token_ids])
--> 151 return self.tokenizer(byte_string)

Cell In[20], line 81, in LlamaCppTokenizer.__call__(self, byte_string)
     79 print('[LlamaCppTokenizer] begin', flush=True)
     80 print(byte_string, flush=True)
---> 81 print(byte_string.decode('utf8'), flush=True)
     82 res = self._model_obj.tokenize(byte_string, add_bos=False, special=True)
     83 print('[LlamaCppTokenizer] end', flush=True)

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 17-18: unexpected end of data
```

The Transformers model doesn't have this issue because its _joint_tokenize doesn't call the tokenizer directly. I didn't do much testing, but copying TransformersEngine._joint_tokenize over to LlamaCppEngine seems to fix it.

riedgar-ms commented 2 months ago

@knilink, thank you for bringing this up. I've drafted a (very) tentative fix in #962, which works by chopping bytes off the input to the encode() method until what remains is a valid UTF-8 string. However, I'm really concerned that this is going to cause trouble for us elsewhere.
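
The rough idea is something like this (a sketch only, not the actual change in #962; the helper name is invented):

```
# Sketch of the byte-chopping approach: drop trailing bytes until the
# remainder decodes as valid UTF-8.
def chop_to_valid_utf8(byte_string: bytes) -> bytes:
    for trim in range(4):  # a UTF-8 code point is at most 4 bytes long
        candidate = byte_string[: len(byte_string) - trim]
        try:
            candidate.decode("utf8")
            return candidate
        except UnicodeDecodeError:
            continue
    return byte_string  # not just a truncated tail; leave it alone

# chop_to_valid_utf8(b"\xe6\xad")                -> b""
# chop_to_valid_utf8("歪打".encode("utf8")[:-1]) -> b"\xe6\xad\xaa" (just '歪')
```

Chopping like this silently drops a trailing partial character, which is exactly the kind of side effect that could bite us elsewhere.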

Have you filed your repro `printf '\xe6\xad' | ./llama-tokenize -m ./Meta-Llama-3-8B-Instruct.Q8_0.gguf --stdin` as a bug with llama.cpp?

riedgar-ms commented 2 months ago

I have been doing some more prodding based on @knilink's examples, and I've opened a bug on the HF repo whence I grabbed the model (although this does look like something going wrong at the LlamaCpp layer): https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/discussions/9

riedgar-ms commented 2 months ago

Also filed the bug on LlamaCpp https://github.com/ggerganov/llama.cpp/issues/8691