Closed pinoloricato closed 2 months ago
Upon running your notebook I encountered the same issue as in #80 and found a problem with whitespace handling: https://github.com/noamgat/lm-format-enforcer/commit/71736596792c8c7d1fe35a65af562135100aef95. After that change, the loop finishes successfully for me. Can you check whether you can still reproduce the error on v0.9.8?
I just retested on v0.9.8 and was still able to reproduce the missing-opening-bracket problem.
Very strange, I am unable to reproduce. The only error I get after hundreds of generations is when the generation ends due to the token limit and the object is not closed properly. Can you modify the notebook (e.g. by setting a fixed seed) to trigger the problem consistently?
I can reproduce the error consistently by fixing the seed in the completion call as follows:
response = lm("", logits_processor=logits_processors, max_tokens=None, seed=316)
My version of llama-cpp-python is 0.2.62.
OK, I was able to reproduce the bug, thanks! It comes down to the following problem:
>>> self.decoder([6377])
'{"'
>>> self.decoder([1, 6377])
'"'

It seems that token 6377 decodes to `{"` on its own, but the sequence `[1, 6377]` decodes to just `"`. I am not sure why this happens; does anyone know?
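The asymmetry can be reproduced without a model by mimicking the detokenizer's post-processing. This is only an illustrative sketch: `PIECES` is a hypothetical stand-in for `llama_token_to_piece`, not the real vocabulary.

```python
BOS = 1  # beginning-of-sentence token id

# Assumption for illustration: token 6377 decodes to '{"' with *no* leading space.
PIECES = {BOS: b"", 6377: b'{"'}

def detokenize(tokens):
    output = b"".join(PIECES[t] for t in tokens)
    # The problematic post-processing: unconditionally drop the first byte
    # when the sequence starts with BOS, assuming it is a space.
    return output[1:] if tokens and tokens[0] == BOS else output

print(detokenize([6377]))       # b'{"'
print(detokenize([BOS, 6377]))  # b'"' -- the '{' is silently dropped
```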
OK, this is a bug with llama.cpp. From `_internals.py`:
def detokenize(self, tokens: List[int]) -> bytes:
    assert self.model is not None
    output = b""
    size = 32
    buffer = (ctypes.c_char * size)()
    for token in tokens:
        n = llama_cpp.llama_token_to_piece(
            self.model, llama_cpp.llama_token(token), buffer, size
        )
        assert n <= size
        output += bytes(buffer[:n])
    # NOTE: Llama1 models automatically added a space at the start of the prompt
    # this line removes a leading space if the first token is a beginning of sentence token
    return (
        output[1:] if len(tokens) > 0 and tokens[0] == self.token_bos() else output
    )
However, the Llama tokenizer doesn't always add a leading space, as is the case with token 6377. This causes llama-cpp-python to remove the `{` from the beginning of the output, producing your bug. Will open a PR for llama-cpp-python.
Please vote / comment on the issue so that it gets visibility and is merged soon.
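For clarity, here is a minimal sketch of the kind of fix involved (illustrative, not the merged patch; the token ids and pieces are made up): only strip the leading byte when the tokenizer actually inserted a space there.

```python
BOS = 1
# Hypothetical piece table: 6377 has no leading space, 22172 does.
PIECES = {BOS: b"", 6377: b'{"', 22172: b" hello"}

def detokenize_fixed(tokens):
    output = b"".join(PIECES[t] for t in tokens)
    # Strip the space only if it is really there, instead of
    # unconditionally dropping the first byte after BOS.
    if tokens and tokens[0] == BOS and output.startswith(b" "):
        output = output[1:]
    return output

print(detokenize_fixed([BOS, 6377]))   # b'{"' -- brace preserved
print(detokenize_fixed([BOS, 22172]))  # b'hello' -- leading space still stripped
```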
Excellent, thanks for digging into that!
Merged in llama-cpp-python
Same test code as in https://github.com/noamgat/lm-format-enforcer/issues/80 with LMFE version 0.9.6 now occasionally generates JSON with a missing opening bracket, e.g.
"airports":["A", "B"],"cost_of_flight":1250}
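The truncated output above fails to parse, so the problem is easy to detect with a plain `json.loads` check (the repair step is only for demonstration):

```python
import json

# Output from the report above, missing its opening '{'
generated = '"airports":["A", "B"],"cost_of_flight":1250}'

try:
    json.loads(generated)
    print("valid JSON")
except json.JSONDecodeError as e:
    print(f"invalid JSON: {e}")

# Restoring the missing brace makes it parse again
print(json.loads("{" + generated))
```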