noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model
MIT License

Missing JSON opening brackets #92

Closed pinoloricato closed 2 months ago

pinoloricato commented 2 months ago

The same test code as in https://github.com/noamgat/lm-format-enforcer/issues/80, run with LMFE version 0.9.6, now occasionally generates JSON with a missing opening bracket, e.g.:

"airports":["A", "B"],"cost_of_flight":1250}

noamgat commented 2 months ago

Upon running your notebook I encountered the same issue as in #80 and found a problem with whitespace handling: https://github.com/noamgat/lm-format-enforcer/commit/71736596792c8c7d1fe35a65af562135100aef95. After that fix, the loop finishes successfully for me. Can you check whether you can still reproduce the error on v0.9.8?

pinoloricato commented 2 months ago

I just retested on v0.9.8 and was still able to reproduce the missing opening bracket problem.

noamgat commented 2 months ago

Very strange, I am unable to reproduce it. The only error I get after hundreds of generations is when the generation ends due to the max token limit and the object is not closed properly. Can you modify the notebook (e.g. by fixing a seed) so that the problem occurs consistently?

pinoloricato commented 2 months ago

I can reproduce the error consistently by fixing the seed in the completion call as follows:

response = lm("", logits_processor=logits_processors, max_tokens=None, seed=316)

My version of llama-cpp-python is 0.2.62.
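
A quick way to scan for failing seeds (a hypothetical sketch, reusing lm and logits_processors from the setup above; the seed range is arbitrary) is to check each completion with json.loads, since the truncated output is not valid JSON:

import json

for seed in range(300, 320):  # arbitrary range around the failing seed
    response = lm("", logits_processor=logits_processors, max_tokens=None, seed=seed)
    text = response["choices"][0]["text"]
    try:
        json.loads(text)
    except json.JSONDecodeError:
        print(f"seed={seed} produced invalid JSON: {text!r}")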

noamgat commented 2 months ago

OK, I was able to reproduce the bug, thanks! It comes down to the following problem:

>>> self.decoder([6377])
'{"'
>>> self.decoder([1, 6377])
'"'

It seems that token 6377 decodes to {" on its own, but the sequence [1, 6377] (BOS followed by the same token) decodes to just ". I am not sure why this happens, does anyone know?
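
The same discrepancy can be shown against llama-cpp-python's public API (a hedged demonstration: the token IDs are specific to the Llama tokenizer in use, and 1 is its BOS token):

# Llama.detokenize returns bytes; token 6377 decodes to '{"' here.
print(lm.detokenize([6377]))     # b'{"'  -- as expected
print(lm.detokenize([1, 6377]))  # b'"'   -- the leading '{' is silently dropped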

noamgat commented 2 months ago

OK, this is a bug in llama-cpp-python.

From _internals.py:

def detokenize(self, tokens: List[int]) -> bytes:
    assert self.model is not None
    output = b""
    size = 32
    buffer = (ctypes.c_char * size)()
    for token in tokens:
        n = llama_cpp.llama_token_to_piece(
            self.model, llama_cpp.llama_token(token), buffer, size
        )
        assert n <= size
        output += bytes(buffer[:n])
    # NOTE: Llama1 models automatically added a space at the start of the prompt
    # this line removes a leading space if the first token is a beginning-of-sentence token
    return (
        output[1:] if len(tokens) > 0 and tokens[0] == self.token_bos() else output
    )

However, the Llama tokenizer doesn't always add a leading space, as is the case with token 6377. When it doesn't, this code still strips the first byte, removing the { from the beginning of the output and producing your bug. Will open a PR for llama-cpp-python.
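
A minimal sketch of a possible fix (not necessarily the patch that was actually merged) is to strip the first byte only when it really is the auto-inserted space:

# Sketch: replace the unconditional strip at the end of detokenize()
# with a check that the first byte actually is the inserted space.
if len(tokens) > 0 and tokens[0] == self.token_bos() and output.startswith(b" "):
    output = output[1:]  # drop only a genuine leading space
return output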

noamgat commented 2 months ago

Please vote / comment on the issue so that it gets visibility and is merged soon.

pinoloricato commented 2 months ago

Excellent, thanks for digging into that!

noamgat commented 2 months ago

Merged in llama-cpp-python