noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model
MIT License

Tokenizer must not contain empty tokens Error #54

Closed unoriginalscreenname closed 8 months ago

unoriginalscreenname commented 9 months ago

Hi, I'm working with this library and llamacpp. I've started to get an error on certain models (not all) when running the build_token_enforcer_tokenizer_data function.

lmformatenforcer\tokenizerprefixtree.py", line 57, in freeze
    assert not any(t == '' for t in all_tokens), "Tokenizer must not contain empty tokens"
AssertionError: Tokenizer must not contain empty tokens

This is happening on several models, but you should be able to reproduce it on NeuralHermes-2.5-Mistral-7B-GGUF.

Do you have any idea why this would be happening?
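
One way to check whether a given model is affected is to look for token ids whose decoded text is the empty string. The following is a minimal sketch, assuming you have already built a mapping from token id to decoded text; the `find_empty_tokens` helper and the toy vocabulary are hypothetical and not part of lm-format-enforcer or llama.cpp:

```python
from typing import Dict, List


def find_empty_tokens(token_id_to_text: Dict[int, str]) -> List[int]:
    """Return the sorted ids of tokens whose decoded text is the empty string."""
    return sorted(tid for tid, text in token_id_to_text.items() if text == "")


# Hypothetical toy vocabulary; a real one would come from decoding every token id.
toy_vocab = {0: "<s>", 1: "", 2: "hello", 3: " world", 4: ""}
empty_ids = find_empty_tokens(toy_vocab)
print(f"{len(empty_ids)} empty token(s): {empty_ids}")
```

If the list is non-empty, the model's tokenizer will trip the assertion shown in the traceback above.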

unoriginalscreenname commented 9 months ago

I have no idea what I'm doing... but I just filtered the empty tokens out in tokenizerprefixtree.py?

def freeze(self) -> None:
    """
    Precalculate token allowlists for all valid combinations of `min_remaining` and `max_len`
    based on the tokens that were added with `add_token()`.
    """
    all_tokens: List[str] = sorted(self.token_str_to_num.keys())
    all_tokens = [token for token in all_tokens if token != '']  # 🤷‍♂️
    assert all_tokens, "Cannot precalculate allowlists for an empty token list"
    assert not any(t == '' for t in all_tokens), "Tokenizer must not contain empty tokens"

noamgat commented 9 months ago

I think I fixed the problem (it came from an assumption from a recent PR). Can you try installing the library using:

pip install git+https://github.com/noamgat/lm-format-enforcer.git@bugfix/tokenizer_with_empty_tokens

And report if the crash is gone?

pinoloricato commented 8 months ago

This change fixes the error above for me; however, it sometimes results in invalid JSON being generated:

Invalid JSON: EOF while parsing a value at line 1 column 0 [type=json_invalid, input_value='', input_type=str]
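
The `input_value=''` in that message means the parser received an empty string, not malformed JSON. Until the root cause is understood, one possible guard is to treat empty output as a separate case before parsing. This is only a sketch using the standard `json` module; `parse_model_output` is a hypothetical helper, not something the library provides:

```python
import json
from typing import Any, Optional


def parse_model_output(raw: str) -> Optional[Any]:
    """Parse JSON text, returning None when the output is empty or whitespace.

    Non-empty but malformed text still raises json.JSONDecodeError.
    """
    if not raw.strip():
        return None  # Nothing was generated; the caller can retry or log it.
    return json.loads(raw)


print(parse_model_output(""))
print(parse_model_output('{"name": "example"}'))
```

Returning `None` for empty output lets the caller tell "the model produced nothing" apart from "the model produced broken JSON", which are likely different problems.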

kaciuk commented 8 months ago

I had the same issue and this fix solved it.

noamgat commented 8 months ago

Released in 0.8.2
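
To confirm that an environment already has the release containing the fix, the installed version can be compared against 0.8.2. This is a rough sketch using only the standard library; the distribution name `lm-format-enforcer` is an assumption about how the package is registered, and `version_tuple`, `has_fix`, and `installed_version` are hypothetical helpers:

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def version_tuple(text: str) -> Tuple[int, ...]:
    """Extract the leading dotted numeric part, e.g. '0.8.2' -> (0, 8, 2)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    return tuple(int(part) for part in match.group(0).split(".")) if match else ()


def has_fix(installed: str, minimum: str = "0.8.2") -> bool:
    """Return True if `installed` is at least `minimum` (numeric parts only)."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_version(dist_name: str = "lm-format-enforcer") -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(dist_name)  # dist_name is an assumed distribution name
    except PackageNotFoundError:
        return None


current = installed_version()
if current is None:
    print("Package not found in this environment")
else:
    print(f"Installed {current}; includes fix: {has_fix(current)}")
```

The comparison ignores pre-release suffixes, so it is only a coarse check.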