guidance-ai / guidance

A guidance language for controlling large language models.

Mistral based models loaded with llama.cpp not working for complex tasks #454

Open gmonair opened 9 months ago

gmonair commented 9 months ago

The bug
When using a Mistral-7B based model, some basic examples work, while the more advanced ones error out. Using a Llama based model works on all examples.

To Reproduce

This snippet works as expected:

from guidance import models, gen, select

model = "../../models/openhermes-2.5-mistral-7b-16k.Q4_K_M.gguf" # TheBloke/openhermes-2.5-mistral-7b-16k

llama2 = models.LlamaCpp(model, n_gpu_layers=-1)

# llama2 is not modified, `lm` is a copy of `llama2` with 'This is a prompt' appended to its state
# lm = llama2 + 'This is a prompt'

lm = llama2 + 'Question: Luke has ten balls. He gives three to his brother.\n'
lm += 'How many balls does he have left?\n'
lm += 'Answer: ' + gen(regex='\d+')

Expected output: the model correctly answers 7

Question: Luke has ten balls. He gives three to his brother.
How many balls does he have left?
Answer: 7

The following two snippets don't work:

from guidance import capture, Tool

import guidance
from guidance import one_or_more, select, zero_or_more
from guidance import models, gen, select

# model = "../../models/dolphin-llama2-7b.Q4_K_M.gguf"
model = "../../models/openhermes-2.5-mistral-7b-16k.Q4_K_M.gguf"

llama2 = models.LlamaCpp(model, n_gpu_layers=100)
# stateless=True indicates this function does not depend on LLM generations
@guidance(stateless=True)
def number(lm):
    n = one_or_more(select(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']))
    # Allow for negative or positive numbers
    return lm + select(['-' + n, n])

@guidance(stateless=True)
def operator(lm):
    return lm + select(['+' , '*', '**', '/', '-'])

@guidance(stateless=True)
def expression(lm):
    # Either
    # 1. A number (terminal)
    # 2. two expressions with an operator and optional whitespace
    # 3. An expression with parentheses around it
    return lm + select([
        number(),
        expression() + zero_or_more(' ') +  operator() + zero_or_more(' ') +  expression(),
        '(' + expression() + ')'
    ])

@guidance(stateless=True)
def calculator_call(lm):
    # capture just 'names' the expression, to be saved in the LM state
    return lm + 'calculator(' + capture(expression(), 'tool_args') + ')'

@guidance
def calculator(lm):
    expression = lm['tool_args']
    # You typically don't want to run eval directly for safety reasons
    # Here we are guaranteed to only have mathematical expressions
    lm += f' = {eval(expression)}'
    return lm
calculator_tool = Tool(calculator_call(), calculator)
lm = llama2 + 'Here are five expressions:\ncalculator(3 *3) = 33\ncalculator(2 + 1 * 3) = 5\n'
lm += gen(max_tokens=30, tools=[calculator_tool], stop='\n\n')

Error:

UnboundLocalError: cannot access local variable 'sampled_token' where it is not associated with a value

The second snippet:
import time
import guidance

from guidance import models, gen, select

# model = "../../models/dolphin-llama2-7b.Q4_K_M.gguf"
model = "../../models/openhermes-2.5-mistral-7b-16k.Q4_K_M.gguf"

llama2 = models.LlamaCpp(model, n_gpu_layers=100)

@guidance
def character_maker(lm, id, description, valid_weapons):
    lm += f"""\
    The following is a character profile for an RPG game in JSON format.
    ```json
    {{
        "id": "{id}",
        "description": "{description}",
        "name": "{gen('name', stop='"')}",
        "age": {gen('age', regex='[0-9]+', stop=',')},
        "armor": "{select(options=['leather', 'chainmail', 'plate'], name='armor')}",
        "weapon": "{select(options=valid_weapons, name='weapon')}",
        "class": "{gen('class', stop='"')}",
        "mantra": "{gen('mantra', stop='"')}",
        "strength": {gen('strength', regex='[0-9]+', stop=',')},
        "items": ["{gen('item', list_append=True, stop='"')}", "{gen('item', list_append=True, stop='"')}", "{gen('item', list_append=True, stop='"')}"]
    }}```"""
    return lm
a = time.time()
lm = llama2 + character_maker(1, 'A nimble fighter', ['axe', 'sword', 'bow'])
time.time() - a

Error:

AssertionError: We can't consume any more tokens, but we are not yet done! Perhaps your model's token set is incomplete?

Both snippets work when using the llama2-based model:

Here are five expressions:
calculator(3 *3) = 33
calculator(2 + 1 * 3) = 5
calculator(3 * 3) = 9
calculator(3 * 2) = 6
calculator(3 * 1) = 3

and

The following is a character profile for an RPG game in JSON format.
```json
{
    "id": "1",
    "description": "A nimble fighter",
    "name": "Torin",
    "age": 25,
    "armor": "chainmail",
    "weapon": "bow",
    "class": "rogue",
    "mantra": "Cunning",
    "strength": 10,
    "items": ["chainmail armor", "bow", "arrows"]
}```

System info (please complete the following information):

hanszahm commented 9 months ago

I'm running into the same error using LeoLM/leo-hessianai-13b-chat - not using llama.cpp though. AssertionError: We can't consume any more tokens, but we are not yet done! Perhaps your model's token set is incomplete? Similar overall behaviour: some (simple) settings work, others fail. With the base Llama I don't get this error. At first I thought it was due to the mismatch of tokenizer vocab size and model vocab size. I did not get to test it thoroughly, but certain input lengths or characters seem to trigger the error.
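
A quick way to check for such a mismatch (a sketch, assuming a Hugging Face model; the model name below is simply the one mentioned above):

from transformers import AutoConfig, AutoTokenizer

name = "LeoLM/leo-hessianai-13b-chat"
tokenizer = AutoTokenizer.from_pretrained(name)
config = AutoConfig.from_pretrained(name)

# compare the tokenizer's vocabulary with the model's declared vocab size
print("tokenizer vocab size:", len(tokenizer))
print("model config vocab_size:", config.vocab_size)

If the two numbers differ, that is the kind of tokenizer/model mismatch described above.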

ASmallPotato commented 9 months ago

I have a similar problem; in my case, I simply can't use stop or stop_regex, so even the "simple" example of

lm = llama2 + 'Problem: Luke has a hundred and six balls. He then loses thirty six.\n'
lm += 'Equivalent arithmetic expression: ' + gen(stop='\n') + '\n'

gives the error

AssertionError: We can't consume any more tokens, but we are not yet done! Perhaps your model's token set is incomplete?

While just using max_tokens works

lm = llama2 + 'Problem: Luke has a hundred and six balls. He then loses thirty six.\n'
lm += 'Equivalent arithmetic expression: ' + gen(max_tokens=15) + '\n'

returns normal output

Problem: Luke has a hundred and six balls. He then loses thirty six.
Equivalent arithmetic expression: 106 - 36

Solution: Luke has

I tried to debug the issue by inserting some print statements, but the error site was too complicated and I couldn't really follow the code. However, I did add a print statement, and when the error is thrown, the token_pos was 0 and the sampled_token was repeating the last token of my prompt, which would be removed when called with max_tokens (I guess it was removed by token healing?); see the output in the details below.
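
As a self-contained way to see which token the prompt ends on (the one token healing would back up over), the prompt can be tokenized directly with llama-cpp-python; a sketch using the model path from this session:

from llama_cpp import Llama

# vocab_only loads only the tokenizer/vocabulary, not the model weights
llm = Llama(model_path='/mnt/models/mistral-7b-openorca.Q5_K_M.gguf', vocab_only=True)

prompt = ('Problem: Luke has a hundred and six balls. He then loses thirty six.\n'
          'Equivalent arithmetic expression: ')
tokens = llm.tokenize(prompt.encode('utf-8'))

# show the trailing token ids and the bytes they decode to
for t in tokens[-3:]:
    print(t, llm.detokenize([t]))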

Details

I added the highlighted line in guidance/models/_local.py

www@8bf11758c665:/var/www/app$ python
Python 3.9.18 (main, Nov  1 2023, 14:31:33)
[GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from guidance import models, gen, select
>>> llama2 = models.LlamaCpp('/mnt/models/mistral-7b-openorca.Q5_K_M.gguf', n_ctx=(1024*4))
>>> lm = llama2 + 'Problem: Luke has a hundred and six balls. He then loses thirty six.\n'
>>> lm += 'Equivalent arithmetic expression: ' + gen(max_tokens=15) + '\n'
None b'Equ'
None b'ivalent'
None b' ar'
None b'ith'
None b'metic'
None b' expression'
None b':'
1 b' '
1 b'1'
1 b'0'
1 b'6'
2 b' -'
1 b' '
1 b'3'
1 b'6'
1 b'\n'
1 b'\n'
1 b'S'
7 b'olution'
1 b':'
5 b' Luke'
4 b' has'
4 b'\n'
0 b'\n'
>>> print(lm)
Problem: Luke has a hundred and six balls. He then loses thirty six.
Equivalent arithmetic expression: 106 - 36

Solution: Luke has

>>>
>>> lm = llama2 + 'Problem: Luke has a hundred and six balls. He then loses thirty six.\n'
>>> lm += 'Equivalent arithmetic expression: ' + gen(stop='\n') + '\n'
None b'Equ'
None b'ivalent'
None b' ar'
None b'ith'
None b'metic'
None b' expression'
None b':'
None b' '
0 b' '
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/local/lib/python3.9/site-packages/guidance/models/_model.py", line 242, in __add__
    out = lm._run_stateless(value)
  File "/usr/local/lib/python3.9/site-packages/guidance/models/_model.py", line 382, in _run_stateless
    for new_bytes, is_generated, new_bytes_log_prob, capture_groups, capture_group_log_probs, new_token_count in gen_obj:
  File "/usr/local/lib/python3.9/site-packages/guidance/models/_local.py", line 375, in __call__
    assert parser.matched(), "We can't consume any more tokens, but we are not yet done! Perhaps your model's token set is incomplete?"
AssertionError: We can't consume any more tokens, but we are not yet done! Perhaps your model's token set is incomplete?
>>>
>>> lm = llama2 + 'Problem: Luke has a hundred and six balls. He then loses thirty six.\n'
>>> # This time without the colon and trailing space
>>> lm += 'Equivalent arithmetic expression' + gen(stop='\n') + '\n'
None b'Equ'
None b'ivalent'
None b' ar'
None b'ith'
None b'metic'
None b' expression'
0 b' expression'
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/local/lib/python3.9/site-packages/guidance/models/_model.py", line 242, in __add__
    out = lm._run_stateless(value)
  File "/usr/local/lib/python3.9/site-packages/guidance/models/_model.py", line 382, in _run_stateless
    for new_bytes, is_generated, new_bytes_log_prob, capture_groups, capture_group_log_probs, new_token_count in gen_obj:
  File "/usr/local/lib/python3.9/site-packages/guidance/models/_local.py", line 375, in __call__
    assert parser.matched(), "We can't consume any more tokens, but we are not yet done! Perhaps your model's token set is incomplete?"
AssertionError: We can't consume any more tokens, but we are not yet done! Perhaps your model's token set is incomplete?

versions:

guidance==0.1.4
llama_cpp_python==0.2.19

ASmallPotato commented 9 months ago

@gmonair Sorry to bother you, but after some digging it started working for me. Can you try the following and see if it works for you?

Basically, from what I tried, models that don't use '<s>' for the BOS token and '</s>' for the EOS token will fail, because this was hardcoded in guidance. Coincidentally, this commit changed the related code, and it solved the problem for me. Alternatively, you can try using "neural-chat-7b-v3-1.Q5_K_M.gguf", which uses the "correct" BOS and EOS tokens. If that also works for you, then great. (FYI, if you load the model directly with llama.cpp or llama-cpp-python, it will print the model's information, including the BOS and EOS tokens used.)
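
A quick way to check a GGUF model's BOS/EOS tokens without scanning the full load log (a sketch using llama-cpp-python; the model path is a placeholder):

from llama_cpp import Llama

# vocab_only loads just the vocabulary and metadata, not the weights
llm = Llama(model_path='path/to/model.gguf', vocab_only=True)

# token ids and the bytes they decode to
print('BOS:', llm.token_bos(), llm.detokenize([llm.token_bos()]))
print('EOS:', llm.token_eos(), llm.detokenize([llm.token_eos()]))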

However, since @hanszahm isn't using llama.cpp, it probably doesn't directly solve your/their problem. It might be a similar issue, though.

freckletonj commented 8 months ago

I can confirm: without llama.cpp, using Mixtral Instruct GPTQ, I get:

Exception: We can't consume any more tokens, but we are not yet done! Perhaps your model's token set is incomplete? This happened after the prompt: ...

In my use case, this seems related to non-English characters that are valid Unicode but are not in Mixtral's tokenizer.
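
One way to check whether a given character survives the tokenizer is a simple encode/decode round trip (a sketch assuming a Hugging Face tokenizer; the model name is only an example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

def round_trips(text: str) -> bool:
    # encode without special tokens, then decode and compare with the input
    ids = tokenizer.encode(text, add_special_tokens=False)
    # some tokenizers add a leading space on decode, so strip before comparing
    return tokenizer.decode(ids).strip() == text.strip()

for ch in ['a', 'é', '日', '\u200b']:
    print(repr(ch), round_trips(ch))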