guidance-ai / guidance

A guidance language for controlling large language models.

Unconstrained gen() results in ConstraintException in Gemini #881

Open nking-1 opened 3 months ago

nking-1 commented 3 months ago

The bug

When prompting Gemini with a simple gen() call, sometimes the prompt fails with a ConstraintException. This is likely the same issue as reported in https://github.com/guidance-ai/guidance/issues/866

To Reproduce

This can be reproduced using the test_gemini_pro function in the test_googleai.py test file. You might have to run the code several times to reproduce the issue; the retry loop after the script helps with that. Here's my slightly modified script:

import traceback
from guidance import assistant, gen, models, system, user
import os

def test_gemini_pro():

    try:
        vmodel = models.GoogleAI("models/gemini-1.5-flash", api_key=os.getenv("GEMINI_API_KEY"), echo=False)
    except Exception:
        traceback.print_exc()
        raise  # without re-raising, the next line would fail with a NameError

    lm = vmodel

    with user():
        lm += "The economy is crashing!"

    with assistant():
        lm += gen("test1", max_tokens=100)

    with user():
        lm += "What is the best again?"

    with assistant():
        lm += gen("test2", max_tokens=100)

    # second time to make sure cache reuse is okay
    print(lm)
    lm = vmodel

    with user():
        lm += "The economy is crashing!"

    with assistant():
        lm += gen("test1", max_tokens=100)

    with user():
        lm += "What is the best again?"

    with assistant():
        lm += gen("test2", max_tokens=100)

    print(lm)

test_gemini_pro()
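
Since the failure is intermittent, running the repro in a loop instead of the single call above makes it easier to hit (a sketch; catching Exception broadly is enough here, since the ConstraintException propagates out of gen()):

# run the repro repeatedly until the intermittent failure appears
for attempt in range(20):
    try:
        test_gemini_pro()
    except Exception as e:
        print(f"failed on attempt {attempt}: {type(e).__name__}: {e}")
        break
else:
    print("no failure in 20 attempts")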

I've made some progress debugging this, but I don't know enough about Guidance's internals to make the fix myself. Here's what I know so far.

Even at temperature 0 (as in the code above), the model sometimes returns a slightly different string for the same prompt. I think that non-determinism is the root cause of the issue. The second gen() call to the assistant fails:

    with assistant():
        lm += gen("test2", max_tokens=100)

It fails in get_logits() of _grammarless.py here:

                # check if we have already restarted once and so retrying by default is not likely to be helpful
                if restarted:
                    raise self._report_failed_match(prompt)

That calls _report_failed_match(), which tries to determine where the generation diverged from the prompt. Here is a dump of the local variables at the end of that function:

prompt:
b'<|im_start|>user\nThe economy is crashing!<|im_end|><|im_start|>assistant\nIt\'s understandable to be concerned about the economy, especially when you hear phrases like "the economy is crashing." However, it\'s important to remember that:\n\n* **"Crashing" is a very strong word.**  While economic downturns are a normal part of the cycle, a true crash is a rare and severe event. \n* **The news often focuses on negative events.** This can create a sense of panic, even if the overall situation isn\'t as dire as it<|im_end|><|im_start|>user\nWhat is the best again?<|im_end|><|im_start|>assistant\nPlease give me more context!  "Best" is a very subjective term.  What are you looking for the best of?  For example:\n\n* **The best restaurant in town?**\n* **The best way to learn a new language?**\n* **The best movie of all time?**\n\nOnce you tell me what you\'re looking for, I can give you a more helpful answer! \n'

self._data:
b'<|im_start|>user\nThe economy is crashing!<|im_end|><|im_start|>assistant\nIt\'s understandable to be concerned about the economy, especially when you hear phrases like "the economy is crashing." However, it\'s important to remember that:\n\n* **"Crashing" is a very strong word.**  While economic downturns are a normal part of the cycle, a true crash is a rare and severe event. \n* **The news often focuses on negative events.** This can create a sense of panic, even if the overall situation isn\'t as dire as it<|im_end|><|im_start|>user\nWhat is the best again?<|im_end|><|im_start|>assistant\nPlease give me more context!  "What is the best" is a very broad question.  To help me give you a helpful answer, tell me:\n\n* **What are you looking for the best of?**  (e.g., the best restaurant, the best movie, the best way to learn a new language)\n* **What are your criteria for "best"?** (e.g., cheapest, most delicious, most educational)\n* **What'

leftover:
b'Best" is a very subjective term.  What are you looking for the best of?  For example:\n\n* **The best restaurant in town?**\n* **The best way to learn a new language?**\n* **The best movie of all time?**\n\nOnce you tell me what you\'re looking for, I can give you a more helpful answer! \n'

data_after_prompt:
b'What is the best" is a very broad questi...'

prompt_tail:
b'...ssistant\nPlease give me more context!  "'
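
Reading those dumps: self._data is the text the engine has buffered from the endpoint on this run, and the prompt agrees with it up through 'Please give me more context!  "' and diverges immediately after. A rough sketch of the kind of prefix comparison _report_failed_match() appears to perform (a hypothetical helper, not the actual guidance code; the names are mine):

def find_divergence(prompt: bytes, buffered: bytes) -> dict:
    # Hypothetical illustration of the prefix comparison described above.
    match_len = 0
    for p, d in zip(prompt, buffered):
        if p != d:
            break
        match_len += 1
    return {
        # tail of the matched prefix (cf. prompt_tail above)
        "prompt_tail": prompt[max(0, match_len - 40):match_len],
        # what the buffer holds after the match (cf. data_after_prompt)
        "data_after_prompt": buffered[match_len:match_len + 40],
        # what the prompt expected instead (cf. leftover)
        "leftover": prompt[match_len:],
    }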

System info (please complete the following information):

riedgar-ms commented 3 months ago

This looks suspiciously like other failures I've seen with remote endpoints. I added more logging in #879 to try to track down exactly where the grammar failure was occurring, and the problem seems to have disappeared. This makes me suspect a race condition on GrammarlessEngine, but if so, I've not been able to track it down. The 'obvious' place, where the thread handling the actual HTTPS call is restarted, looks OK to me.
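
To be concrete about what such a race would look like: the engine effectively has one thread appending streamed bytes to a shared buffer while another reads that buffer to compare it against the prompt. A deliberately simplified illustration (not the actual engine code):

import threading

class StreamBuffer:
    # Simplified stand-in for the engine's shared state.
    def __init__(self):
        self._data = b""
        self._lock = threading.Lock()

    def append(self, chunk: bytes):
        # writer: the thread consuming the HTTPS stream
        with self._lock:
            self._data += chunk

    def snapshot(self) -> bytes:
        # reader: the comparison in get_logits(); if any access path
        # skips the lock, the reader can see a half-updated buffer
        with self._lock:
            return self._data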

nking-1 commented 3 months ago

After some more debugging, I found that I can reproduce this much more often when using a higher temperature:

lm += gen("test1", max_tokens=100, temperature=0.8)

With this change, the bug reproduces with only one call to gen().
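
That fits the non-determinism theory: the higher the temperature, the more the regenerated text diverges from whatever the engine has cached. The variance is easy to observe outside guidance with the google-generativeai client directly (a sketch; the prompt is arbitrary):

import os
import google.generativeai as genai

genai.configure(api_key=os.getenv("GEMINI_API_KEY"))
model = genai.GenerativeModel("models/gemini-1.5-flash")

# two identical requests; at temperature 0.8 the replies almost always
# differ, and even at 0 they occasionally do
replies = [
    model.generate_content(
        "The economy is crashing!",
        generation_config={"temperature": 0.8, "max_output_tokens": 100},
    ).text
    for _ in range(2)
]
print("identical:", replies[0] == replies[1])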

nking-1 commented 3 months ago

The bug also seems to reproduce more often with a longer prompt and a higher max_tokens value.