microsoft / llguidance

Low-level Guidance Parser
MIT License

Tokenization inconsistency between prompt and fast-forward at beginning of grammar #4

Closed: hudson-ai closed this issue 3 weeks ago

hudson-ai commented 1 month ago

(copy this into your test_ll.py)

import json

import llguidance
from guidance import gen  # the gen() grammar helper comes from the guidance package

# PhiTokenizer is the tokenizer helper already defined in llguidance's test_ll.py

class TestProcessPrompt:
    """
        Test 1:
            lm + ("pi = " + gen("number", regex="[0-9]+"))
        Test 2:
            (lm + "pi = ") + gen("number", regex="[0-9]+")

        Want to assert that both tests have the same tokens in the prompt.
    """
    t = [2930, 353, 29871]
    s = "pi = "
    g = gen("number", regex="[0-9]+")

    def test_sanity(self):
        # not a real test, just to make sure the test is set up correctly
        assert PhiTokenizer.instance().tokenize_str(self.s) == self.t

    def test_1(self):
        """
        lm + ("pi = " + gen("number", regex="[0-9]+"))
        """
        grm = self.s + self.g
        interp = llguidance.LLInterpreter(
            PhiTokenizer.ll_tokenizer(), json.dumps(grm.ll_serialize())
        )
        t = interp.process_prompt([])
        assert t == self.t

    def test_2(self):
        """
        (lm + "pi = ") + gen("number", regex="[0-9]+")
        """
        grm = self.g  # "pi = " is fed in via process_prompt below, not via the grammar
        interp = llguidance.LLInterpreter(
            PhiTokenizer.ll_tokenizer(), json.dumps(grm.ll_serialize())
        )
        t = interp.process_prompt(
            PhiTokenizer.instance().tokenize_str(self.s)
        )
        assert t == self.t
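To run just this class, assuming the repo's usual pytest setup: pytest test_ll.py -k TestProcessPrompt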

If you can't get this to pass, I believe it's acceptable to instead make it pass with

t = [1, 2930, 353, 29871]  # token 1 is BOS (<s>)
s = "<s>pi = "
mmoskal commented 1 month ago

So this passes if, instead of process_prompt([]), you do process_prompt([29871]) (29871 is the " " token). Generally, if the model has a BOS you should pass [BOS], not []; phi-3 is kind of weird in that all strings start with a space, so I guess you would pass a plain space as the BOS. This is not a problem if the prompt is non-empty.
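A concrete sketch of what I mean, to drop into TestProcessPrompt above (hypothetical test name; same setup as test_1):

def test_1_space_seeded(self):
    grm = self.s + self.g
    interp = llguidance.LLInterpreter(
        PhiTokenizer.ll_tokenizer(), json.dumps(grm.ll_serialize())
    )
    # seed process_prompt with the " " token instead of []; for a model
    # with a real BOS, pass [bos_id] here instead
    t = interp.process_prompt([29871])
    assert t == self.t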

hudson-ai commented 1 month ago

Hmm, I see... I'll have to revisit this. These tests were sort of a proxy for the fact that currently

lm + ("pi = " + gen("number", regex="[0-9]+"))

and

(lm + "pi = ") + gen("number", regex="[0-9]+")

have very different behaviors because the model sees different tokenizations.
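For concreteness, a sketch of the kind of divergence I mean (token ids are whatever the Phi-3 vocab gives; assumes the PhiTokenizer helper from test_ll.py):

tok = PhiTokenizer.instance()
# Tokenizing the prompt and the generation together can merge the trailing
# space into the first generated token, while tokenizing them separately
# keeps the space as its own token, so the model is conditioned differently:
print(tok.tokenize_str("pi = 3"))                         # " 3" may come out as one token
print(tok.tokenize_str("pi = ") + tok.tokenize_str("3"))  # " " stays a separate token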

Should be able to dig in a bit tomorrow

hudson-ai commented 3 weeks ago

I may still need to rejigger this function a bit to pass the BOS token to process_prompt, but I was able to achieve the behavior I expected by calling the tokenizer's recode function: https://github.com/guidance-ai/guidance/pull/951/commits/cf181774423315ca861cf67d0e65a3ad8342a741
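Roughly the idea (a sketch, not the exact PR code; assumes a guidance Tokenizer instance tokenizer whose recode(tokens) decodes to bytes and re-encodes canonically, plus an interp as in the tests above):

prompt_tokens = tokenizer.encode(b"pi = ")   # caller-supplied tokenization
canonical = tokenizer.recode(prompt_tokens)  # normalize the token boundaries
t = interp.process_prompt(canonical)         # now consistent with fast-forward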

See the proper end-to-end test here: https://github.com/hudson-ai/guidance/blob/362278a20bc021dba3c94e0d156f13d3d32b77ec/tests/model_integration/test_model.py#L73