turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs

v0.1.3 lm format enforcer broken #485

Closed · waterangel91 closed 3 weeks ago

waterangel91 commented 3 weeks ago

I have been using lm-format-enforcer for a while, and after upgrading to v0.1.3 the integration broke. I use it exactly like the example notebook, where the filters are part of the settings:

def exllamav2_with_format_enforcer(prompt: str, parser: Optional[CharacterLevelParser] = None) -> str:
    if parser is None:
        settings.filters = []
    else:
        settings.filters = [ExLlamaV2TokenEnforcerFilter(parser, tokenizer_data)]
    result = generator.generate_simple(prompt, settings, max_new_tokens, seed = 1234)
    return result[len(prompt):]

The issue happens with both the old generate/streaming generators and the new dynamic generator.

Do I need to pass the filters outside the settings? I noticed that the new generate function has a filters argument.

turboderp commented 3 weeks ago

In the new generator the filters are applied per job rather than as part of the sampling settings. This is for the sake of batching, since the filters are stateful and every job needs its own filter state. It would look like:

outputs = generator.generate(
    prompt = "Once upon a time", 
    filters = [ExLlamaV2PrefixFilter(model, tokenizer, ", in the land of")],
    max_new_tokens = 100,
    add_bos = True
)

Or as a batch:

outputs = generator.generate(
    prompt = [
        "Once upon a time",
        "Once upon a time"
    ],
    filters = [
        [ExLlamaV2PrefixFilter(model, tokenizer, ", in the land of")],
        [ExLlamaV2PrefixFilter(model, tokenizer, " there lived a")]
    ],
    max_new_tokens = 100,
    add_bos = True
)

This also changes how the old generate_simple works. I can see I forgot to remove a line from the old example, so it still applies filters to the settings, but that field is now ignored.

I guess it was an oversight not to add an assertion in generate_simple, but in any case, the change is similar for that function. So you should just be able to do this:

def exllamav2_with_format_enforcer(prompt: str, parser: Optional[CharacterLevelParser] = None) -> str:
    if parser is None:
        filters = [] 
    else:
        filters = [ExLlamaV2TokenEnforcerFilter(parser, tokenizer_data)]

    return generator.generate_simple(
        prompt, 
        settings,
        max_new_tokens,
        seed = 1234,
        filters = filters,  # <- Supply filters here
        completion_only = True,  # <- Same as truncating the result to result[len(prompt):]
    )
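
Since you mentioned streaming: the dynamic generator handles streaming through the same job interface, so per-job filters attach the same way there. A minimal sketch, assuming the job-based API (ExLlamaV2DynamicJob, enqueue, iterate) from the v0.1.x dynamic generator examples:

job = ExLlamaV2DynamicJob(
    input_ids = tokenizer.encode("Once upon a time", add_bos = True),
    filters = [ExLlamaV2PrefixFilter(model, tokenizer, ", in the land of")],
    max_new_tokens = 100,
)
generator.enqueue(job)

# Each call to iterate() advances all pending jobs one step and returns
# a list of per-job results; streamed text chunks arrive as they decode
while generator.num_remaining_jobs():
    for result in generator.iterate():
        print(result.get("text", ""), end = "")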
waterangel91 commented 3 weeks ago

Thank you for your advice. I followed it and managed to resolve the issue. Combining the streaming and non-streaming generators into one is really neat.