noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model
MIT License

LM Format Enforcer causes hung generation; what sampler settings should be used? #486 #110

Open waterangel91 opened 1 month ago

waterangel91 commented 1 month ago

I have been using LM Format Enforcer for a while for function calling with exllamav2, and once in a while it causes the exllama generation to hang.

Previously I just attributed this to the model not being smart enough for function calling. However, I can now reliably reproduce the issue with a specific prompt and model. The strange thing is that generation without lm enforcer is correct:

The prompt (truncated): `conversation.... coordinator_agent:`, followed by an opening ` ```json ` code fence.

Correct result without using lm enforcer, just normal generation:

{
  "functions_calling": [
    {
      "reason": "The manager_agent has confirmed that they can speak English, which addresses the user's question directly.",
      "name": "QuestionAnswered",
      "arguments": {
        "question_answered": "True"
      }
    }
  ]
}

Could it be due to my sampler setting?

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.1
settings.top_k = 50
settings.top_p = 0.9
settings.min_p = 0.06
settings.token_repetition_penalty = 1.01
settings.temperature_last = False

The model above is WizardLM 8x22B, which is quite good at function calling; as can be seen, the raw response without lm enforcer is correct.

Any advice is appreciated. Currently I suspect it has to do with the sampling settings, since with the same lm enforcer setup I get a correct function_calling response for most generations.

noamgat commented 1 month ago

Interesting. I wonder if it could be related to whether the token filtering in exllamav2 happens before or after the softmax is applied to the logits. If it is applied afterwards, the min_p requirement may prove impossible to satisfy in some cases - if all of the tokens with p > 0.06 after the softmax are marked illegal by LMFE.
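
A minimal sketch of that concern (not exllamav2's actual code), treating min_p as an absolute probability cutoff applied to the unfiltered distribution - the toy logits, threshold and pass set are made up for illustration:

```python
import numpy as np

# Toy vocabulary of 4 tokens; softmax is taken over the unfiltered logits.
logits = np.array([8.0, 7.5, 2.0, 1.0])
probs = np.exp(logits) / np.exp(logits).sum()     # ~[0.62, 0.38, 0.002, 0.0006]

min_p = 0.06
survives_min_p = probs >= min_p                   # only the two most likely tokens

# Hypothetical LMFE pass set: the grammar only allows the two unlikely tokens.
allowed_by_lmfe = np.array([False, False, True, True])

candidates = survives_min_p & allowed_by_lmfe
print(candidates.any())   # False -> no legal candidate if min_p is applied first
```

If the ordering is the other way around (filter first, then softmax and min_p over the surviving tokens), at least one grammar-legal token always remains a candidate.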

By hang, do you mean freeze, or crash?

waterangel91 commented 1 month ago

By hang I mean the terminal just stops responding. I can't kill it with Ctrl+C and have to force-close the terminal. If you have any ideas on what I can try, just let me know.

The strange thing is that without lm enforcer it generates the function call correctly, meaning the needed tokens are also the most likely tokens. However, it hangs when I add lm enforcer.

In addition, the same code works when I use another model (same prompt, same lm enforcer), so I don't know whether it is an issue with the way I defined the lm enforcer or not.

Currently the way I define the lm enforcer is a bit long-winded.

Do you see any problem with how I create the token enforcer?
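
For reference, a minimal sketch of how LMFE is typically wired into exllamav2. The module, class and helper names below (ExLlamaV2TokenEnforcerFilter, build_token_enforcer_tokenizer_data) are assumptions based on the integrations package and should be checked against the installed lm-format-enforcer version:

```python
from lmformatenforcer import JsonSchemaParser
# Assumed import path and names; verify against your lm-format-enforcer version.
from lmformatenforcer.integrations.exllamav2 import (
    ExLlamaV2TokenEnforcerFilter,
    build_token_enforcer_tokenizer_data,
)

schema = {"type": "object"}  # placeholder; substitute the real function-calling schema

# Build the tokenizer data once per model (comparatively expensive), then
# create one filter per request from the schema parser.
tokenizer_data = build_token_enforcer_tokenizer_data(tokenizer)  # exllamav2 tokenizer
lmfe_filter = ExLlamaV2TokenEnforcerFilter(JsonSchemaParser(schema), tokenizer_data)

# The filter is then handed to the generator together with the sampler settings
# (exactly how depends on the exllamav2 version in use), e.g.:
settings.filters = [lmfe_filter]
```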

waterangel91 commented 1 month ago

I streamed the tokens to see what was happening under the hood, and it turns out the "hang" just means generation becomes very, very slow under certain prompts.

I tracked the length of the pass_tokens list and it hit almost 32k possible tokens. Do you have a recommendation for the generation settings in this case?

Also, I thought top_k would kick in and limit it to the top 50 tokens? From sampler.py:

if len(filters) > 0:

    pass_tokens = None
    end_tokens = None
    for f in filters:
        # Each filter reports its allowed tokens (pt) and end tokens (et);
        # the allowed sets are intersected, the end sets are unioned.
        pt, et = f.next()
        if pt is not None: pass_tokens = pt if pass_tokens is None else pass_tokens & pt
        if et is not None: end_tokens = et if end_tokens is None else end_tokens | et

    # Debug additions: materialise the pass set and decode every allowed id.
    pass_tokens_list = list(pass_tokens)
    pass_tokens_list_len = len(pass_tokens_list)       # 31855 possible tokens
    pass_token_text = []
    for tok_id in pass_tokens_list:
        pass_token_text.append(tokenizer.id_to_piece_with_special[tok_id])
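
Note that in the snippet above the filters compute pass_tokens over the whole vocabulary before any sampling parameters are applied, so top_k never shrinks that set; top_k only narrows what is sampled afterwards, while the per-step cost of materialising and decoding ~32k allowed ids stays the same. A minimal sketch of that ordering (illustrative only, not exllamav2's actual code):

```python
import torch

vocab_size = 32_000
logits = torch.randn(vocab_size)

# The filter's allow set is a property of the grammar state, independent of top_k.
pass_tokens = set(range(31_855))                  # ~32k allowed ids, as observed above
mask = torch.full((vocab_size,), float("-inf"))
mask[list(pass_tokens)] = 0.0                     # building this mask is O(|pass_tokens|)

top_k = 50
topk_vals, topk_ids = torch.topk(logits + mask, top_k)  # sampling only sees 50 ids
```
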
thigger commented 4 weeks ago

I think I am experiencing the same bug: TabbyAPI, ExLlamaV2 0.1.4, A6000 48 GB, Command-R, Windows 10, using the json_schema option in the API with temperature 0.0. I've not had issues with other models (Mixtral 8x7B, Phi-3-medium-128k), but generation appears to hang occasionally when using json_schema with Command-R. I can't see exactly what's happening, but CUDA usage on the GPU drops to zero except for a brief blip (up to 10%) every ~50 seconds.

My case is similar in that the model produces valid JSON if json_schema is not enforced. I'm assuming this is likely to be an lm enforcer issue but appreciate that it could be at other levels!

waterangel91 commented 4 weeks ago

You can also refer to the issue I created on the exllamav2 GitHub. I closed that issue, but there seems to be no solution at the moment.