turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

LM Enforcer causes hung generation, and what is the right Sampler setting #486

Closed remichu-ai closed 3 months ago

remichu-ai commented 3 months ago

I have been using LM Format Enforcer for a while for function calling with exllamav2, and once in a while it will cause the exllama generation to hang.

Previously I just attributed this to the model not being smart enough for function calling. However, I can now reliably reproduce the issue with a specific prompt and model. The strange thing is that the generation without lm enforcer is correct.

The prompt (truncated) ends with:

    conversation.... coordinator_agent:

    ```json

Correct result without using lm enforcer, just normal generation:

{
  "functions_calling": [
    {
      "reason": "The manager_agent has confirmed that they can speak English, which addresses the user's question directly.",
      "name": "QuestionAnswered",
      "arguments": {
        "question_answered": "True"
      }
    }
  ]
}

Could it be due to my sampler settings?

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.1
settings.top_k = 50
settings.top_p = 0.9
settings.min_p = 0.06
settings.token_repetition_penalty = 1.01
settings.temperature_last = False

The model above is WizardLM 8x22B, which is quite good at function calling; as can be seen, the raw response without lm enforcer is correct.

Any advice is appreciated. Currently I suspect it has to do with the sampling settings, since in most generations I get a correct function_calling response with the same lm enforcer setup.

turboderp commented 3 months ago

The settings look fine. Do you have a more complete example, including the schema and how you're initializing the filter?

Also, are you using the latest version of lm-format-enforcer? Perhaps you could adapt this example with your prompt, schema and settings?

remichu-ai commented 3 months ago

Thanks for responding. Yes, I am using the latest lm_enforcer; I just updated it today.

I will give your example a try tomorrow and update again.

My full prompt is below:

<s>
<INST>user:
You are coodrinator_agent and is part of a team of agent. Your strengths and background:
You are the best at knowing if the answer or necessary info for human's question is already presented in the conversation. In that case select TaskDoneTool or QuestionAnswered.
If the answer is unavailable and info is insufficient yet, you are an expert to choose the next agent to continue the task.

Each agent have different capabilities and strengths with the collective goal of 
help human/ user with their task.

human to manager_agent:
Can you speak English?

system_instruction to manager_agent:
Think through human's question and plan out how can leverage on other agents to gather info or complete some tasks.
Following is the list of agents/ human that you have access to:
web_scraper_agent: An expert with webscraping, you help user to scrape website to extract content. You do not have search capability and can only scrape web pages if given URL
web_search_agent: An expert with web search, you help user to search for URL that contains information that user need. You do not have scrape capability, you can only search and find relevant web URL and short summary of its content.
analysis_agent: An expert analyst that can synthesize information to answer user question.

If there is no need to use any of the agents because you already have the final answer for human, then you can just answer.

manager_agent:
As the manager_agent, I have reviewed the capabilities of my team members and determined that for the task at hand—responding to whether I can speak English—there is no need to delegate this task to any of the specialized agents. My role as the manager requires me to possess strong communication skills in English, which includes understanding questions posed by users and providing clear and concise answers.

Given this context, I am fully equipped to address your question directly without additional support from the web_scraper_agent, web_search_agent, or analysis_agent. Therefore, I can confirm:

Yes, I can speak English.

Based on the conversation so far, select an agent to ask them some task.Following is the list of agents/ human and their strengths:
web_scraper_agent: An expert with webscraping, you help user to scrape website to extract content. You do not have search capability and can only scrape web pages if given URL
web_search_agent: An expert with web search, you help user to search for URL that contains information that user need. You do not have scrape capability, you can only search and find relevant web URL and short summary of its content.
analysis_agent: An expert analyst that can synthesize information to answer user question.

</INST>

Below are the functions available to you to use.
If you need to use multiple functions, please provide it in chronological order.
If user or system prompt request certain restriction (examples: can only use one tool, must use a specific etc; then please respect the instruction.)

[AVAILABLE_TOOLS]
TaskDoneTool:
{
  "title": "TaskDoneTool",
  "description": "Pick this tool if sufficient info required to answer human's question is available in the conversation",
  "type": "object",
  "properties": {
    "task_done": {
      "title": "Task Done",
      "description": "sufficient info gathered",
      "enum": [
        "True"
      ],
      "type": "string"
    }
  },
  "required": [
    "task_done"
  ]
}
---
QuestionAnswered:
{
  "title": "QuestionAnswered",
  "description": "Pick this tool if any agent had provided decent answer to the human's question",
  "type": "object",
  "properties": {
    "question_answered": {
      "title": "Question Answered",
      "description": "An agent had provided answer to human's question",
      "enum": [
        "True"
      ],
      "type": "string"
    }
  },
  "required": [
    "question_answered"
  ]
}
---
NextAgentTool:
{
  "title": "NextAgentTool",
  "description": "Select an agent and ask them a question to further",
  "type": "object",
  "properties": {
    "agent_name": {
      "title": "Agent Name",
      "description": "name of an agent to be asked with question",
      "enum": [
        "web_scraper_agent",
        "web_search_agent",
        "analysis_agent"
      ],
      "type": "string"
    },
    "question": {
      "title": "Question",
      "description": "Question to ask the selected agent",
      "type": "string"
    }
  },
  "required": [
    "agent_name",
    "question"
  ]
}
---

[/AVAILABLE_TOOLS]

IMPORTANT: If you use tool, please answer using the following schema as an dictionary where all the functions to be called and its argument is an array under "functions_calling" key.
Each item of the array consist of the function name and the arguments and reason why you select this tool. Provide the answer in json format, any explanation you want to provide, provide it in the "reason" field in the "functions_calling".

Example of answer with Tool usage:
{
  "functions_calling": [
    {
      "reason": "user want to know the weather in Boston"
      "name": "get_current_weather",
      "arguments": {
        "location": "Boston, MA",
        "unit": "fahrenheit"
      },
    },    
    {
      "reason": "user ask about iPhone price"
      "name": "google_search",
      "arguments": {
        "search_string": "iPhone Price",
        "number_of_research": "3",
      },
    },
  ]
}
End of Example of answer with Tool usage.

</s>

coordinator_agent: 

```json

Here is the same conversation flow using Mixtral-8x7B instead of WizardLM-8x22B, and the tool was picked correctly. The code in both cases is the same; I just swapped out the model.

<s>
<INST>user:
You are coodrinator_agent and is part of a team of agent. Your strengths and background:
You are the best at knowing if the answer or necessary info for human's question is already presented in the conversation. In that case select TaskDoneTool or QuestionAnswered.
If the answer is unavailable and info is insufficient yet, you are an expert to choose the next agent to continue the task.

Each agent have different capabilities and strengths with the collective goal of 
help human/ user with their task.

human to manager_agent:
Can you speak English?

system_instruction to manager_agent:
Think through human's question and plan out how can leverage on other agents to gather info or complete some tasks.
Following is the list of agents/ human that you have access to:
web_scraper_agent: An expert with webscraping, you help user to scrape website to extract content. You do not have search capability and can only scrape web pages if given URL
web_search_agent: An expert with web search, you help user to search for URL that contains information that user need. You do not have scrape capability, you can only search and find relevant web URL and short summary of its content.
analysis_agent: An expert analyst that can synthesize information to answer user question.

If there is no need to use any of the agents because you already have the final answer for human, then you can just answer.

manager_agent:
Yes, I can speak English. I am the manager of all the AI agents and I will assess and coordinate their works to achieve the task or question that you ask. If I have doubt, I will clarify with you. If I still need to wait for other agents to perform tasks before I can provide the answer, I will let you know to wait. I do not have access to real-time information, so if I think my information is out of date, I will leverage other agents to gather necessary information. Each agent has different capabilities and strengths with the collective goal of helping you with your task. In this case, I do not need to use any of the other agents to answer your question, so I can confirm that I can indeed speak English.

Based on the conversation so far, select an agent to ask them some task.Following is the list of agents/ human and their strengths:
web_scraper_agent: An expert with webscraping, you help user to scrape website to extract content. You do not have search capability and can only scrape web pages if given URL
web_search_agent: An expert with web search, you help user to search for URL that contains information that user need. You do not have scrape capability, you can only search and find relevant web URL and short summary of its content.
analysis_agent: An expert analyst that can synthesize information to answer user question.

</INST>

Below are the functions available to you to use.
If you need to use multiple functions, please provide it in chronological order.
If user or system prompt request certain restriction (examples: can only use one tool, must use a specific etc; then please respect the instruction.)

TaskDoneTool:
{
  "title": "TaskDoneTool",
  "description": "Pick this tool if sufficient info required to answer human's question is available in the conversation",
  "type": "object",
  "properties": {
    "task_done": {
      "title": "Task Done",
      "description": "sufficient info gathered",
      "enum": [
        "True"
      ],
      "type": "string"
    }
  },
  "required": [
    "task_done"
  ]
}
---
QuestionAnswered:
{
  "title": "QuestionAnswered",
  "description": "Pick this tool if any agent had provided decent answer to the human's question",
  "type": "object",
  "properties": {
    "question_answered": {
      "title": "Question Answered",
      "description": "An agent had provided answer to human's question",
      "enum": [
        "True"
      ],
      "type": "string"
    }
  },
  "required": [
    "question_answered"
  ]
}
---
NextAgentTool:
{
  "title": "NextAgentTool",
  "description": "Select an agent and ask them a question to further",
  "type": "object",
  "properties": {
    "agent_name": {
      "title": "Agent Name",
      "description": "name of an agent to be asked with question",
      "enum": [
        "web_scraper_agent",
        "web_search_agent",
        "analysis_agent"
      ],
      "type": "string"
    },
    "question": {
      "title": "Question",
      "description": "Question to ask the selected agent",
      "type": "string"
    }
  },
  "required": [
    "agent_name",
    "question"
  ]
}
---

IMPORTANT: If you use tool, please answer using the following schema as an dictionary where all the functions to be called and its argument is an array under "functions_calling" key.
Each item of the array consist of the function name and the arguments and reason why you select this tool. Provide the answer in json format, any explanation you want to provide, provide it in the "reason" field in the "functions_calling".

Example of answer with Tool usage:
{
  "functions_calling": [
    {
      "reason": "user want to know the weather in Boston"
      "name": "get_current_weather",
      "arguments": {
        "location": "Boston, MA",
        "unit": "fahrenheit"
      },
    },    
    {
      "reason": "user ask about iPhone price"
      "name": "google_search",
      "arguments": {
        "search_string": "iPhone Price",
        "number_of_research": "3",
      },
    },
  ]
}
End of Example of answer with Tool usage.

</s>

coordinator_agent: 

```json

DEBUG:    | ----------------------temperature---------
0.1
DEBUG:    | ----------------------LLM Raw Response---------------
{
  "functions_calling": [
    {
      "name": "QuestionAnswered",
      "arguments": {
        "question_answered": "True"
      },
      "reason": "I can answer the user's question directly without needing to use any other agents."
    }
  ]
}

My lm-enforcer creation is a bit long-winded, as it is dynamically created from the list of tools that comes in with the API call (see further below). But as you can see, the same lm enforcer works with Mixtral, and the tool set (i.e. the schema the lm enforcer is created from) is the same.

What dumbfounds me is that I get the correct response for the WizardLM case above without using the lm enforcer constraint at all. I am thinking of dumping the top-k tokens to see what they are.


def create_function_models(functions: Dict[str, Type[BaseModel]]) -> List[Type[BaseModel]]:
    """ create a list of pydantic models for the function schemas that passed in via OpenAI request call"""
    function_model_list: List[Type[BaseModel]] = []
    for func_name, arg_model in functions.items():
        # Dynamic Pydantic model creation
        NewModel = create_model(
            func_name.title(),
            reason=(str, ...),  # Adding the 'reason' field
            name=(Literal[func_name], ...), # ... mean required
            arguments=(arg_model, ...),
            __config__=type('Config', (BaseModel.Config,), {'arbitrary_types_allowed': True})  # Nested Config class
        )
        function_model_list.append(NewModel)
    return function_model_list

class ToolCalling(BaseModel):
    """ The format to call one or multiple tools """
    functions_calling: List[Union[tuple(tool_combined_pydantic)]] = Field(
        description='the list of functions to call in chronological order',
        default=[]
    )

def replace_refs_with_definitions_v1(schema: Dict[str, Any], definitions: Dict[str, Any] = None) -> Dict[str, Any]:
    """
    Recursively replace all `$ref` in schema with definitions.
    """
    if definitions is None:
        definitions = schema.get('definitions', {})

    if isinstance(schema, dict):
        if '$ref' in schema:
            ref_path = schema['$ref']
            assert ref_path.startswith('#/definitions/'), f"Unhandled $ref format: {ref_path}"
            ref_name = ref_path.split('/')[-1]
            # Proceed to replace with the actual schema definition
            # Important: We make a deep copy to avoid unintentional modifications
            return Tools.replace_refs_with_definitions_v1(definitions[ref_name], definitions)
        else:
            # Recursively replace in all dictionary items
            return {k: Tools.replace_refs_with_definitions_v1(v, definitions) for k, v in schema.items()}
    elif isinstance(schema, list):
        # Recursively replace in all list items
        return [Tools.replace_refs_with_definitions_v1(item, definitions) for item in schema]
    return schema

answer_format_schema = tool_handler.replace_refs_with_definitions_v1(ToolCalling.schema())   # Tool calling is forced
lm_enforcer_parser = JsonSchemaParser(answer_format_schema)

response, gen_stats = self.generate(
    prompt,
    temperature=query.temperature,
    lm_enforcer_parser=lm_enforcer_parser,
)

def generate(
        self,
        prompt: str,
        temperature: float = 0.01,
        lm_enforcer_parser: TokenEnforcerTokenizerData = None,
        **kwargs,
) -> (str, GenerationStats):

    logger.info("----------------------Prompt---------------\n" + prompt)
    logger.debug("----------------------temperature---------\n" + str(temperature))

    # get generation setting
    settings = self._get_exllama_gen_settings(temperature)

    # convert prompt to token id
    input_ids = self.tokenizer.encode(prompt)
    self.validate_token_length(len(input_ids[0]))

    # format enforcer
    filters = None
    if lm_enforcer_parser:
        filters = [ExLlamaV2TokenEnforcerFilter(
            lm_enforcer_parser,
            self.pipeline.lm_enforcer_tokenizer_data)
        ]

    job = ExLlamaV2DynamicJob(
        input_ids=input_ids,
        max_new_tokens=self.max_tokens-len(input_ids[0]),
        gen_settings=settings,
        stop_conditions=self.eos_token_id if self.eos_token_id else None,
        decode_special_tokens=True,
        filters=filters,
    )
    self.pipeline.generator.enqueue(job)

    generate_text = ""
    eos = False
    while not eos:

        # Run one iteration of the generator. Returns a list of results
        results = self.pipeline.generator.iterate()

        for result in results:

            # If we enqueue multiple jobs, an iteration might produce results for any (or all) of them. We could direct
            # outputs to multiple clients here, using whatever dispatch mechanism, but in this example there will only be
            # outputs pertaining to the single job started above, and it will all go straight to the console.
            assert result["job"] == job

            # Prefilling/ingesting the prompt may happen over multiple iterations, during which the result will have
            # a "stage" value of "prefill". We can ignore those results and only use the "streaming" results that will
            # contain the actual output.
            if result["stage"] == "streaming":

                # Depending on settings, the result dict can contain top-K probabilities, logits and more, but we'll just
                # grab the output text stream.
                generate_text += result.get("text", "")

                # The "streaming" stage also emits the EOS signal when it occurs. If present, it will accompany a
                # summary of the job. Print the last packet here to illustrate.
                if result["eos"]:
                    eos = True
                    gen_stats = GenerationStats(
                        input_tokens_count=result["prompt_tokens"],
                        output_tokens_count=result["new_tokens"],
                        time_to_first_token=result["time_prefill"],
                        time_generate=result["time_generate"],
                    )

    logger.debug("----------------------LLM Raw Response---------------\n" + result["full_completion"])

    return generate_text, gen_stats
remichu-ai commented 3 months ago

Also, do you think it could be a quantization issue? I am running 3.5bpw.

Should I try 4.0bpw?

turboderp commented 3 months ago

I don't know if that's enough for me to reproduce it here. The generation code looks fine, and I suspect it has something to do with how the schema is constructed and maybe lm-format-enforcer not being happy with some condition it can't resolve.

The filter is applied here in sampler.py. The relevant code:

        if len(filters) > 0:

            pass_tokens = None
            end_tokens = None
            for f in filters:

                pt, et = f.next()
                if pt is not None: pass_tokens = pt if pass_tokens is None else pass_tokens & pt
                if et is not None: end_tokens = et if end_tokens is None else end_tokens | et

            if pass_tokens is not None:
                assert pass_tokens, "Filter excluded all tokens"
                if filter_prefer_eos and tokenizer.eos_token_id in pass_tokens:
                    pass_tokens = { tokenizer.eos_token_id }
                ext_c.logit_filter_exclusive(logit_filter, [sorted(list(pass_tokens))])

At the end of this, pass_tokens must contain some set of valid tokens to sample from, according to the active filters and their current state. If the set is empty you'll get an exception (should never happen.) The filter is then applied to the logits before the softmax and all the other sampling functions. So there's no way this should be able to hang as long as lm-format-enforcer returns at all, though it may raise an error if there are no tokens that satisfy the filter state.

An easy way to verify would be some print statements before and after f.next(). If that's where it hangs, there is most likely either a bug in lm-format-enforcer or the schema somehow ends up being invalid. Perhaps, if the schema is invalid, it's the ExLlamaV2TokenEnforcerFilter constructor that hangs?

One thing that can happen with lm-format-enforcer is that it considers leading whitespace to conform to a JSON schema (I guess it does, technically), so if the model for some reason doesn't want to emit the leading {, it may spit out whitespace instead for a while. Doesn't seem likely with all that context and given that it emits the correct JSON without the filter. But maybe.. somehow?

Two other things to try would be:

remichu-ai commented 3 months ago

I followed your advice and managed to trace a few things.

I added the printout here:

        if len(filters) > 0:

            pass_tokens = None
            end_tokens = None
            for f in filters:

                pt, et = f.next()
                if pt is not None: pass_tokens = pt if pass_tokens is None else pass_tokens & pt
                if et is not None: end_tokens = et if end_tokens is None else end_tokens | et

            pass_tokens_list = list(pass_tokens)
            pass_token_text = []
            print(type(tokenizer.id_to_piece_with_special))
            for tok_id in pass_tokens_list:
                print(tok_id)
                print(type(tok_id))
                print(tokenizer.id_to_piece_with_special[tok_id])
                pass_token_text.append(tokenizer.id_to_piece_with_special[tok_id])

            if pass_tokens is not None:
                assert pass_tokens, "Filter excluded all tokens"
                if filter_prefer_eos and tokenizer.eos_token_id in pass_tokens:
                    pass_tokens = { tokenizer.eos_token_id }
                ext_c.logit_filter_exclusive(logit_filter, [sorted(list(pass_tokens))])

The pass_token_text is as follows, and there are a few { tokens in it:

['\r', '  ', '    ', '\r\r', '\t', '\n', '       ', '           ', '{"', '        ', '\r', ' \r', ' ', ' ', '         ', '      ', '            ', ' {\r', '{', '{\r', '     ', ' {"', '   ', ' {', '          ', ' {}', '{}', '{']

The output_tokens and output_probs are as follows. I checked, and token_id 28751 is '{', which is the correct and expected token.

        output_tokens = torch.empty((batch_size, 1), device = "cpu", dtype = torch.long)  # tensor([[28751]])
        output_probs = torch.empty((batch_size, 1), device = "cpu", dtype = torch.float)   # tensor([[1.]])

After the sample_basic, m = []

        m = ext_c.sample_basic(      # m = []
            logits,
            1.0 if settings.temperature_last else settings.temperature,
            settings.top_k,
            settings.top_p,
            settings.top_a,
            settings.min_p,
            settings.tfs,
            settings.typical,
            random,
            output_tokens,
            output_probs,
            output_kprobs,
            output_ktokens,
            logit_filter,
            settings.mirostat,
            settings.mirostat_mu if settings.mirostat else [],
            settings.mirostat_tau,
            settings.mirostat_eta,
            settings.temperature if settings.temperature_last else 1.0,
            settings.min_temp,
            settings.max_temp,
            settings.temp_exponent,
            settings.smoothing_factor,
            settings.skew
        )

This part returns output_tokens with { as expected. Let me continue checking; I will update once I figure out where it goes wrong.

        return output_tokens, output_ktokens, output_kprobs, output_probs, end_filter
remichu-ai commented 3 months ago

After tracing through the generation further, it seems like it can generate output up to the point where it should produce the array, and then it breaks:

DEBUG:    | ----------------------temperature---------
0.1
{

 "
functions
_
call
ing
":
ERROR:root:Unknown LMFormatEnforcer Problem. Prefix: '{
  "functions_calling":'
Terminating the parser. Please open an issue at 
https://github.com/noamgat/lm-format-enforcer/issues with the prefix and CharacterLevelParser parameters
Traceback (most recent call last):
  File "/home/remichu/miniconda3/envs/mlenv/lib/python3.11/site-packages/lmformatenforcer/tokenenforcer.py", line 96, in _compute_allowed_tokens
    self._collect_allowed_tokens(state.parser, self.tokenizer_tree.root, allowed_tokens, shortcut_key)
  File "/home/remichu/miniconda3/envs/mlenv/lib/python3.11/site-packages/lmformatenforcer/tokenenforcer.py", line 144, in _collect_allowed_tokens
    self._collect_allowed_tokens(next_parser, next_tree_node, allowed_tokens, None)
  File "/home/remichu/miniconda3/envs/mlenv/lib/python3.11/site-packages/lmformatenforcer/tokenenforcer.py", line 142, in _collect_allowed_tokens
    next_parser = parser.add_character(character)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/remichu/miniconda3/envs/mlenv/lib/python3.11/site-packages/lmformatenforcer/jsonschemaparser.py", line 74, in add_character
    updated_parser.object_stack[receiving_idx] = updated_parser.object_stack[receiving_idx].add_character(new_character)
                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/remichu/miniconda3/envs/mlenv/lib/python3.11/site-packages/lmformatenforcer/jsonschemaparser.py", line 627, in add_character
    item_parser = get_parser(self.root, self.list_member_type)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/remichu/miniconda3/envs/mlenv/lib/python3.11/site-packages/lmformatenforcer/jsonschemaparser.py", line 265, in get_parser
    raise Exception("Unsupported type " + str(value_schema.type))
Exception: Unsupported type None

From the above, I'm quite sure it is either an issue with my schema or a bug inside lm enforcer. I will check further and update once I figure out why.

remichu-ai commented 3 months ago

After fixing the issue above, I am back to the original issue where the generation "hanged". I streamed the tokens to see what is happening under the hood, and it turns out that the "hang" just means the generation is very, very slow under certain prompts.

I tracked the length of the pass_tokens list and it hits almost 32k possible tokens. Do you have a recommendation for the generation settings in this case?

Also, I thought top_k would kick in and limit it to the top 50 tokens?

        if len(filters) > 0:

            pass_tokens = None
            end_tokens = None
            for f in filters:

                pt, et = f.next()
                if pt is not None: pass_tokens = pt if pass_tokens is None else pass_tokens & pt
                if et is not None: end_tokens = et if end_tokens is None else end_tokens | et

            pass_tokens_list = list(pass_tokens)
            pass_tokens_list_len = len(pass_tokens_list)       # 31855 possible tokens
            pass_token_text = []
            # print(type(tokenizer.id_to_piece_with_special))
            for tok_id in pass_tokens_list:
                # print(tok_id)
                # print(type(tok_id))
                # print(tokenizer.id_to_piece_with_special[tok_id])
                pass_token_text.append(tokenizer.id_to_piece_with_special[tok_id])
turboderp commented 3 months ago

ext_c.sample_basic will only have a return value when using Mirostat sampling. Otherwise results are written to the provided output_* tensors, so it's normal for m to be empty.

Top-K is applied by the extension function, which takes the whole logit tensor as input. It also takes a logit filter tensor, which is a mask built from pass_tokens. This in turn holds all the tokens allowed by the filter for the current step, which can be a very large set sometimes. And you can't really limit it before handing off to lm-format-enforcer, because it's not a given that any of the top K tokens are actually allowed by the grammar. But 32k allowed tokens is not abnormal, and it shouldn't slow things down all that much, at least not in the sampler.
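
Conceptually (leaving out the C++ extension and the other samplers), the ordering is something like this sketch, where the filter mask is applied to the full logit row first and top-K only sees what survives:

    import torch

    def sample_with_filter(logits: torch.Tensor, pass_tokens: set,
                           top_k: int = 50, temperature: float = 0.1) -> int:
        # 1) Mask out everything the filter disallows, before any truncation sampling
        masked = torch.full_like(logits, float("-inf"))
        allowed = torch.tensor(sorted(pass_tokens), dtype = torch.long)
        masked[allowed] = logits[allowed]

        # 2) Only now apply top-K, restricted to the tokens the filter allowed
        k = min(top_k, allowed.numel())
        topk_vals, topk_idx = torch.topk(masked, k)

        # 3) Softmax and sample among those
        probs = torch.softmax(topk_vals / temperature, dim = -1)
        return topk_idx[torch.multinomial(probs, 1)].item()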

It's possible that if the schema becomes too complicated it's just more than lm-format-enforcer can handle. Or maybe it's invalid somehow? Could you provide an example of what the schema dict ends up looking like?

remichu-ai commented 3 months ago

Thank you for the response. Below is my model; maybe it is too big, like you mentioned, so I will try to unwrap a layer. I noted that including a "reason" field makes it fail. This field is meant for the model to explain why it chose the tool, but maybe the problem is that it is free text.

I am trying to simulate some form of chain-of-thought prompting, because I've realized that with lm enforcer it sometimes gives a worse response than plain generation. Maybe I should also do two generation passes (chain of thought first) for more reliable tool calling.

{
    "$defs": {
        "NextAgentTool": {
            "description": "Select an agent and ask them a question to further",
            "properties": {
                "agent_name": {
                    "description": "name of an agent to be asked with question",
                    "enum": [
                        "web_scraper_agent",
                        "web_search_agent",
                        "analysis_agent"
                    ],
                    "title": "Agent Name",
                    "type": "string"
                },
                "question": {
                    "description": "Question to ask the selected agent",
                    "title": "Question",
                    "type": "string"
                }
            },
            "required": [
                "agent_name",
                "question"
            ],
            "title": "NextAgentTool",
            "type": "object"
        },
        "Nextagenttool": {
            "properties": {
                "short_reason": {
                    "title": "Short Reason",
                    "type": "string"
                },
                "name": {
                    "const": "NextAgentTool",
                    "title": "Name"
                },
                "arguments": {
                    "$ref": "#/$defs/NextAgentTool"
                }
            },
            "required": [
                "short_reason",
                "name",
                "arguments"
            ],
            "title": "Nextagenttool",
            "type": "object"
        },
        "QuestionAnswered": {
            "description": "Pick this tool if any agent had provided decent answer to the human's question",
            "properties": {
                "question_answered": {
                    "const": "True",
                    "description": "An agent had provided answer to human's question",
                    "title": "Question Answered"
                }
            },
            "required": [
                "question_answered"
            ],
            "title": "QuestionAnswered",
            "type": "object"
        },
        "Questionanswered": {
            "properties": {
                "short_reason": {
                    "title": "Short Reason",
                    "type": "string"
                },
                "name": {
                    "const": "QuestionAnswered",
                    "title": "Name"
                },
                "arguments": {
                    "$ref": "#/$defs/QuestionAnswered"
                }
            },
            "required": [
                "short_reason",
                "name",
                "arguments"
            ],
            "title": "Questionanswered",
            "type": "object"
        },
        "TaskDoneTool": {
            "description": "Pick this tool if sufficient info required to answer human's question is available in the conversation",
            "properties": {
                "task_done": {
                    "const": "True",
                    "description": "sufficient info gathered",
                    "title": "Task Done"
                }
            },
            "required": [
                "task_done"
            ],
            "title": "TaskDoneTool",
            "type": "object"
        },
        "Taskdonetool": {
            "properties": {
                "short_reason": {
                    "title": "Short Reason",
                    "type": "string"
                },
                "name": {
                    "const": "TaskDoneTool",
                    "title": "Name"
                },
                "arguments": {
                    "$ref": "#/$defs/TaskDoneTool"
                }
            },
            "required": [
                "short_reason",
                "name",
                "arguments"
            ],
            "title": "Taskdonetool",
            "type": "object"
        }
    },
    "description": "The format to call one or multiple tools ",
    "properties": {
        "functions_calling": {
            "default": [],
            "description": "the list of functions to call in chronological order",
            "items": {
                "anyOf": [
                    {
                        "$ref": "#/$defs/Taskdonetool"
                    },
                    {
                        "$ref": "#/$defs/Questionanswered"
                    },
                    {
                        "$ref": "#/$defs/Nextagenttool"
                    }
                ]
            },
            "title": "Functions Calling",
            "type": "array"
        }
    },
    "title": "ToolCalling",
    "type": "object"
}
remichu-ai commented 3 months ago

I changed the schema: instead of asking for a reason in each function call, I ask for a one-liner overall reason upfront. It seems to help bypass the slow generation issue.

But given how sensitive this is to just changing the prompt and data model, I hope that in the future we have a more reliable method for function calling.

My new schema gives a response like the one below:

{
  "one_liner_internal_thought": "The manager_agent has confirmed the ability to speak English, so there is no need to use any additional tools for this specific query. The task is done, and the question has been answered comprehensively by the manager_agent's response. Therefore, we should indicate that the task is complete and the question has been answered satisfactorily. No further action is required from other agents at this moment as the necessary information has been provided directly in the response. The appropriate tool to select now is 'QuestionAnswered' since the manager_agent's statement sufficiently addresses the user's inquiry. There is no need to invoke 'TaskDoneTool' as 'QuestionAnswered' already implies that the task is complete when an answer has been provided. Thus, we will proceed with 'QuestionAnswered'.",
  "functions_calling": [
    {
      "name": "QuestionAnswered",
      "arguments": {
        "question_answered": "True"
      }
    }
  ]
}
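
Roughly, the new model looks like this (simplified here to a single tool; the other tools go into the union the same way as before):

    from typing import List, Literal
    from pydantic import BaseModel, Field

    class QuestionAnsweredArgs(BaseModel):
        question_answered: Literal["True"]

    class QuestionAnsweredCall(BaseModel):
        # no free-text "reason" field per call anymore
        name: Literal["QuestionAnswered"]
        arguments: QuestionAnsweredArgs

    class ToolCalling(BaseModel):
        """One short free-text thought up front, then the constrained tool calls."""
        one_liner_internal_thought: str = Field(
            description = "overall reasoning before selecting the tools"
        )
        functions_calling: List[QuestionAnsweredCall] = Field(
            default = [],
            description = "the list of functions to call in chronological order",
        )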

If you have any other suggestions, let me know; otherwise I can close the thread.

turboderp commented 3 months ago

I'm not sure. You are trying to do a lot with just a JSON constraint, and perhaps this calls for some more advanced scripting. I've considered integrations for Guidance, some kind of scripting language (maybe based on Lua), or a callback system to allow swapping out filters mid-stream combined with a bigger selection of filters.

In principle you could do something like this: (pseudocode)

reflection = generate(context + " <- reflect on that, please.")
answers = generate(
    [context + reflection + "Does that mean we should call the function " + f + "?" for f in functions],
    [SelectFilter(["Yes", "No"]) for f in functions]
)
function_requests = []
function_filters = []
for answer, function in zip(answers, functions):
    if answer == "Yes":
        function_requests.append(context + reflection + "Let's call the function " + function + " with these arguments:")
        function_filters.append(JSONFilter(get_schema_for_function(function)))
if len(function_requests):
    function_calls = generate(function_requests, function_filters)
    # .. call functions
else:
    # do something else...

It would take a bit more boilerplate, of course, but the point is you should be able to use this pattern efficiently thanks to prompt caching and deduplication. I'm definitely going to add some more features to facilitate this kind of "top down" explorative prompting.

But for your specific case with the large schema, perhaps you could ask in the lm-format-enforcer repo if there are better ways to structure it that would be less demanding to evaluate as a constraint.

remichu-ai commented 3 months ago

Thanks, I will do something like that. Actually, I am already doing something similar when the API call comes with tool_choice=auto, to simulate the OpenAI behaviour where it returns either a normal text response or a tool response in auto mode.

Even though there is boilerplate code, it works better than a single generation, and the second generation is quite fast thanks to the KV cache, like you mentioned.

Last question: given that the generation might go into this super slow mode unexpectedly, I want to implement some mechanism, e.g. tracking how long it has been since the last few tokens were generated and stopping the generation accordingly. Is there any built-in mechanism in the dynamic generation function that I can use to stop the generation partway through?

turboderp commented 3 months ago

There's no way to cancel during an iteration (i.e. from a separate thread), but in between iterations you can call generator.cancel(job) to end the job and free up any cache pages it was using. If you're using the asyncio wrapper, you would have to call async_job.cancel().

remichu-ai commented 3 months ago

By "in between iterations", do you mean in between jobs or in between token generations?

turboderp commented 3 months ago

I mean at any time other than during the call to generator.iterate(). That function performs one iteration, which can include ingesting some number of prompt tokens or generating a single output token for some number of active jobs, and it can't be interrupted.

But at any point in the loop before or after that call you can call generator.cancel(job) to terminate a job. So for instance:

    generate_text = ""
    eos = False
    while not eos:
        if some_condition():
            self.pipeline.generator.cancel(job)  # <-- kill the job here
            break
        results = self.pipeline.generator.iterate()
        for result in results:
            assert result["job"] == job
            if result["stage"] == "streaming":
                generate_text += result.get("text", "")
                if result["eos"]:
                    eos = True
                    gen_stats = GenerationStats(
                        input_tokens_count=result["prompt_tokens"],
                        output_tokens_count=result["new_tokens"],
                        time_to_first_token=result["time_prefill"],
                        time_generate=result["time_generate"],
                    )
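
For the timeout you have in mind, some_condition() could just be a stall check that you update whenever a new chunk of text arrives, for example:

    import time

    class StallCheck:
        """Trips when no new text has been streamed for `timeout` seconds."""
        def __init__(self, timeout: float = 15.0):
            self.timeout = timeout
            self.last_progress = time.monotonic()

        def tick(self, new_text: str):
            # call with each streamed chunk; any non-empty chunk resets the clock
            if new_text:
                self.last_progress = time.monotonic()

        def __call__(self) -> bool:
            return time.monotonic() - self.last_progress > self.timeout

    some_condition = StallCheck(timeout = 15.0)
    # inside the streaming loop, after grabbing the chunk:
    #     some_condition.tick(result.get("text", ""))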

If you're using the generate() function (which just creates one or more jobs and performs a similar streaming loop), there's also a threading.Event you can pass as abort_event, and triggering that event from another thread will cause the generation to end as soon as possible.
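
That could look roughly like the sketch below, with generator, prompt and settings being whatever you already have:

    import threading

    abort_event = threading.Event()

    # A watchdog thread (or your API layer) can set the event at any time,
    # e.g. a hard 60-second limit on the whole call:
    threading.Timer(60.0, abort_event.set).start()

    output = generator.generate(
        prompt = prompt,
        max_new_tokens = 512,
        gen_settings = settings,
        abort_event = abort_event,   # generation ends as soon as possible once set
    )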

remichu-ai commented 3 months ago

Thank you very much for the detailed example and explanation 🙏👏

thigger commented 3 months ago

I am having the same issue with a much simpler schema (only two properties, both strings). I haven't had any issues with other models, but Command-R is causing generation to hang (and eventually time out) using TabbyAPI and ExLlamaV2 0.1.4.

On the assumption that this is an lm-enforcer issue, I've added details here: https://github.com/noamgat/lm-format-enforcer/issues/110