langchain-ai / langchainjs

πŸ¦œπŸ”— Build context-aware reasoning applications πŸ¦œπŸ”—
https://js.langchain.com/docs/
MIT License
12.16k stars 2.05k forks source link

Typing of tool input in AgentAction is broken for StructuredTool, input error handling and StructuredChatOutputParser #2710

Closed RunOrVeith closed 6 months ago

RunOrVeith commented 11 months ago

This issue requires a bit of a lengthy explanation, but the overall problem is:

TLDR:

The types of StructuredTool, AgentAction, parsing error handling in AgentExecutor and StructuredChatOutputParser don't fit together, and it only typechecks kind of by accident at the moment. See also the summary at the bottom.

Explanation

I am going to explain the issue with an example. Let's assume we have a StructuredTool that can click on an element on a web page:

import { z } from "zod";
import { StructuredTool } from "langchain/tools";

const clickSchema = z.object({
  selector: z.string().describe("The query seletor to click on."),
});

const ClickSchema = z.infer<typeof clickSchema>;

class ClickTool extends StructuredTool<typeof clickSchema> {
  schema = clickSchema;
  name = "click";
  description =
    "left click on an element on a web page represented by a query selector";

  protected _call(arg: ClickSchema): Promise<string> {
    // We need to know that arg.selector is a thing here
    return this.click(arg.selector);
  }

  private click(selector: string): Promise<string> {

    return Promise.resolve(`Clicked on ${selector}`);
  }
}

When you look at the signature and implementation of StructuredTool.call, it seems like we already know what the input is, but in reality, the validation only happens inside of that function:

async call(
    arg: (z.output<T> extends string ? string : never) | z.input<T>,
    configArg?: Callbacks | RunnableConfig,
    /** @deprecated */
    tags?: string[]
  ): Promise<string> {
    let parsed;
    try {
      parsed = await this.schema.parseAsync(arg); // Only now we know that arg is z.input<T>!
    } catch (e) {
      throw new ToolInputParsingException(
        `Received tool input did not match expected schema`,
        JSON.stringify(arg)
      );
    }
    // ... more code

In our case, our schema can not be string, so this simplifies to the type { selector: string }. The signature reads like we have already made some assumptions about the input, where in reality we are only at a signature that looks like this:

call: async (arg: (z.output<T> extends string ? string : never) | Json, /*...*/) => Promise<String>;

But even that has more issues:

This is where the OutputParser comes in. The part that we really care about is the parse method:

This is where the string that the model produces is parsed to JSON, and errors are thrown if it is not valid JSON. A non-json string will throw a syntax error if passed into JSON.parse.

 /**
   * Parses the given text and returns an `AgentAction` or `AgentFinish`
   * object. If an `OutputFixingParser` is provided, it is used for parsing;
   * otherwise, the base parser is used.
   * @param text The text to parse.
   * @param callbacks Optional callbacks for asynchronous operations.
   * @returns A Promise that resolves to an `AgentAction` or `AgentFinish` object.
   */
  async parse(text: string): Promise<AgentAction | AgentFinish> {
    try {
      const regex = /```(?:json)?(.*)(```)/gs;
      const actionMatch = regex.exec(text);
      if (actionMatch === null) {
        throw new OutputParserException(
          `Could not parse an action. The agent action must be within a markdown code block, and "action" must be a provided tool or "Final Answer"`
        );
      }
      const response = JSON.parse(actionMatch[1].trim());
      const { action, action_input } = response;

      if (action === "Final Answer") {
        return { returnValues: { output: action_input }, log: text };
      }
      return { tool: action, toolInput: action_input || {}, log: text };
    } catch (e) {
      throw new OutputParserException(
        `Failed to parse. Text: "${text}". Error: ${e}`
      );
    }
  }

From parsing, we get an AgentAction (we can ignore AgentFinish for now) that looks like this:

export type AgentAction = {
    tool: string;
    toolInput: string;
    log: string;
};

toolInput is typed as string, even though we just used JSON.parse to get it! Consider the positive case where the model produced output that matches our schema, let's say the string "{\"selector\": \"myCoolButton\"}". Using JSON.parse, this will deserialize to an object { selector: "myCoolButton" }, and not a string. But because JSON.parse's return type is any, the typescript compiler has no chance of realizing this.

To understand why this is troublesome, we need to look into the execution loop where the AgentActions are used to actually invoke the tool. This happens here in AgentExecutor._call. I've split the relevant parts of the method into these two smaller methods and simplied a bit to show my point:

  // originally part of AgentExectutor._call
  async safePlanNextStep(previousSteps: AgentStep[]): Promise<AgentAction> {
    try {
      // This will return either an AgentAction or AgentFinish through outputParser.parse
      output = await this.agent.plan(steps, inputs);
    } catch (e) {
      if (e instanceof OutputParserException) {
        let observation;
        const text = e.message;
        if (this.handleParsingErrors === true) {
          observation = "Invalid or incomplete response";
        } else if (typeof this.handleParsingErrors === "string") {
          observation = this.handleParsingErrors;
        } else if (typeof this.handleParsingErrors === "function") {
          observation = this.handleParsingErrors(e);
        } else {
          throw e;
        }
        output = {
          tool: "_Exception",
          toolInput: observation,
          log: text,
        } as AgentAction;
      } else {
        throw e;
      }
    }
  }

This is where the parsing using the OutputParser comes in, and where its exceptions are handled. You can see that in the case of an error, the toolInput field will always be a string (if this.handleParsingErrors is a function, the return type is also string). But we have just seen above, that in the non-error case toolInput will be parsed JSON! This is inconsistent behavior, we never parse the output of handleParsingErrors to JSON, so we are now in a state where toolInput is sometimes a string, and sometimes parsed json.

The next step is to call the selected tool with the given input:

// Also part of AgentExecutor._call in the real code
async safeCallTool(action: AgentAction): Promise<AgentStep> {
  const tool =
    action.tool === "_Exception"
      ? new ExceptionTool()
      : toolsByName[action.tool?.toLowerCase()];
  let observation;
  try {
    observation = tool
      ? await tool.call(action.toolInput)
      : `${action.tool} is not a valid tool, try another one.`;
  } catch (e) {
    if (e instanceof ToolInputParsingException) {
      if (this.handleParsingErrors === true) {
        observation = "Invalid or incomplete tool input. Please try again.";
      } else if (typeof this.handleParsingErrors === "string") {
        observation = this.handleParsingErrors;
      } else if (typeof this.handleParsingErrors === "function") {
        observation = this.handleParsingErrors(e);
      } else {
        throw e;
      }
      observation = await new ExceptionTool().call(
        observation,
        runManager?.getChild()
      );
      return { action, observation: observation ?? "" };
    }
  }

  return { action, observation: observation ?? "" };
}

We only pass the previously computed output on to the tool in tool.call(action.toolInput)! We do not actually have any guarantees for the input types to the tool! And in case this causes another error, we re-use the same function to handle parsing errors that will return a string that is supposed to be the tool output in the error case.

Summary

When is that an actual issue?

We noticed that the model often times makes the same errors with the tool input format. Recall our ClickSchema, which is just { selector: string }. In our clicking example it would either generate according to the schema, or { element: string }, or just a string which was the value we want, like "myCoolButton".

So we built and auto-fixer for these common error cases. The fixer basically just checks whether it can fix the input using either of the options above. The earliest we can inject this code without overwriting a lot of the planning logic that LangChain provides is in StructuredTool.call. We can not handle it using handleParsingErrors, since that receives only the error as input, and not the causing text. Once you are overwriting StructuredTool.call, you are relying on the signature of that function to be correct, which we just saw is not the case. It would also be great if the corrected tool input could be serialized in the intermediate steps, which we can only do through some hacks at the moment, because the steps are not part of the error handling process. Separate issue for this: #2711 . At this point, you are stuck having to figure out all of the above to see why you are getting wrongly typed inputs to call and in the resulting intermediateSteps.

Improvement ideas

Unfortunately, anything that really fixes this is a breaking change. Nonetheless, this is what I would propose:

I would be willing to contribute here if we can find a good solution. I have not looked into the python code for this, but assume it has the same problem.

dosubot[bot] commented 11 months ago

πŸ€–

Thank you for your detailed analysis and suggestions on improving type safety and consistency in the LangChainJS framework. I agree with your points and here's how we can address them:

  1. StructuredTool class: We can modify the call function to only accept JSON as input. This will ensure that the input is always valid JSON and will eliminate the need for parsing inside the call function. Here's how you can do it:
async call(
  arg: z.input<T>,
  configArg?: Callbacks | RunnableConfig,
  /** @deprecated */
  tags?: string[]
): Promise<string> {
  let parsed;
  try {
    parsed = await this.schema.parseAsync(arg);
  } catch (e) {
    this.handleError(e);
  }

  // ...
}
  1. AgentAction class: We can change the type of toolInput field to JSON. This will ensure that the toolInput field is always valid JSON and will eliminate the need for parsing it to JSON. Here's how you can do it:
class AgentAction {
  toolInput: z.input<T>;

  // ...
}
  1. AgentActionExecutor class: We can extract the methods for error handling from AgentExecutor._call into their own methods. This will make the error handling more modular and easier to manage. Here's how you can do it:
class AgentActionExecutor {
  // ...

  private handleParsingErrors(e: Error): void {
    // Handle parsing errors here
  }

  private _call(action: AgentAction): void {
    try {
      // ...
    } catch (e) {
      this.handleParsingErrors(e);
    }
  }
}
  1. StructuredChatOutputParser class: We can modify the parse function to only return valid JSON. This will ensure that the output is always valid JSON and will eliminate the need for parsing it to JSON. Here's how you can do it:
class StructuredChatOutputParser {
  // ...

  parse(output: string): z.input<T> {
    // Parse the output to JSON here
  }
}

Please note that these changes will require updating all places where these classes and their methods are used. Also, these changes will make the code more type safe, but they may not cover all edge cases. Therefore, thorough testing is recommended after implementing these changes.

I hope this helps! If you have any further questions or need more clarification, please let me know.

Sources


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a πŸ‘ if this is helpful and πŸ‘Ž if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

RunOrVeith commented 11 months ago

@dosu-bot Your recommendations are exactly what I already proposed. You could help by refining them to not be a breaking change.

dosubot[bot] commented 11 months ago

πŸ€–

Thank you for your detailed proposal. Here are my suggestions on how to implement these changes without causing breaking changes:

  1. Modify the call function in StructuredTool class to only accept JSON as input. Instead of directly modifying the call function to only accept JSON, you could create a new method, say callJSON, which accepts only JSON. This method could internally call the existing call method after converting the JSON to the required type. This way, the existing functionality will not be affected.

    async callJSON(
     arg: JSON,
     configArg?: Callbacks | RunnableConfig,
     tags?: string[]
    ): Promise<string> {
     const convertedArg = convertJSONToRequiredType(arg); // Implement this function
     return this.call(convertedArg, configArg, tags);
    }
  2. Change the type of toolInput field in AgentAction class to JSON. Similar to the above suggestion, instead of directly changing the type of toolInput, you could add a new field, say toolInputJSON, which is of type JSON. You could then modify the code to populate both toolInput and toolInputJSON. This way, the existing code which relies on toolInput will not be affected.

    const toolInputJSON = function_call.arguments
     ? JSON.parse(function_call.arguments)
     : {};
    const toolInput = convertJSONToRequiredType(toolInputJSON); // Implement this function
  3. Extract the methods for error handling from AgentExecutor._call into their own methods. This change should not cause any breaking changes as long as the extracted methods are called in the same order and with the same arguments as in the original _call method. Make sure to thoroughly test the new implementation to ensure that it behaves exactly as the old one.

  4. Modify the parse function in StructuredChatOutputParser class to only return valid JSON. Instead of modifying the existing parse function, you could create a new method, say parseJSON, which only returns valid JSON. This method could internally call the existing parse method and then convert the result to JSON. This way, the existing functionality will not be affected.

    parseJSON(output: string): JSON {
     const result = this.parse(output);
     return convertToJSON(result); // Implement this function
    }

Please note that these are just suggestions and might need to be adapted to your specific use case. Also, remember to thoroughly test all changes to ensure that they do not introduce any new bugs.

Sources


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a πŸ‘ if this is helpful and πŸ‘Ž if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

david1542 commented 9 months ago

I'm also experiencing this issue. Any news on that?

dosubot[bot] commented 6 months ago

Hi, @RunOrVeith,

I'm helping the langchainjs team manage their backlog and am marking this issue as stale. From what I understand, the issue you raised highlighted type inconsistency and parsing errors in the LangChainJS framework. Proposed improvements included modifying the call function in StructuredTool to only accept JSON, changing the type of toolInput field in AgentAction to JSON, and extracting error handling methods for better modularity. Dosubot provided detailed suggestions on implementing these changes without causing breaking changes, and another user, david1542, also expressed experiencing the same issue.

Could you please confirm if this issue is still relevant to the latest version of the langchainjs repository? If it is, please let the langchainjs team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days. Thank you!