langchain-ai / langchainjs

πŸ¦œπŸ”— Build context-aware reasoning applications πŸ¦œπŸ”—
https://js.langchain.com/docs/
MIT License

Token usage is not available when using `stream` or `streamEvents` #6583

Open Stadly opened 3 weeks ago

Stadly commented 3 weeks ago


Example Code

import { HumanMessagePromptTemplate } from "@langchain/core/prompts";
import { RunnableSequence } from "@langchain/core/runnables";
import { ChatOpenAI } from "@langchain/openai";

const prompt = HumanMessagePromptTemplate.fromTemplate(
  'Just respond with "test".',
);
const streamingChain = RunnableSequence.from([
  prompt,
  new ChatOpenAI({
    modelName: "gpt-4",
    streaming: true,
  }),
]);
const nonStreamingChain = RunnableSequence.from([
  prompt,
  new ChatOpenAI({
    modelName: "gpt-4",
    streaming: false,
  }),
]);
{
  const response = await nonStreamingChain.invoke({});
  // Token usage is reported when using invoke and non-streaming.
  console.log(
    response.usage_metadata, // { input_tokens: 13, output_tokens: 1, total_tokens: 14 }
    response.response_metadata, // { tokenUsage: { completionTokens: 1, promptTokens: 13, totalTokens: 14 } }
  );
}
{
  const response = await streamingChain.invoke({});
  // Token usage is estimated when using invoke and streaming.
  console.log(
    response.usage_metadata, // undefined
    response.response_metadata, // { prompt: 0, completion: 0, estimatedTokenUsage: { promptTokens: 13, completionTokens: 1, totalTokens: 14 } }
  );
}
{
  const stream = await nonStreamingChain.stream({});
  // No token usage when using stream, regardless of streaming/non-streaming.
  for await (const chunk of stream) {
    console.log(
      chunk.usage_metadata, // undefined
      chunk.response_metadata // { prompt: 0, completion: 0 }
    );
  }
}
{
  const eventStream = nonStreamingChain.streamEvents({}, { version: "v2" });
  // No token usage when using streamEvents, regardless of streaming/non-streaming.
  for await (const event of eventStream) {
    if (event.data.output)
      console.log(
        event.data.output.usage_metadata, // undefined
        event.data.output.response_metadata, // { prompt: 0, completion: 0 }
      );
    if (event.data.chunk)
      console.log(
        event.data.chunk.usage_metadata, // undefined
        event.data.chunk.response_metadata, // { prompt: 0, completion: 0 }
      );
  }
}

Error Message and Stack Trace (if applicable)

No response

Description

I'm trying to get the token usage when using streamEvents. I understand that token usage is not reported when streaming with OpenAI. LangChain therefore estimates the token usage when calling invoke on a streaming model.

I would expect the same estimation to be done when calling stream or streamEvents, making the estimated token usage available in the result.
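For reference, this is roughly the pattern I would expect to work: merge the streamed chunks and read the usage from the combined message. The comments reflect what I see today; the expected fields are an assumption based on what invoke reports for the same chain.

import { AIMessageChunk } from "@langchain/core/messages";

const stream = await streamingChain.stream({});
let aggregate: AIMessageChunk | undefined;
for await (const chunk of stream) {
  // AIMessageChunk supports concat(), so chunks can be merged as they arrive.
  aggregate = aggregate === undefined ? chunk : aggregate.concat(chunk);
}
console.log(
  aggregate?.usage_metadata, // currently undefined
  aggregate?.response_metadata, // currently { prompt: 0, completion: 0 }; expected to include estimatedTokenUsage like invoke does
);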

Related issues: https://github.com/langchain-ai/langchainjs/issues/2623 and https://github.com/langchain-ai/langchainjs/issues/6106

System Info

@langchain/core@0.2.27
@langchain/openai@0.2.7
Platform: linux
Node.js v20.16.0
NPM v10.8.1
dosubot[bot] commented 3 weeks ago

To get the estimated token usage when using stream or streamEvents in LangChain.js with OpenAI models, you can use the handleLLMEnd callback to capture the token usage information. Here is an example of how you can do this:

test("Test ChatOpenAI token usage reporting for streaming calls", async () => {
  process.env.LANGCHAIN_CALLBACKS_BACKGROUND = "false";

  try {
    let streamingTokenUsed = -1;
    let nonStreamingTokenUsed = -1;

    const systemPrompt = "You are a helpful assistant";
    const question = "What is the color of the night sky?";

    const streamingModel = new AzureChatOpenAI({
      modelName: "gpt-3.5-turbo",
      streaming: true,
      maxRetries: 10,
      maxConcurrency: 10,
      temperature: 0,
      topP: 0,
      callbacks: [
        {
          handleLLMEnd: async (output) => {
            streamingTokenUsed = output.llmOutput?.estimatedTokenUsage?.totalTokens;
          },
          handleLLMError: async (_err) => {
            // Handle error
          },
        },
      ],
    });

    const nonStreamingModel = new AzureChatOpenAI({
      modelName: "gpt-3.5-turbo",
      streaming: false,
      maxRetries: 10,
      maxConcurrency: 10,
      temperature: 0,
      topP: 0,
      callbacks: [
        {
          handleLLMEnd: async (output) => {
            nonStreamingTokenUsed = output.llmOutput?.tokenUsage?.totalTokens;
          },
          handleLLMError: async (_err) => {
            // Handle error
          },
        },
      ],
    });

    const [nonStreamingResult, streamingResult] = await Promise.all([
      nonStreamingModel.generate([
        [new SystemMessage(systemPrompt), new HumanMessage(question)],
      ]),
      streamingModel.generate([
        [new SystemMessage(systemPrompt), new HumanMessage(question)],
      ]),
    ]);

    expect(streamingTokenUsed).toBeGreaterThan(-1);
    if (
      nonStreamingResult.generations[0][0].text ===
      streamingResult.generations[0][0].text
    ) {
      expect(streamingTokenUsed).toEqual(nonStreamingTokenUsed);
    }
  } finally {
    process.env.LANGCHAIN_CALLBACKS_BACKGROUND = originalBackground;
  }
});

In this example, the handleLLMEnd callback is used to capture the estimatedTokenUsage for streaming calls and tokenUsage for non-streaming calls. This allows you to get the estimated token usage for both scenarios.

Additionally, the tokenUsage object includes promptTokens, completionTokens, and totalTokens, which are updated based on the response from the OpenAI API [1][2][3][4].
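The same callback can also be attached per call through the config argument of stream or streamEvents on the chain from the original report. Whether llmOutput actually carries estimatedTokenUsage in the stream path is exactly what this issue questions, so treat the following as a sketch to try rather than a confirmed fix:

const stream = await streamingChain.stream(
  {},
  {
    callbacks: [
      {
        handleLLMEnd: async (output) => {
          // Depending on the model and package version, usage may appear
          // under tokenUsage or estimatedTokenUsage on llmOutput.
          console.log(
            output.llmOutput?.tokenUsage ?? output.llmOutput?.estimatedTokenUsage,
          );
        },
      },
    ],
  },
);
for await (const chunk of stream) {
  // Drain the stream; handleLLMEnd fires once the model call completes.
  void chunk;
}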

To continue talking to Dosu, mention @dosu.