langchain-ai / langchainjs

πŸ¦œπŸ”— Build context-aware reasoning applications πŸ¦œπŸ”—
https://js.langchain.com/docs/
MIT License

Token usage is not available when using `stream` or `streamEvents` #6583

Open Stadly opened 3 weeks ago

Stadly commented 3 weeks ago


Example Code

import { HumanMessagePromptTemplate } from "@langchain/core/prompts";
import { RunnableSequence } from "@langchain/core/runnables";
import { ChatOpenAI } from "@langchain/openai";

const prompt = HumanMessagePromptTemplate.fromTemplate(
  'Just respond with "test".',
);
const streamingChain = RunnableSequence.from([
  prompt,
  new ChatOpenAI({
    modelName: "gpt-4",
    streaming: true,
  }),
]);
const nonStreamingChain = RunnableSequence.from([
  prompt,
  new ChatOpenAI({
    modelName: "gpt-4",
    streaming: false,
  }),
]);
{
  const response = await nonStreamingChain.invoke({});
  // Token usage is reported when using invoke and non-streaming.
  console.log(
    response.usage_metadata, // { input_tokens: 13, output_tokens: 1, total_tokens: 14 }
    response.response_metadata, // { tokenUsage: { completionTokens: 1, promptTokens: 13, totalTokens: 14 } }
  );
}
{
  const response = await streamingChain.invoke({});
  // Token usage is estimated when using invoke and streaming.
  console.log(
    response.usage_metadata, // undefined
    response.response_metadata, // { prompt: 0, completion: 0, estimatedTokenUsage: { promptTokens: 13, completionTokens: 1, totalTokens: 14 } }
  );
}
{
  const stream = await nonStreamingChain.stream({});
  // No token usage when using stream, regardless of streaming/non-streaming.
  for await (const chunk of stream) {
    console.log(
      chunk.usage_metadata, // undefined
      chunk.response_metadata // { prompt: 0, completion: 0 }
    );
  }
}
{
  const eventStream = nonStreamingChain.streamEvents({}, { version: "v2" });
  // No token usage when using streamEvents, regardless of streaming/non-streaming.
  for await (const event of eventStream) {
    if (event.data.output)
      console.log(
        event.data.output.usage_metadata, // undefined
        event.data.output.response_metadata, // { prompt: 0, completion: 0 }
      );
    if (event.data.chunk)
      console.log(
        event.data.chunk.usage_metadata, // undefined
        event.data.chunk.response_metadata, // { prompt: 0, completion: 0 }
      );
  }
}

Error Message and Stack Trace (if applicable)

No response

Description

I'm trying to get the token usage when using streamEvents. I understand that token usage is not reported when streaming with OpenAI. LangChain therefore estimates the token usage when calling invoke on a streaming model.

I would expect the same estimation to be done when calling stream or streamEvents, making the estimated token usage available in the result.
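For reference, this is roughly the pattern I would expect to work: merge the streamed chunks and read the usage from the combined message. The comments reflect what I see today; the expected fields are an assumption based on what invoke reports for the same chain.

import { AIMessageChunk } from "@langchain/core/messages";

const stream = await streamingChain.stream({});
let aggregate: AIMessageChunk | undefined;
for await (const chunk of stream) {
  // AIMessageChunk supports concat(), so chunks can be merged as they arrive.
  aggregate = aggregate === undefined ? chunk : aggregate.concat(chunk);
}
console.log(
  aggregate?.usage_metadata, // currently undefined
  aggregate?.response_metadata, // currently { prompt: 0, completion: 0 }; expected to include estimatedTokenUsage like invoke does
);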

Related issues: https://github.com/langchain-ai/langchainjs/issues/2623 and https://github.com/langchain-ai/langchainjs/issues/6106

System Info

@langchain/core@0.2.27
@langchain/openai@0.2.7
Platform: linux
Node.js v20.16.0
NPM v10.8.1
dosubot[bot] commented 3 weeks ago

To get the estimated token usage when using stream or streamEvents in LangChain.js with OpenAI models, you can use the handleLLMEnd callback to capture the token usage information. Here is an example of how you can do this:

test("Test ChatOpenAI token usage reporting for streaming calls", async () => {
  process.env.LANGCHAIN_CALLBACKS_BACKGROUND = "false";

  try {
    let streamingTokenUsed = -1;
    let nonStreamingTokenUsed = -1;

    const systemPrompt = "You are a helpful assistant";
    const question = "What is the color of the night sky?";

    const streamingModel = new AzureChatOpenAI({
      modelName: "gpt-3.5-turbo",
      streaming: true,
      maxRetries: 10,
      maxConcurrency: 10,
      temperature: 0,
      topP: 0,
      callbacks: [
        {
          handleLLMEnd: async (output) => {
            streamingTokenUsed = output.llmOutput?.estimatedTokenUsage?.totalTokens;
          },
          handleLLMError: async (_err) => {
            // Handle error
          },
        },
      ],
    });

    const nonStreamingModel = new AzureChatOpenAI({
      modelName: "gpt-3.5-turbo",
      streaming: false,
      maxRetries: 10,
      maxConcurrency: 10,
      temperature: 0,
      topP: 0,
      callbacks: [
        {
          handleLLMEnd: async (output) => {
            nonStreamingTokenUsed = output.llmOutput?.tokenUsage?.totalTokens;
          },
          handleLLMError: async (_err) => {
            // Handle error
          },
        },
      ],
    });

    const [nonStreamingResult, streamingResult] = await Promise.all([
      nonStreamingModel.generate([
        [new SystemMessage(systemPrompt), new HumanMessage(question)],
      ]),
      streamingModel.generate([
        [new SystemMessage(systemPrompt), new HumanMessage(question)],
      ]),
    ]);

    expect(streamingTokenUsed).toBeGreaterThan(-1);
    if (
      nonStreamingResult.generations[0][0].text ===
      streamingResult.generations[0][0].text
    ) {
      expect(streamingTokenUsed).toEqual(nonStreamingTokenUsed);
    }
  } finally {
    process.env.LANGCHAIN_CALLBACKS_BACKGROUND = originalBackground;
  }
});

In this example, the handleLLMEnd callback is used to capture the estimatedTokenUsage for streaming calls and tokenUsage for non-streaming calls. This allows you to get the estimated token usage for both scenarios.

Additionally, the tokenUsage object includes promptTokens, completionTokens, and totalTokens, which are updated based on the response from the OpenAI API [1][2][3][4].
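The same callback can also be attached per call through the config argument of stream or streamEvents on the chain from the original report. Whether llmOutput actually carries estimatedTokenUsage in the stream path is exactly what this issue questions, so treat the following as a sketch to try rather than a confirmed fix:

const stream = await streamingChain.stream(
  {},
  {
    callbacks: [
      {
        handleLLMEnd: async (output) => {
          // Depending on the model and package version, usage may appear
          // under tokenUsage or estimatedTokenUsage on llmOutput.
          console.log(
            output.llmOutput?.tokenUsage ?? output.llmOutput?.estimatedTokenUsage,
          );
        },
      },
    ],
  },
);
for await (const chunk of stream) {
  // Drain the stream; handleLLMEnd fires once the model call completes.
  void chunk;
}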

To continue talking to Dosu, mention @dosu.