langchain-ai / langchainjs

🦜🔗 Build context-aware reasoning applications 🦜🔗
https://js.langchain.com/docs/
MIT License

Replacement Character(�) appears in multibyte text output from Google VertexAI Web #6501

Closed pokutuna closed 3 months ago

pokutuna commented 3 months ago

Example Code

Have the model stream a long text that contains multibyte characters.

import { VertexAI } from "@langchain/google-vertexai-web";

const langchainModel = new VertexAI({
  model: "gemini-1.5-pro-001",
  location: "us-central1",
});

// EN: List as many Japanese proverbs as possible.
const prompt = "日本のことわざをできるだけたくさん挙げて";

const stream = await langchainModel.stream(prompt);
const reader = stream.getReader();
let buf = "";
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buf += value;
}
console.log(buf);

This code can be executed by creating a service account key in the Google Cloud Console and running it with the following command:

$ GOOGLE_WEB_CREDENTIALS=$(cat ./key.json) npx tsx sample.ts

Error Message and Stack Trace (if applicable)

(No errors or stack traces occur)

Output Example: Includes Replacement Characters (�)

## ���本の諺 (ことわざ)  -  できるだけたくさん!

**一般的な知������������**

* 石の上にも三年 (いしのうえにもさんねん) - Perseverance will pay off.
* 七転び八起き (ななころびやおき) - Fall seven times, stand up eight.
* 継続は力なり (けいぞくはちからなり) -  Persistence is power.
* 急がば回れ (い��がばまわれ) - Haste makes waste.
* 井の中の蛙大海を知らず (いのなかのかわずたいかいをしらず) - A frog in a well knows nothing of the great ocean.
* 良���は���に苦し (りょうやくはくちにくい) -  Good medicine tastes bitter.
* 猿も木から落ちる (さるもきからおちる) - Even monkeys fall from trees.
* 転石苔を生ぜず (てんせきこけをしょうぜず) - A rolling stone gathers no moss.
* 覆水盆に返らず (ふくすいぼんにかえらず) - Spilled water will not return to the tray.
* 後生の祭り (ごしょうの�����り) - Too late for regrets.
* 習うより慣れろ (ならうよりなれろ) -  Experience is the best teacher.
* 鉄は熱いうちに打て (てつはあついうちにうて) - Strike while the iron is hot.

...

Description

This is the same issue as #5285. While #5285 is about @langchain/google-vertexai, this issue also occurs in @langchain/google-vertexai-web.

The problem occurs when a stream chunk is cut in the middle of a multibyte character. For detailed reasons, please refer to #5285.
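
The failure mode described above can be reproduced without any network call. The following is a minimal sketch, not the library's actual code: the helper names `decodePerChunk` and `decodeStreaming` are hypothetical, and only the standard `TextEncoder` and `TextDecoder` APIs are assumed. It splits a UTF-8 byte sequence in the middle of a character and compares decoding each chunk independently with decoding through one decoder in streaming mode.

```typescript
// Hypothetical helpers for illustration; only TextEncoder/TextDecoder are standard.

// Decode every chunk on its own. A character whose bytes are split across
// two chunks cannot be reassembled, so U+FFFD is emitted instead.
function decodePerChunk(chunks: Uint8Array[]): string {
  return chunks.map((c) => new TextDecoder("utf-8").decode(c)).join("");
}

// Decode with one shared decoder in streaming mode. Trailing bytes of an
// incomplete character are held back until the next chunk arrives.
function decodeStreaming(chunks: Uint8Array[]): string {
  const decoder = new TextDecoder("utf-8");
  let out = "";
  for (const c of chunks) {
    out += decoder.decode(c, { stream: true });
  }
  return out + decoder.decode(); // flush any bytes still buffered
}

const bytes = new TextEncoder().encode("日本の諺");
// Each of these characters takes 3 bytes in UTF-8; cut inside the first one.
const chunks = [bytes.slice(0, 1), bytes.slice(1)];

console.log(decodePerChunk(chunks));  // contains U+FFFD replacement characters
console.log(decodeStreaming(chunks)); // 日本の諺
```

The streaming variant needs access to the raw bytes, which is why the correction has to happen where the response bytes are read rather than in code that consumes already-decoded strings.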

I will submit a Pull Request with the fix shortly.

dosubot[bot] commented 3 months ago

To address the issue of the Replacement Character (�) appearing in the multibyte text output from Google VertexAI Web, you need to ensure that the stream chunks are properly handled, especially when a chunk is cut in the middle of a multibyte character. This can be done by implementing a buffer to store incomplete multibyte characters and appending them to the next chunk.

Here is an example of how you can modify the stream handling to avoid the issue:

import { VertexAI } from "@langchain/google-vertexai-web";

const langchainModel = new VertexAI({
  model: "gemini-1.5-pro-001",
  location: "us-central1",
});

const prompt = "日本のことわざをできるだけたくさん挙げて";

const stream = await langchainModel.stream(prompt);
const reader = stream.getReader();
let buf = "";
// One decoder shared across chunks; in streaming mode it buffers the trailing
// bytes of an incomplete multibyte character until the next chunk arrives.
const decoder = new TextDecoder("utf-8");

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  // Chunks from stream() arrive as already-decoded strings, so only raw byte
  // chunks are passed through the decoder.
  buf += typeof value === "string" ? value : decoder.decode(value, { stream: true });
}

buf += decoder.decode(); // Flush any bytes still held by the decoder
console.log(buf);

This code uses a TextDecoder in streaming mode ({ stream: true }), which holds back the bytes of an incomplete multibyte character and prepends them to the next chunk, so no replacement character is produced at a chunk boundary. This only helps where the chunks are still raw bytes. When the chunks are already strings, as they are from stream() here, a replacement character that is already present cannot be repaired on the caller's side, because U+FFFD no longer carries the original bytes; the decoding has to be fixed where the bytes are read.

For more details on the changes made to address this issue, see the pull request "fix handling of multibyte characters in streams for google-gauth" [1][2].