Closed (pokutuna closed this issue 2 months ago)
To address the issue of the Replacement Character (�) appearing in the multibyte text output from Google VertexAI Web, you need to ensure that the stream chunks are properly handled, especially when a chunk is cut in the middle of a multibyte character. This can be done by implementing a buffer to store incomplete multibyte characters and appending them to the next chunk.
Here is an example of how you can modify the stream handling to avoid the issue:
import { VertexAI } from "@langchain/google-vertexai-web";

const langchainModel = new VertexAI({
  model: "gemini-1.5-pro-001",
  location: "us-central1",
});

// Prompt (Japanese): "List as many Japanese proverbs as you can"
const prompt = "日本のことわざをできるだけたくさん挙げて";
const stream = await langchainModel.stream(prompt);
const reader = stream.getReader();

// Assumption: each chunk is a Uint8Array of raw UTF-8 bytes. If the stream
// yields already-decoded strings, this decoding step does not apply and the
// fix has to be made where the bytes are decoded.
const decoder = new TextDecoder("utf-8");
let buf = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // With stream: true the decoder holds back an incomplete multibyte
  // sequence at the end of a chunk and completes it with the next chunk,
  // so no separate surrogate check is needed.
  buf += decoder.decode(value, { stream: true });
}

buf += decoder.decode(); // Flush any bytes still held by the decoder
console.log(buf);
This code uses a TextDecoder in streaming mode to decode the UTF-8 stream chunks, so an incomplete multibyte sequence at the end of one chunk is held back and completed by the next chunk. This prevents replacement characters from appearing in the output.
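The streaming behaviour can be checked without calling the API. The following is a minimal sketch, assuming the chunks are raw UTF-8 bytes; `decodeChunks` and the chunk boundary are hypothetical and chosen only for illustration, not taken from the library.

```typescript
// Hypothetical helper: decode a list of UTF-8 byte chunks with one shared decoder.
function decodeChunks(chunks: readonly Uint8Array[]): string[] {
  const decoder = new TextDecoder("utf-8");
  const pieces: string[] = [];
  for (const chunk of chunks) {
    // stream: true keeps a trailing incomplete byte sequence inside the decoder
    // instead of turning it into U+FFFD.
    pieces.push(decoder.decode(chunk, { stream: true }));
  }
  // Final flush: yields U+FFFD only if the input really ended mid-character.
  pieces.push(decoder.decode());
  return pieces;
}

const sample = "ことわざ"; // each character is 3 bytes in UTF-8
const sampleBytes = new TextEncoder().encode(sample);

// Simulated chunk boundary one byte into the second character.
const pieces = decodeChunks([sampleBytes.slice(0, 4), sampleBytes.slice(4)]);

console.log(pieces.join("") === sample); // true
```

The first piece contains only the complete first character; the dangling byte is emitted as part of the second piece once the rest of the character arrives.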
For more details on the changes made to address this issue, see the pull request "fix handling of multibyte characters in streams for google-gauth" [1][2].
Checked other resources
Example Code
Make the model output long texts containing multibyte characters as a stream.
This code can be executed by creating a service account key from the Google Cloud Console and running it with the following command:
$ GOOGLE_WEB_CREDENTIALS=$(cat ./key.json) npx tsx sample.ts
Error Message and Stack Trace (if applicable)
(No errors or stack traces occur)
Output Example: Includes Replacement Characters (�)
Description
This is the same issue as #5285. While #5285 is about @langchain/google-vertexai, this issue also occurs in @langchain/google-vertexai-web. The problem occurs when a stream chunk is cut in the middle of a multibyte character. For detailed reasons, please refer to #5285.
I will submit a Pull Request with the fix shortly.
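The failure mode described above can be reproduced in isolation. This is a sketch under the assumption that each chunk is decoded independently; the byte offset of the split is made up for the example and does not come from real VertexAI output.

```typescript
// Reproduction sketch: no network access or LangChain involved.
const original = "ことわざ"; // each character is 3 bytes in UTF-8
const bytes = new TextEncoder().encode(original); // 12 bytes in total

// Hypothetical chunk boundary at byte 4, one byte into the second character.
const chunks = [bytes.slice(0, 4), bytes.slice(4)];

// Decoding each chunk with a fresh decoder cannot resolve the partial
// sequences on either side of the boundary, so U+FFFD is emitted instead.
const broken = chunks
  .map((chunk) => new TextDecoder("utf-8").decode(chunk))
  .join("");

console.log(broken.includes("\uFFFD")); // true
console.log(broken === original); // false
```

Characters that lie entirely inside one chunk still decode correctly, which is why the output is mostly readable with occasional replacement characters.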
System Info