Closed: pokutuna closed this issue 5 months ago
To address the issue of Replacement Characters (�) appearing in outputs with multibyte characters from Google VertexAI, it's crucial to handle UTF-8 character encoding properly, especially when dealing with streams. The core of the problem lies in how chunks are processed and converted to strings, potentially splitting multibyte characters across chunks.
Here's a more direct approach to solve this:
- **Accumulate Data in a Buffer**: Instead of converting each chunk to a string immediately, accumulate the data in a Buffer. This ensures that you're not splitting any multibyte characters.
- **Use `StringDecoder` for Conversion**: Utilize the `StringDecoder` from Node.js's `string_decoder` module to convert the entire Buffer to a string after all data has been received. `StringDecoder` is designed to handle multibyte characters that may be split across multiple chunks.
Here is a simplified example:
```js
const { StringDecoder } = require('string_decoder');

const decoder = new StringDecoder('utf8');
let buffer = Buffer.alloc(0); // Initialize an empty buffer

// Example: simulate receiving chunks of data
const chunks = [Buffer.from([0xE3, 0x81, 0x82]), Buffer.from([0xE3, 0x81, 0x84])];

chunks.forEach((chunk) => {
  buffer = Buffer.concat([buffer, chunk]); // Accumulate chunks
});

// Convert buffer to string, correctly handling multibyte characters
const str = decoder.write(buffer) + decoder.end();
console.log(str); // Correctly prints characters without replacement characters
```
This approach ensures that you're not inadvertently introducing replacement characters by splitting multibyte characters across chunks.
For contributing to the LangChain.js codebase with a fix for this issue:
Make sure to review the LangChain.js contribution guidelines in their CONTRIBUTING.md file for more detailed instructions on how to contribute effectively.
To continue the conversation, mention @dosu.
As @dosu suggested, I've implemented a solution in #5286, but instead of using `string_decoder`, I used `TextDecoder`.

`TextDecoder` is standardized as the WHATWG Encoding API: https://encoding.spec.whatwg.org/

By using the `stream: true` option, it internally manages a buffer, so there is no need for manual buffering.
```js
const decoder = new TextDecoder();
decoder.decode(Buffer.from([0xf0, 0x9f]), { stream: true }); // ''
decoder.decode(Buffer.from([0x91, 0x8b]), { stream: true }); // '👋'
```
Additionally, `TextDecoder` works in both browsers and Node.js, making it portable (though this aspect is not important for this fix).
The use of `TextDecoder` with the `stream: true` option is an excellent solution for handling multibyte characters in streams, as it avoids splitting characters across chunks by managing a buffer internally. This ensures that characters are decoded correctly once all of their bytes have been received, preventing the appearance of replacement characters (�) in the output. Your implementation in #5286 aligns well with best practices for handling UTF-8 encoded text in a streaming context and leverages a modern, standardized API for text decoding. Its compatibility with both browsers and Node.js makes it applicable in a variety of environments. Your contribution is valuable and improves the handling of multibyte characters in the repository.
Did you manage to solve the character problem?
I have the following code
```ts
export async function decodeLangChainAdapter(
  response: Response
): Promise<string> {
  let result = "";
  for await (const { type, value } of readDataStream(reader, {
    isAborted: () => false,
  })) {
    if (type === "text") {
      result += value;
    }
  }
  return result;
}
```

When I use it with Gemini or VertexAI, I get the � in several Portuguese words.
@rossanodr
Try using `TextDecoderStream`:

```js
const response = await fetch("https://example.com");

const reader = response.body
  .pipeThrough(new TextDecoderStream("utf-8"))
  .getReader();

// I don't know what readDataStream does
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  console.log(value);
}
```
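To see the decoder handle a split character end to end without a network call, here is a self-contained sketch (`makeByteStream` and `readDecoded` are illustrative helper names; the two chunks split "👋" mid-character; `ReadableStream` and `TextDecoderStream` are global in browsers and in Node.js 18+):

```javascript
// Build a byte stream whose chunk boundary splits a multibyte character
function makeByteStream(chunks) {
  return new ReadableStream({
    start(controller) {
      for (const chunk of chunks) controller.enqueue(chunk);
      controller.close();
    },
  });
}

// Read the whole stream through TextDecoderStream and join the pieces
async function readDecoded(byteStream) {
  const reader = byteStream
    .pipeThrough(new TextDecoderStream('utf-8'))
    .getReader();
  let result = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    result += value; // value is already a decoded string
  }
  return result;
}

// "👋" is 0xF0 0x9F 0x91 0x8B in UTF-8, split across two chunks here
readDecoded(makeByteStream([new Uint8Array([0xf0, 0x9f]), new Uint8Array([0x91, 0x8b])]))
  .then((s) => console.log(s)); // '👋'
```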
`@langchain/google-vertexai-web` still has this problem.
@matzkoh The web package is not addressed here. I'm interested in this topic; could you provide a brief example?
@pokutuna Thank you for your continued interest in this issue, even though it was closed. I appreciate your attention to this matter.

The problem can be reproduced simply by replacing `@langchain/google-vertexai` with `@langchain/google-vertexai-web` in your sample code.
@matzkoh The fix has been released in `@langchain/google-vertexai-web` 0.0.26.
Checked other resources
Example Code
Make the model output long texts containing multibyte characters as a stream.
Error Message and Stack Trace (if applicable)
(No errors or stack traces occur)
Output Example: Includes Replacement Characters (�)
Description
This issue occurs when requesting outputs from the model in languages that include multibyte characters, such as Japanese, Chinese, Russian, Greek, and various other languages, or in texts that include emojis 😎.
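The failure mode can be reproduced in isolation, without calling the model at all (a minimal sketch):

```javascript
// "👋" is 0xF0 0x9F 0x91 0x8B in UTF-8. If a chunk boundary falls
// mid-character, decoding each chunk independently corrupts it.
const first = Buffer.from([0xf0, 0x9f]);
const second = Buffer.from([0x91, 0x8b]);

// Per-chunk toString(): each half is invalid UTF-8 on its own,
// so Node silently substitutes U+FFFD (�)
const broken = first.toString('utf8') + second.toString('utf8');
console.log(broken.includes('\uFFFD')); // true

// TextDecoder with stream: true buffers the partial sequence instead
const decoder = new TextDecoder('utf-8');
const ok =
  decoder.decode(first, { stream: true }) +
  decoder.decode(second, { stream: true });
console.log(ok); // '👋'
```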
This issue occurs due to the handling of streams containing multibyte characters and the behavior of the `buffer.toString()` method in Node.js: https://github.com/langchain-ai/langchainjs/blob/a1ed4fee2d8ebf8d26e1979a4d463c96c38cd177/libs/langchain-google-gauth/src/auth.ts#L15

When receiving a stream containing multibyte characters, the point at which a chunk callback (`readable.on('data', ...)`) is executed may fall in the middle of a character's byte sequence. For instance, the emoji "👋" is represented in UTF-8 as `0xF0 0x9F 0x91 0x8B`. The callback might be executed after only `0xF0 0x9F` has been received.

`buffer.toString()` attempts to decode byte sequences assuming UTF-8 encoding. If the bytes are invalid, it does not throw an error; instead, it silently outputs a REPLACEMENT CHARACTER (�): https://nodejs.org/api/buffer.html#buffers-and-character-encodings

To resolve this, use `TextDecoder` with the `stream` option: https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder/decode

Related Issues
The issue has been reported below, but it persists even in the latest version.
The same issue occurred when using Google Cloud's client libraries instead of LangChain, but it has been fixed.
I will send a Pull Request later, but I am not familiar with this codebase, and there are many Google-related packages under libs/ that I have not fully grasped. Any advice would be appreciated.
System Info