Closed: pokutuna closed this issue 5 months ago
To address the issue of Replacement Characters (�) appearing in outputs with multibyte characters from Google VertexAI, it's crucial to handle UTF-8 character encoding properly, especially when dealing with streams. The core of the problem lies in how chunks are processed and converted to strings, potentially splitting multibyte characters across chunks.
Here's a more direct approach to solve this:
- **Accumulate Data in a Buffer**: Instead of converting each chunk to a string immediately, accumulate the data in a Buffer. This ensures that you're not splitting any multibyte characters.
- **Use `StringDecoder` for Conversion**: Utilize the `StringDecoder` from Node.js's `string_decoder` module to convert the entire Buffer to a string after all data has been received. `StringDecoder` is designed to handle multibyte characters that may be split across multiple chunks.
Here is a simplified example:
```js
const { StringDecoder } = require('string_decoder');

const decoder = new StringDecoder('utf8');
let buffer = Buffer.alloc(0); // Initialize an empty buffer

// Example: simulate receiving chunks of data
const chunks = [Buffer.from([0xE3, 0x81, 0x82]), Buffer.from([0xE3, 0x81, 0x84])];

chunks.forEach((chunk) => {
  buffer = Buffer.concat([buffer, chunk]); // Accumulate chunks
});

// Convert buffer to string, correctly handling multibyte characters
const str = decoder.write(buffer) + decoder.end();
console.log(str); // Correctly prints characters without replacement characters
```
This approach ensures that you're not inadvertently introducing replacement characters by splitting multibyte characters across chunks.
For contributing to the LangChain.js codebase with a fix for this issue:
Make sure to review the LangChain.js contribution guidelines in their CONTRIBUTING.md file for more detailed instructions on how to contribute effectively.
To continue the conversation, mention @dosu.
As @dosu suggested, I've implemented a solution in #5286, but instead of using `string_decoder`, I used `TextDecoder`.

`TextDecoder` is standardized as the WHATWG Encoding API: https://encoding.spec.whatwg.org/

By using the `stream: true` option, it internally manages a buffer, so there is no need for manual buffering.
```js
const decoder = new TextDecoder();
decoder.decode(Buffer.from([0xf0, 0x9f]), { stream: true }); // ''
decoder.decode(Buffer.from([0x91, 0x8b]), { stream: true }); // '👋'
```
Additionally, `TextDecoder` works in both browsers and Node.js, making it portable (though this aspect is not important for this fix).
The use of `TextDecoder` with the `stream: true` option is an excellent solution for handling multibyte characters in streams, as it avoids splitting characters across chunks by managing a buffer internally. This ensures that characters are decoded correctly once all of their bytes have been received, preventing the appearance of replacement characters (�) in the output. Your implementation in #5286 aligns well with best practices for handling UTF-8 encoded text in a streaming context and leverages a modern, standardized API for text decoding. Its compatibility with both browsers and Node.js makes it applicable in a variety of environments. Your contribution is valuable and improves the handling of multibyte characters in the repository.
Did you manage to solve the character problem?
I have the following code
```ts
export async function decodeLangChainAdapter(
  response: Response
): Promise<string> {
  let result = "";
  for await (const { type, value } of readDataStream(reader, {
    isAborted: () => false,
  })) {
    if (type === "text") {
      result += value;
    }
  }
  return result;
}
```

When I use it with Gemini or VertexAI, I get the � in several Portuguese words.
@rossanodr
Try using `TextDecoderStream`:

```js
const response = await fetch("https://example.com");

const reader = response.body
  .pipeThrough(new TextDecoderStream("utf-8"))
  .getReader();

// I don't know what readDataStream does
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  console.log(value);
}
```
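To see the decoder handle a split character end to end without a network call, here is a self-contained sketch (`makeByteStream` and `readDecoded` are illustrative helper names; the two chunks split "👋" mid-character; `ReadableStream` and `TextDecoderStream` are global in browsers and in Node.js 18+):

```javascript
// Build a byte stream whose chunk boundary splits a multibyte character
function makeByteStream(chunks) {
  return new ReadableStream({
    start(controller) {
      for (const chunk of chunks) controller.enqueue(chunk);
      controller.close();
    },
  });
}

// Read the whole stream through TextDecoderStream and join the pieces
async function readDecoded(byteStream) {
  const reader = byteStream
    .pipeThrough(new TextDecoderStream('utf-8'))
    .getReader();
  let result = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    result += value; // value is already a decoded string
  }
  return result;
}

// "👋" is 0xF0 0x9F 0x91 0x8B in UTF-8, split across two chunks here
readDecoded(makeByteStream([new Uint8Array([0xf0, 0x9f]), new Uint8Array([0x91, 0x8b])]))
  .then((s) => console.log(s)); // '👋'
```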
`@langchain/google-vertexai-web` still has this problem.
@matzkoh The web package is not addressed here. I'm interested in this topic; could you provide a brief example?
@pokutuna Thank you for your continued interest in this issue, even though it was closed. I appreciate your attention to this matter.

The problem can be reproduced simply by replacing `@langchain/google-vertexai` with `@langchain/google-vertexai-web` in your sample code.
@matzkoh The fix has been released in `@langchain/google-vertexai-web` 0.0.26.
Checked other resources
Example Code
Make the model output long texts containing multibyte characters as a stream.
Error Message and Stack Trace (if applicable)
(No errors or stack traces occur)
Output Example: Includes Replacement Characters (�)
Description
This issue occurs when requesting outputs from the model in languages that include multibyte characters, such as Japanese, Chinese, Russian, Greek, and various other languages, or in texts that include emojis 😎.
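The failure mode can be reproduced in isolation, without calling the model at all (a minimal sketch):

```javascript
// "👋" is 0xF0 0x9F 0x91 0x8B in UTF-8. If a chunk boundary falls
// mid-character, decoding each chunk independently corrupts it.
const first = Buffer.from([0xf0, 0x9f]);
const second = Buffer.from([0x91, 0x8b]);

// Per-chunk toString(): each half is invalid UTF-8 on its own,
// so Node silently substitutes U+FFFD (�)
const broken = first.toString('utf8') + second.toString('utf8');
console.log(broken.includes('\uFFFD')); // true

// TextDecoder with stream: true buffers the partial sequence instead
const decoder = new TextDecoder('utf-8');
const ok =
  decoder.decode(first, { stream: true }) +
  decoder.decode(second, { stream: true });
console.log(ok); // '👋'
```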
This issue occurs due to the handling of streams containing multibyte characters and the behavior of the `buffer.toString()` method in Node.js: https://github.com/langchain-ai/langchainjs/blob/a1ed4fee2d8ebf8d26e1979a4d463c96c38cd177/libs/langchain-google-gauth/src/auth.ts#L15

When receiving a stream containing multibyte characters, the point at which a chunk callback (`readable.on('data', ...)`) is executed may fall in the middle of a character's byte sequence. For instance, the emoji "👋" is represented in UTF-8 as `0xF0 0x9F 0x91 0x8B`. The callback might be executed after only `0xF0 0x9F` has been received.

`buffer.toString()` attempts to decode byte sequences assuming UTF-8 encoding. If the bytes are invalid, it does not throw an error; instead, it silently outputs a REPLACEMENT CHARACTER (�): https://nodejs.org/api/buffer.html#buffers-and-character-encodings

To resolve this, use `TextDecoder` with the `stream` option: https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder/decode

Related Issues
The issue has been reported below, but it persists even in the latest version.
The same issue occurred when using Google Cloud's client libraries instead of LangChain, but it has been fixed.
I will send a Pull Request later, but I am not familiar with this codebase, and there are many Google-related packages under libs/ that I have not fully grasped. Any advice would be appreciated.
System Info