openai / openai-node

Official JavaScript / TypeScript library for the OpenAI API
https://www.npmjs.com/package/openai
Apache License 2.0
7.98k stars · 870 forks

Non-ASCII tokens are corrupted sometimes when using the streaming API #706

Closed — hvenev-insait closed this issue 7 months ago

hvenev-insait commented 8 months ago

Confirm this is a Node library issue and not an underlying OpenAI API issue

Describe the bug

When using the streaming API, sometimes tokens get corrupted. Characters get replaced by two or more \uFFFD. For example:

{
  choices: [ { text: ' из��естни' } ],
}

when the token received is actually supposed to be ' известни'.

The issue occurs because LineDecoder does not handle multi-byte characters that are split across chunk boundaries. Instead of using a separate TextDecoder instance per buffer, perhaps it should use a single stateful decoder (or a TextDecoderStream) for the entire stream.
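The boundary problem described above can be demonstrated in a few lines (a minimal sketch, not the SDK's actual LineDecoder code): each Cyrillic letter is two UTF-8 bytes, so a chunk split mid-character corrupts output when each chunk gets a fresh decoder, while a single stateful decoder with `{ stream: true }` buffers the partial sequence correctly.

```typescript
// 'известни' encodes to 16 UTF-8 bytes (2 per character).
const bytes = new TextEncoder().encode('известни');
// Split mid-character: the first chunk ends one byte into the second letter.
const a = bytes.slice(0, 3);
const b = bytes.slice(3);

// A fresh TextDecoder per chunk cannot recover the split character,
// so both halves of it become U+FFFD replacement characters:
const broken = new TextDecoder().decode(a) + new TextDecoder().decode(b);

// One decoder reused across chunks, with { stream: true } on all but the
// final call, carries the partial byte sequence over the boundary:
const decoder = new TextDecoder();
const fixed = decoder.decode(a, { stream: true }) + decoder.decode(b);
// fixed === 'известни'
```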

To Reproduce

  1. Send a streaming completion request that will get non-ASCII tokens as a response.
  2. Observe the output. With some probability, some of the tokens will be corrupted.

Code snippets

No response

OS

Linux

Node version

Node v18.19.1

Library version

openai v4.14.2

rattrayalex commented 8 months ago

cc @robertcraigie


rattrayalex commented 8 months ago

Thank you for the report! We can reproduce this and expect to have a fix out in the coming days. Thanks for your patience.

benchaikin commented 7 months ago

Any update on this? We're also seeing issues with the whisper transcription API.

rattrayalex commented 7 months ago

Sorry, we've been a bit delayed here - we hope to take another crack tomorrow.

benchaikin commented 7 months ago

@rattrayalex Any progress on this or any way I can possibly help?

RobertCraigie commented 7 months ago

Sorry I forgot to update here, this should be fixed now! Have you tried the latest version?

benchaikin commented 7 months ago

Amazing! I'll snag the latest version and put it through its paces. Thanks so much!

benchaikin commented 7 months ago

Just tested out v4.33.0. Still seeing unknown characters:

Example response: "Claro, me gustar�a saber sobre una situaci�n espec�fica..."

RobertCraigie commented 7 months ago

Can you share an example script to reproduce that?

benchaikin commented 7 months ago

Here's a silly example:

    import OpenAI from 'openai';

    const openAi = new OpenAI({ apiKey });
    const response = await openAi.chat.completions.create({
      model: 'gpt-4-1106-preview',
      messages: [
        { role: 'assistant', content: 'Name a topic' },
        { role: 'user', content: 'The sun.' },
      ],
      tools: [
        {
          type: 'function',
          function: {
            name: 'limerick_func',
            description:
              'returns a limerick in Spanish when the USER asks about a topic',
            parameters: {
              type: 'object',
              properties: {
                limerick: {
                  type: 'string',
                  description: 'The contents of the Spanish limerick',
                },
              },
              required: ['limerick'],
            },
          },
        },
      ],
      stream: false,
    });

    console.log(response.choices[0].message.tool_calls[0].function.arguments);

Output: {"limerick":"En el cielo un astro se ve,\nque alumbra con fuerza y fe,\nel sol sin igual,\nda luz y calor vital,\ny en el d�a su poder se ve."}

RobertCraigie commented 7 months ago

Thanks, ran that a couple of times and didn't see any weird characters.

That script also isn't using streaming, so I suspect something else is happening in your case.

benchaikin commented 7 months ago

We're using a mix of streaming, not streaming, tool calls, and whisper transcriptions. It's not happening every time, but sometimes. Also sometimes function arguments return results like this:

{"limerick":"En el cielo brilla el sol,\ncon su luz da inspiraci\\u00f3n,\nen el d\\u00eda mucho ardor,\nde noche se va sin adi\\u00f3s,\nregalando al mundo su pasi\\u00f3n."}

I put together another script that only returns streaming completions and so far I haven't been able to reproduce the issue there - but it's definitely happening in function args and whisper-1 transcriptions. Maybe this is partially fixed?
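For what it's worth, the literal `\u00f3` sequences above look like a doubly-escaped payload from the model rather than an SDK decoding artifact. A hedged workaround sketch (assuming the arguments string is otherwise valid JSON and the value contains no quotes or stray backslashes) is a second decode pass:

```typescript
// Sketch: the raw arguments payload carries a literal backslash before
// "u00f3", so one JSON.parse leaves the escape sequence as plain text.
const args = '{"limerick":"inspiraci\\\\u00f3n"}'; // wire bytes: inspiraci\\u00f3n
const once = JSON.parse(args).limerick;            // 'inspiraci\u00f3n' (literal text)
// Re-quoting and parsing again resolves the leftover escape into a real character:
const twice = JSON.parse('"' + once + '"');        // 'inspiración'
```

This is fragile (it breaks if the text contains quotes or backslashes of its own), so fixing the source of the double escaping is preferable.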

rattrayalex commented 7 months ago

A reliable repro would be most useful!

benchaikin commented 7 months ago

"Reliable" is the tricky bit here as the issue isn't consistent. However after running the previous script repeatedly I'm able to get invalid characters in the function arguments:

{"limerick":"En el cielo brilla el sol,\ncon sus rayos da calor,\nilumina el d�a entero,\nes de fuego un gran lucero,\nsu belleza es sin igual, un esplendor."}

as well as the occasional unicode escape characters:

{"limerick":"En el cielo brilla el sol,\ncon su luz da inspiraci\\u00f3n,\nilumina el d\\u00eda,\ncon su energ\\u00eda,\ny nos da mucha emoci\\u00f3n."}

rattrayalex commented 7 months ago

ok thanks, we'll try to take a look next week!

RobertCraigie commented 7 months ago

@benchaikin thanks for the repro, I did manage to reproduce the issue you're seeing but unfortunately it does not appear to be an SDK issue.

I could also reproduce this with the Python SDK; the API is responding with the binary data for the � symbol directly, so it's not a decoding issue. That is, it's sending the bytes \xef\xbf\xbd, which decode to �.
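This distinction is easy to verify: EF BF BD is the valid UTF-8 encoding of U+FFFD itself, so those bytes decode cleanly even with strict error handling, meaning the replacement character was present in the payload before the SDK ever decoded it. A small check (not from the thread, just an illustration):

```typescript
// The exact bytes the API was sending for the corrupted character:
const wireBytes = new Uint8Array([0xef, 0xbf, 0xbd]);

// fatal: true makes the decoder throw on malformed UTF-8; it does NOT throw
// here, because EF BF BD is well-formed — it simply encodes U+FFFD.
const decoded = new TextDecoder('utf-8', { fatal: true }).decode(wireBytes);
// decoded === '\uFFFD', so the corruption originated server-side
```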

I've passed this on to the OpenAI team.

RobertCraigie commented 7 months ago

Hey @benchaikin, this was an issue with the gpt-4-1106-preview model in particular which was fixed with gpt-4-0125-preview, so if you upgrade to that model or an even newer one then your issue should be resolved :)

Here's the community forum post for this https://community.openai.com/t/gpt-4-1106-preview-messes-up-function-call-parameters-encoding/478500/103?u=atty-openai

benchaikin commented 7 months ago

Awesome, thank you for following up! I'll try this out today and report back.