Closed: hvenev-insait closed this 7 months ago
cc @robertcraigie
On Tue, Mar 5, 2024 at 6:33 AM, Hristo Venev wrote:
Confirm this is a Node library issue and not an underlying OpenAI API issue
- This is an issue with the Node library
Describe the bug
When using the streaming API, sometimes tokens get corrupted. Characters get replaced by two or more \uFFFD. For example:
{ choices: [ { text: ' из��естни' } ], }
when the token received is actually supposed to be ' известни'.
The issue occurs because LineDecoder does not handle multi-byte characters that span chunk boundaries. Instead of using a separate TextDecoder instance per buffer, it should perhaps use a single TextDecoderStream for the entire stream.
To Reproduce
- Send a streaming completion request that will get non-ASCII tokens as a response.
- Observe the output. With some probability, some of the tokens will be corrupted.
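The chunk-boundary problem described above can be demonstrated without the SDK. A minimal sketch (assuming Node 18+, where TextEncoder/TextDecoder are global): a multi-byte UTF-8 character split across two chunks is corrupted when each chunk gets a fresh TextDecoder, but decodes correctly when one decoder is reused with `{ stream: true }`.

```typescript
// 'известни' is 8 Cyrillic characters, 2 bytes each in UTF-8 (16 bytes total).
const bytes = new TextEncoder().encode('известни');
const chunk1 = bytes.slice(0, 3); // splits the second character mid-sequence
const chunk2 = bytes.slice(3);

// Broken: a new decoder per chunk turns the split character into U+FFFD (�).
const broken = new TextDecoder().decode(chunk1) + new TextDecoder().decode(chunk2);

// Correct: one decoder in streaming mode buffers the incomplete byte sequence
// from chunk1 and finishes decoding it when chunk2 arrives.
const dec = new TextDecoder();
const ok = dec.decode(chunk1, { stream: true }) + dec.decode(chunk2);
// ok === 'известни', while broken contains replacement characters
```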
Code snippets
No response
OS
Linux
Node version
Node v18.19.1
Library version
openai v4.14.2
Thank you for the report! We can reproduce this and expect to have a fix out in the coming days. Thanks for your patience.
Any update on this? We're also seeing issues with the whisper transcription API.
Sorry, we've been a bit delayed here - we hope to take another crack tomorrow.
@rattrayalex Any progress on this or any way I can possibly help?
Sorry I forgot to update here, this should be fixed now! Have you tried the latest version?
Amazing! I'll snag the latest version and put it through its paces. Thanks so much!
Just tested out v4.33.0. Still seeing unknown characters for:
Example response: "Claro, me gustar�a saber sobre una situaci�n espec�fica..."
Can you share an example script to reproduce that?
Here's a silly example:
import OpenAI from 'openai';

const openAi = new OpenAI({ apiKey });
const response = await openAi.chat.completions.create({
  model: 'gpt-4-1106-preview',
  messages: [
    { role: 'assistant', content: 'Name a topic' },
    { role: 'user', content: 'The sun.' },
  ],
  tools: [
    {
      type: 'function',
      function: {
        name: 'limerick_func',
        description:
          'returns a limerick in Spanish when the USER asks about a topic',
        parameters: {
          type: 'object',
          properties: {
            limerick: {
              type: 'string',
              description: 'The contents of the Spanish limerick',
            },
          },
          required: ['limerick'],
        },
      },
    },
  ],
  stream: false,
});
console.log(response.choices[0].message.tool_calls[0].function.arguments);
Output: {"limerick":"En el cielo un astro se ve,\nque alumbra con fuerza y fe,\nel sol sin igual,\nda luz y calor vital,\ny en el d�a su poder se ve."}
Thanks, I ran that a couple of times and didn't see any weird characters.
That script also isn't using streaming, so I suspect something else is happening in your case.
We're using a mix of streaming, not streaming, tool calls, and whisper transcriptions. It's not happening every time, but sometimes. Also sometimes function arguments return results like this:
{"limerick":"En el cielo brilla el sol,\ncon su luz da inspiraci\\u00f3n,\nen el d\\u00eda mucho ardor,\nde noche se va sin adi\\u00f3s,\nregalando al mundo su pasi\\u00f3n."}
I put together another script that only returns streaming completions and so far I haven't been able to reproduce the issue there - but it's definitely happening in function args and whisper-1 transcriptions. Maybe this is partially fixed?
A reliable repro would be most useful!
"Reliable" is the tricky bit here as the issue isn't consistent. However after running the previous script repeatedly I'm able to get invalid characters in the function arguments:
{"limerick":"En el cielo brilla el sol,\ncon sus rayos da calor,\nilumina el d�a entero,\nes de fuego un gran lucero,\nsu belleza es sin igual, un esplendor."}
as well as the occasional unicode escape characters:
{"limerick":"En el cielo brilla el sol,\ncon su luz da inspiraci\\u00f3n,\nilumina el d\\u00eda,\ncon su energ\\u00eda,\ny nos da mucha emoci\\u00f3n."}
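Worth noting: the `\u00f3`-style sequences in the arguments above are standard JSON unicode escapes, so while they look odd in the raw string, `JSON.parse` still recovers the intended characters. A quick check (illustrative snippet, not from the thread):

```typescript
// The raw function-arguments string contains a literal \u00f3 escape.
const args = '{"limerick":"inspiraci\\u00f3n"}';
const parsed = JSON.parse(args);
// parsed.limerick === 'inspiración' — the escape decodes to the real character
```

So the escaped variants are cosmetically surprising but still usable after parsing; only the � replacement characters represent actual data loss.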
ok thanks, we'll try to take a look next week!
@benchaikin thanks for the repro, I did manage to reproduce the issue you're seeing but unfortunately it does not appear to be an SDK issue.
I could also reproduce this with the Python SDK: the API is responding with the binary data for the � symbol directly, i.e. the bytes \xef\xbf\xbd, which decode to �. It's not a decoding issue.
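A quick check (assuming Node 18+, where TextDecoder is global) confirms that \xef\xbf\xbd is exactly the UTF-8 encoding of U+FFFD, the Unicode replacement character rendered as �, so the corruption here originates server-side:

```typescript
// Decode the raw bytes the API returned; they are the UTF-8 form of U+FFFD.
const decoded = new TextDecoder().decode(new Uint8Array([0xef, 0xbf, 0xbd]));
// decoded === '\uFFFD', i.e. '�'
```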
I've passed this on to the OpenAI team.
Hey @benchaikin, this was an issue with the gpt-4-1106-preview model in particular, which was fixed in gpt-4-0125-preview, so if you upgrade to that model or an even newer one your issue should be resolved :)
Here's the community forum post for this https://community.openai.com/t/gpt-4-1106-preview-messes-up-function-call-parameters-encoding/478500/103?u=atty-openai
Awesome, thank you for following up! I'll try this out today and report back.