huggingface / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js
Apache License 2.0

Result is wrong when decoding tokens one by one #853

Open zcbenz opened 4 months ago

zcbenz commented 4 months ago

System Info

Node.js 22.4.0
@xenova/transformers 2.17.2

Environment/Platform

Node.js

Description

When decoding tokens that represent a multi-byte string, the result is wrong if the tokens are decoded one by one.

import {StringDecoder} from 'node:string_decoder'
import {AutoTokenizer} from '@xenova/transformers'

const tokenizer = await AutoTokenizer.from_pretrained('Qwen/Qwen2-0.5B')

// These tokens decode to "A. 单发"; the UTF-8 bytes of 单 are split
// across the two middle tokens.
const tokens = [32, 13, 66521, 243, 28291]
console.log('Correct string:', tokenizer.decode(tokens))
console.log('Correct bytes:', Buffer.from(tokenizer.decode(tokens)))

// Decode the same tokens one at a time and stitch the results together.
const decoder = new StringDecoder('utf8')
let allBytes = []
process.stdout.write('\nWrong string: ')
for (const token of tokens) {
  const bytes = Buffer.from(tokenizer.decode([token]))
  allBytes.push(bytes)
  process.stdout.write(decoder.write(bytes))
}
process.stdout.write('\n')
console.log('Wrong bytes:', Buffer.concat(allBytes))

Reproduction

Run the above script with Node and you will see the following output:

Correct string: A. 单发
Correct bytes: <Buffer 41 2e 20 e5 8d 95 e5 8f 91>

Wrong string: A. ��发
Wrong bytes: <Buffer 41 2e 20 ef bf bd ef bf bd e5 8f 91>

I expect the bytes to be the same whether the tokens are decoded in one call or one by one.

This is probably the intended result, since a single token may decode to only part of a multi-byte Unicode character: here the three UTF-8 bytes of 单 (e5 8d 95) are split across two tokens, and decoding each fragment on its own yields the replacement character U+FFFD (ef bf bd). However, this behavior makes it impossible to implement a correct streaming interface for LLMs, which I'm doing in my llm.js module.
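
For reference, one way a streaming interface could work around this is to re-decode the accumulated tokens on every step and emit only the text that has become stable. This is just a sketch appended to the script above; the StreamDecoder name and its API are made up, not part of @xenova/transformers:

// Keep all tokens seen so far, re-decode the whole sequence on every
// step, and emit only the portion of the text that is new and no longer
// ends in U+FFFD.
class StreamDecoder {
  constructor(tokenizer) {
    this.tokenizer = tokenizer
    this.tokens = []
    this.emitted = 0  // length of the text already emitted
  }

  // Feed one token; returns the newly stable text (possibly '').
  push(token) {
    this.tokens.push(token)
    const text = this.tokenizer.decode(this.tokens)
    // A trailing U+FFFD means the last token ended mid-character; hold
    // the tail back until the next token completes it.
    if (text.endsWith('\uFFFD'))
      return ''
    const out = text.slice(this.emitted)
    this.emitted = text.length
    return out
  }
}

const streamer = new StreamDecoder(tokenizer)
process.stdout.write('Streamed string: ')
for (const token of tokens)
  process.stdout.write(streamer.push(token))
process.stdout.write('\n')  // prints "Streamed string: A. 单发"

Re-decoding the full sequence on every step is O(n²) over the generation length, so a real implementation would probably keep only a window of recent tokens.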

zcbenz commented 4 months ago

I have found a workaround: detect the replacement character \uFFFD in the decoded string: https://github.com/frost-beta/llm.js/commit/6e816b0bdfe2c161d82bbf4f2324fc32815e1fb3.
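
Not the exact code from that commit, but a sketch of the same idea, assuming the tokenizer and tokens from the repro script: buffer any tokens whose combined decode still ends in U+FFFD, and flush once the bytes form a complete character:

let pending = []
function decodeStreaming(tokenizer, token) {
  pending.push(token)
  const text = tokenizer.decode(pending)
  // A trailing U+FFFD means the buffered tokens end mid-character;
  // wait for more tokens before emitting anything.
  if (text.endsWith('\uFFFD'))
    return ''
  pending = []
  return text
}

for (const token of tokens)
  process.stdout.write(decodeStreaming(tokenizer, token))
process.stdout.write('\n')  // prints "A. 单发"

Note that decoding tokens in separate groups may still differ from decoding the whole sequence at once for tokenizers that normalize or clean up whitespace across token boundaries, so holding all tokens and emitting only the stable prefix (as sketched in the comment above) is the more conservative approach.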