When decoding tokens that represent a multi-byte string, the result is wrong if the tokens are decoded one by one.
```js
import {StringDecoder} from 'node:string_decoder'
import {AutoTokenizer} from '@xenova/transformers'

const tokenizer = await AutoTokenizer.from_pretrained('Qwen/Qwen2-0.5B')
const tokens = [32, 13, 66521, 243, 28291]

// Decoding all tokens in one call yields the expected string.
console.log('Correct string:', tokenizer.decode(tokens))
console.log('Correct bytes:', Buffer.from(tokenizer.decode(tokens)))

// Decoding the same tokens one by one corrupts the multi-byte characters,
// even though a StringDecoder is used to stitch partial bytes together.
const decoder = new StringDecoder('utf8')
const allBytes = []
process.stdout.write('\nWrong string: ')
for (const token of tokens) {
  const bytes = Buffer.from(tokenizer.decode([token]))
  allBytes.push(bytes)
  process.stdout.write(decoder.write(bytes))
}
process.stdout.write('\n')
console.log('Wrong bytes:', Buffer.concat(allBytes))
```
Reproduction

Run the above script with Node.js and you will see:
```
Correct string: A. 单发
Correct bytes: <Buffer 41 2e 20 e5 8d 95 e5 8f 91>

Wrong string: A. ��发
Wrong bytes: <Buffer 41 2e 20 ef bf bd ef bf bd e5 8f 91>
```
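Note that `ef bf bd` is the UTF-8 encoding of U+FFFD (REPLACEMENT CHARACTER). Its presence in the "wrong" output suggests the substitution already happened inside `tokenizer.decode`, before the bytes ever reached the `StringDecoder`. A quick check:

```js
// ef bf bd is the UTF-8 encoding of U+FFFD (REPLACEMENT CHARACTER)
console.log(Buffer.from('\uFFFD', 'utf8')) // <Buffer ef bf bd>
```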
I expect the bytes to be the same whether the tokens are decoded in one call or one by one.

This is probably the intended result, since a single token may decode to only part of a Unicode character. However, this behavior makes it impossible to implement a correct streaming interface for LLMs, which is what I'm doing in my llm.js module.
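As a workaround sketch (my own approach, not an API provided by @xenova/transformers), one can re-decode the accumulated token list at each step and emit only the newly completed text, holding back any trailing U+FFFD that marks a partial character:

```js
import {AutoTokenizer} from '@xenova/transformers'

const tokenizer = await AutoTokenizer.from_pretrained('Qwen/Qwen2-0.5B')
const tokens = [32, 13, 66521, 243, 28291]

const seen = []
let printed = 0
for (const token of tokens) {
  seen.push(token)
  let text = tokenizer.decode(seen)
  // A trailing U+FFFD means the last token(s) cover only part of a
  // multi-byte character; hold it back until more tokens arrive.
  while (text.endsWith('\uFFFD')) text = text.slice(0, -1)
  process.stdout.write(text.slice(printed))
  printed = text.length
}
process.stdout.write('\n') // prints "A. 单发"
```

Re-decoding the whole sequence on every step is quadratic in the output length, so a real implementation would keep only a small window of trailing tokens, but the hold-back-on-U+FFFD idea is the same.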
System Info

- Node.js 22.4.0
- @xenova/transformers 2.17.2