marella / ctransformers

Python bindings for Transformer models implemented in C/C++ using the GGML library.
MIT License

Streaming decode issue #118

Open · lucasjinreal opened 1 year ago

lucasjinreal commented 1 year ago

Hello, for llama, when decoding Chinese or Japanese text, one character might need 2 or more tokens to decode. So when streaming, the chunk returned from decoding a single token is wrong:

[screenshot of the garbled streaming output]

Is there a way to resolve this?

llama.cpp does not have this issue.
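
To illustrate what I mean (a minimal sketch; the exact token/byte split is hypothetical):

    text = "静"                  # one CJK character
    data = text.encode("utf-8")  # b'\xe9\x9d\x99' -- three bytes in UTF-8

    # Suppose the model emits this character across two tokens, so the
    # raw bytes arrive in two chunks (hypothetical split for illustration):
    chunk1, chunk2 = data[:2], data[2:]

    # Decoding each chunk on its own, as naive streaming would, is wrong:
    print(chunk1.decode("utf-8", errors="replace"))  # '�' -- not '静'
    print(chunk2.decode("utf-8", errors="replace"))  # '�' -- not '静'

    # Decoding only once all the bytes have arrived is correct:
    print((chunk1 + chunk2).decode("utf-8"))         # '静'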

marella commented 1 year ago

Hi, such cases are already handled, so this shouldn't happen. Can you please share the code, the prompt, and a link to the model you are using?

lucasjinreal commented 1 year ago

@marella Sure, this is the code:

    from ctransformers import AutoModelForCausalLM

    llm = AutoModelForCausalLM.from_pretrained(m_f, gpu_layers=150)

    conv = get_default_conv_template(args.conv_template)

    history = []
    while True:
        qs = input('> ')

        conv.append_message_single_turn(qs)
        prompt = conv.get_prompt()
        if args.debug:
            print(prompt)

        # Stream the response token by token.
        outputs = ''
        for text in llm(prompt, stream=True):
            print(text, end="", flush=True)
            outputs += text
        print()

        if not args.bare:
            if args.multi_turn:
                history.append({"input": qs, "output": outputs})
            else:
                conv.clear()

Regardless of how the template is composed, the output fails to decode only Chinese and Japanese characters.

For instance, prompt: 背诵古诗静夜思 ("recite the classic poem Quiet Night Thoughts")

Can you help figure out what the issue is?

After looking at your code:

    # Handle incomplete UTF-8 multi-byte characters.
    incomplete += self.detokenize([token], decode=False)
    complete, incomplete = utf8_split_incomplete(incomplete)
    text += complete.decode(errors="ignore")

I think this is not a Chinese-character issue in the llama tokenizer itself. For Chinese and Japanese, some characters need 2 or more tokens to produce the right string bytes, so your handling might not actually cover this situation. What do you think?
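
For reference, here is a sketch of what a helper like utf8_split_incomplete has to do (my own reconstruction, not the library's actual implementation): hold back any trailing bytes that form an incomplete UTF-8 sequence, and only decode the complete prefix.

    def utf8_split_incomplete(data: bytes) -> tuple[bytes, bytes]:
        """Split data into (complete, incomplete) UTF-8 byte sequences."""
        # Scan backwards over at most the last 4 bytes looking for a lead byte.
        for i in range(1, min(4, len(data)) + 1):
            byte = data[-i]
            if byte & 0b1100_0000 == 0b1000_0000:
                continue  # UTF-8 continuation byte; keep scanning backwards
            # Found a lead byte: how many bytes should its sequence have?
            if byte & 0b1000_0000 == 0:
                needed = 1  # ASCII
            elif byte & 0b1110_0000 == 0b1100_0000:
                needed = 2
            elif byte & 0b1111_0000 == 0b1110_0000:
                needed = 3
            else:
                needed = 4
            if needed > i:
                # Sequence runs past the end of the buffer: hold it back.
                return data[:-i], data[-i:]
            return data, b""
        return data, b""

    buf = "静".encode("utf-8")           # b'\xe9\x9d\x99'
    print(utf8_split_incomplete(buf[:2]))  # (b'', b'\xe9\x9d') -- held back
    print(utf8_split_incomplete(buf))      # (b'\xe9\x9d\x99', b'') -- complete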

elliotthwang commented 10 months ago

For reference: in UTF-8, ASCII characters occupy only one byte, which is space-saving, but characters outside the ASCII range take more bytes, especially for Chinese, Japanese and Korean (CJK): most characters in the CJK blocks require three bytes.
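
For example, checking the UTF-8 byte lengths in Python:

    for ch in ("A", "é", "静"):
        # Prints: A 1, é 2, 静 3
        print(ch, len(ch.encode("utf-8")))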