hyperonym / basaran

Basaran is an open-source alternative to the OpenAI text completion API. It provides a compatible streaming API for your Hugging Face Transformers-based text generation models.
MIT License

In stream mode, English words have no spaces after detokenization and Chinese is messed up #197

Open lucasjinreal opened 1 year ago

lucasjinreal commented 1 year ago

[screenshot]

How can I resolve this problem?

peakji commented 1 year ago

Hi @lucasjinreal. We need more information in order to assist you in resolving the issue.

May I ask which model you are using? Are you using it through the API or through Python?

lucasjinreal commented 1 year ago

@peakji I think it's not related to the model. For the model I'm simply using LLaMA.

The reason is that when we decode a single id, the result can differ from decoding the ids together as a sentence.

For instance, for the ids [34, 56, 656], the tokenizer might decode: I love u

But if you decode them one by one, you will get: Iloveu

It doesn't preserve the spaces, and Chinese characters are even worse.

However, I'm not sure whether this is the real cause.

But the above are the problems I'm actually seeing.

What do you think? (Mainly, plain words lose their spaces compared to the original, and Chinese is decoded incorrectly.)
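The space-loss effect described above can be illustrated without any real model. The snippet below uses a toy SentencePiece-style piece table (the ids and pieces are made up for illustration, not the actual LLaMA vocabulary): word boundaries are marked with a leading "▁" that only becomes a space when a whole sequence is decoded together.

```python
# Toy illustration (not the real LLaMA tokenizer): SentencePiece-style
# pieces mark word boundaries with a leading "▁" that is turned into a
# space only during sentence-level detokenization.
PIECES = {34: "▁I", 56: "▁love", 656: "▁u"}  # hypothetical id → piece map

def decode(ids):
    # Join the pieces, map "▁" to a space, and strip the leading space,
    # mimicking how decoding a full sequence behaves.
    return "".join(PIECES[i] for i in ids).replace("▁", " ").lstrip()

ids = [34, 56, 656]
print(decode(ids))                        # → I love u  (spaces preserved)
print("".join(decode(i) for i in [[34], [56], [656]]))  # → Iloveu
```

Decoding each id independently strips every leading space, so the concatenation runs the words together, exactly as described above.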

lucasjinreal commented 1 year ago

Or maybe something is missed inside your StreamTokenizer (like some ignored ids)? Can you try decoding the ids one by one and printing them?

outputs = []
for oid in output_ids:
    # decode each output id individually and print without a newline
    word = tokenizer.decode(oid[0])
    print(word, end='')
    outputs.append(word)
print()
outputs = ''.join(outputs)

I was wrong.

peakji commented 1 year ago

The reason is that when we decode a single id, the result can differ from decoding the ids together as a sentence.

StreamTokenizer is specifically designed to handle this properly.

There is an example of the LLaMA tokenizer in the test case, which also includes Chinese characters:

https://github.com/hyperonym/basaran/blob/master/tests/test_tokenizer.py#L48
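The core idea behind stateful streaming detokenization can be sketched in a few lines. This is a simplified illustration, not Basaran's actual StreamTokenizer implementation: keep every id seen so far, re-decode the whole prefix, and emit only the newly produced text, withholding output while the tail is an incomplete multi-byte character.

```python
# Sketch of prefix-diff streaming detokenization (not Basaran's actual
# StreamTokenizer). `decode` is any sentence-level decode(ids) -> str.
class PrefixDiffDetokenizer:
    def __init__(self, decode):
        self.decode = decode
        self.ids = []
        self.emitted = 0            # length of text already emitted

    def push(self, token_id):
        self.ids.append(token_id)
        text = self.decode(self.ids)
        if text.endswith("\ufffd"):
            return ""               # incomplete multi-byte char: wait
        delta = text[self.emitted:]
        self.emitted = len(text)
        return delta

# Demo with a toy SentencePiece-style decode (hypothetical ids/pieces):
PIECES = {1: "▁I", 2: "▁love", 3: "▁u"}
def decode(ids):
    return "".join(PIECES[i] for i in ids).replace("▁", " ").lstrip()

stream = PrefixDiffDetokenizer(decode)
print("".join(stream.push(i) for i in [1, 2, 3]))  # → I love u
```

Because each delta is computed against the full decoded prefix, the spaces between words survive even though tokens arrive one at a time.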

lucasjinreal commented 1 year ago

@peakji Thanks. I'm just using the tokenizer of StreamModel, and the Chinese decoding errors still exist.

[screenshot]

And I still cannot get the spaces between English words.

I think the output stream has some problems. How can I combine the model and tokenizer to print correct words in the terminal?

peakji commented 1 year ago

Here's a simple example for using Basaran as a Python library: https://github.com/hyperonym/basaran/blob/master/examples/basaran-python-library/main.py

lucasjinreal commented 1 year ago

I got no spaces, and Chinese was wrong as well (try print(word, end='')).

I don't want a line break after every word, and I don't want unexpected spaces in non-English characters.

peakji commented 1 year ago

Could you please provide some example code for us to reproduce the issue?

The output in your first screenshot is apparently not from StreamTokenizer.

lucasjinreal commented 1 year ago

@peakji The second one is. I first use model = StreamModel(model, tokenizer) and then use model.tokenizer to decode.

Can you guys provide a demo that prints the correct values without line breaks (printing the words one by one correctly)?

peakji commented 1 year ago

I first use model = StreamModel(model, tokenizer) and then use model.tokenizer to decode.

You shouldn't use model.tokenizer directly, because it's not a stateful StreamTokenizer but a stateless Hugging Face tokenizer.
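The statefulness matters especially for Chinese: one CJK character is several UTF-8 bytes, and a tokenizer may split it across tokens. A pure-Python sketch of why stateless per-fragment decoding garbles such characters:

```python
# One CJK character spans multiple UTF-8 bytes; if a tokenizer splits it
# across tokens, no single fragment can be decoded on its own.
text = "中"
data = text.encode("utf-8")        # b'\xe4\xb8\xad' — three bytes
fragments = [data[:1], data[1:]]   # pretend these arrive as two tokens

# Stateless decoding of each fragment yields replacement characters:
broken = "".join(f.decode("utf-8", errors="replace") for f in fragments)
print(broken)                      # '�' garbage instead of '中'

# Accumulating the bytes and decoding them together restores the text:
print(b"".join(fragments).decode("utf-8"))  # → 中
```

A stateful detokenizer buffers the incomplete bytes until the character is whole, which is why it must hold state across calls.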

The correct way could be either:

a. Call the model directly, without the need for manual detokenization: https://github.com/hyperonym/basaran/blob/5ef5ef006b2acd59d0512409e41e693b142aef66/examples/basaran-python-library/main.py#L8-L9

b. Create an instance of StreamTokenizer and use that instead: https://github.com/hyperonym/basaran/blob/5ef5ef006b2acd59d0512409e41e693b142aef66/tests/test_tokenizer.py#L54-L61

lucasjinreal commented 1 year ago

@peakji Thank you! I have solved the first problem.

The English seems OK now, but the Chinese is still not OK: [screenshot]

[screenshot]

Some of the Chinese characters are OK, but some still come out with weird encoding.

lucasjinreal commented 1 year ago

Some \n characters which are actually needed seem to be trimmed:

[screenshot]

lucasjinreal commented 1 year ago

I resolved the \n issue, but clearly the Chinese does not always work:

[screenshot]

Please test this more deeply!

peakji commented 1 year ago

We need more information to assist you in resolving the issue. These screenshots alone don't provide much valuable information.

Could you please provide the code you are testing for us to reproduce?