lucasjinreal opened 1 year ago
Hi @lucasjinreal. We need more information in order to assist you in resolving the issue.
May I ask which model you are using? Are you using it through the API or through Python?
@peakji I think it's not related to the model. For the model I'm simply using LLaMA.
The reason is that when we decode the same ID on its own versus decoding the IDs as part of a sentence, the tokenizer output can differ.
For instance, for the ids [34, 56, 656], the tokenizer would decode: I love u
But if you decode them one by one, you get: Iloveu
It doesn't preserve the spaces, and Chinese characters are even worse.
However, I'm not sure whether this is really the cause.
But the above is the problem I'm actually seeing.
What do you think? (Mainly: simple words lose the spaces they have in the original, and Chinese characters decode incorrectly.)
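To make the hypothesis concrete, here is a toy illustration (a made-up three-entry vocabulary, not the real LLaMA tokenizer): SentencePiece-style models mark word boundaries with a metaspace character ("▁") that only turns back into a space when a sequence of tokens is decoded together, so per-token decoding drops the spaces.

```python
# Hypothetical toy vocab for illustration only.
VOCAB = {34: "\u2581I", 56: "\u2581love", 656: "\u2581u"}

def decode(ids):
    # Decoding a whole sequence: join the pieces, turn metaspaces
    # into spaces, and strip the leading one.
    text = "".join(VOCAB[i] for i in ids)
    return text.replace("\u2581", " ").lstrip(" ")

print(decode([34, 56, 656]))                        # I love u
print("".join(decode([i]) for i in [34, 56, 656]))  # Iloveu
```

Decoding each ID in isolation strips what would have been a leading space every time, which is exactly the "Iloveu" symptom.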
Or maybe there is something missing inside your StreamTokenizer? (like ignoring some ids). Can you try decoding the ids one by one and printing them?
outputs = []
for oid in output_ids:
    # if i > len(input_ids[0]):
    #     print(oid)
    word = tokenizer.decode(oid[0])
    print(word, end='')
    outputs.append(word)
    # else:
    #     i += 1
print()
outputs = ''.join(outputs)
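One way to avoid the lost spaces without a stateful tokenizer is the prefix-delta trick: decode the whole prefix seen so far and emit only the newly added suffix. A sketch using the same hypothetical toy vocabulary as above (not the real tokenizer):

```python
# Hypothetical toy vocab for illustration only.
VOCAB = {34: "\u2581I", 56: "\u2581love", 656: "\u2581u"}

def decode(ids):
    return "".join(VOCAB[i] for i in ids).replace("\u2581", " ").lstrip(" ")

output_ids = [34, 56, 656]
emitted = ""
for i in range(1, len(output_ids) + 1):
    text = decode(output_ids[:i])   # decode the growing prefix
    delta = text[len(emitted):]     # only the newly produced characters
    print(delta, end="")
    emitted = text
print()
# emitted == "I love u", with the spaces preserved
```

This is essentially what a stateful streaming detokenizer does for you, without re-decoding the full prefix on every step.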
I was wrong.
The reason is that when we decode the same ID on its own versus decoding the IDs as part of a sentence, the tokenizer output can differ.
StreamTokenizer is specifically designed to handle this properly.
There is an example of the LLaMA tokenizer in the test case, which also includes Chinese characters:
https://github.com/hyperonym/basaran/blob/master/tests/test_tokenizer.py#L48
@peakji Thanks. I'm just using the tokenizer from StreamModel, and the Chinese decoding errors still exist.
And I still can't get the spaces between English words.
I think the output stream has some problems. How can I combine it with the model and tokenizer and print the correct words in the terminal?
I got no spaces, and the Chinese was wrong either way (trying print(word, end='')).
I don't want a line break after every word, and I don't want unexpected spaces around non-English characters.
Could you please provide some example code for us to reproduce the issue?
The output in your first screenshot is apparently not from StreamTokenizer.
@peakji The second one is. I first use model = StreamModel(model, tokenizer) and then use model.tokenizer to decode.
Can you guys provide a demo that prints the correct values without a line break after each one? (i.e., correctly printing the words one by one)
I first use model = StreamModel(model, tokenizer) and then use model.tokenizer to decode.
You shouldn't use model.tokenizer directly, because it's not a stateful StreamTokenizer but a stateless Hugging Face tokenizer.
The correct way could be either:
a. Call the model directly without the need for manual detokenization: https://github.com/hyperonym/basaran/blob/5ef5ef006b2acd59d0512409e41e693b142aef66/examples/basaran-python-library/main.py#L8-L9
b. Create an instance of StreamTokenizer and use that instead: https://github.com/hyperonym/basaran/blob/5ef5ef006b2acd59d0512409e41e693b142aef66/tests/test_tokenizer.py#L54-L61
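The value of a stateful decoder is easiest to see with multi-byte characters, which is also why Chinese output garbles when each token is decoded in isolation: one UTF-8 character can be split across two tokens. A toy sketch of the buffering idea (this is not basaran's actual StreamTokenizer implementation):

```python
class ByteStreamDecoder:
    """Buffers raw bytes until they form complete UTF-8 characters."""

    def __init__(self):
        self.buffer = b""

    def feed(self, token_bytes):
        self.buffer += token_bytes
        try:
            text = self.buffer.decode("utf-8")
        except UnicodeDecodeError:
            return ""        # hold incomplete bytes until the next token
        self.buffer = b""
        return text

# The character "中" is three bytes; imagine a tokenizer split them 2 + 1.
pieces = ["中".encode("utf-8")[:2], "中".encode("utf-8")[2:]]
print([p.decode("utf-8", errors="replace") for p in pieces])  # garbled pieces

dec = ByteStreamDecoder()
print("".join(dec.feed(p) for p in pieces))  # 中
```

A production implementation would also need to flush or replace bytes that can never become valid UTF-8; this sketch only shows why per-token decoding produces replacement characters for CJK text.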
@peakji Thank you! I have solved the first problem.
The English seems OK now, but Chinese is still not OK.
Some Chinese characters are fine, while some still come out with weird encoding.
Some \n characters which are actually needed seem to be trimmed:
I resolved the \n issue, but the Chinese clearly doesn't always work:
Please test this more deeply!
We need more information to assist you in resolving the issue. These screenshots alone don't provide much valuable information.
Could you please provide the code you are testing for us to reproduce?
How can this problem be resolved?