Using the microsoft/Phi-3-medium-128k-instruct model, I received incorrect responses for multi-byte characters (commonly seen in Japanese or Chinese), as shown below:
Well, the streaming detokenizer and the naive detokenizer should give the same results. For now you can use the naive one until we fix the streaming one. It will be a little slower, but otherwise should work fine.
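For context on why a streaming detokenizer can mangle Japanese or Chinese text, here is a minimal pure-Python sketch (not mlx-lm's actual API) of the failure mode: a multi-byte UTF-8 character split across two tokens gets corrupted if each token's bytes are decoded independently, while buffering bytes until they form valid UTF-8 recovers the correct text.

```python
text = "日本語"            # 3 characters, 9 UTF-8 bytes
data = text.encode("utf-8")

# Simulate two tokens that split the first character mid-sequence.
chunks = [data[:2], data[2:]]

# Naive per-chunk decode: the split character becomes replacement chars.
naive = "".join(c.decode("utf-8", errors="replace") for c in chunks)

# Buffered streaming decode: hold bytes until they form valid UTF-8.
buf = b""
parts = []
for c in chunks:
    buf += c
    try:
        parts.append(buf.decode("utf-8"))
        buf = b""
    except UnicodeDecodeError:
        pass  # wait for the rest of the multi-byte sequence
streamed = "".join(parts)

print(naive)     # garbled output containing U+FFFD replacement characters
print(streamed)  # 日本語
```

A correct streaming detokenizer has to do the buffered variant; a bug in that buffering is consistent with the corrupted characters reported above.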
This issue can be fixed by setting `is_spm_decoder` to False and using `NaiveStreamingDetokenizer` instead of `SPMStreamingDetokenizer`. Are there any guidelines or recommendations on which `Detokenizer` class to use (or settings to apply) to get correct characters?