Closed JohnHerry closed 1 month ago
Since the authors are using a character-level text encoder, I see no problem in using the encoder of ByT5, which should be able to handle all languages irrespective of vocabulary.
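For reference, a minimal sketch of feeding mixed-language text through the ByT5 encoder via Hugging Face `transformers` (the checkpoint name `google/byt5-small` is an illustrative assumption, not necessarily what this repo would use):

```python
# Sketch: encoding mixed-language text with ByT5's byte-level encoder.
# Assumes the Hugging Face `transformers` library; `google/byt5-small`
# is an illustrative checkpoint choice.
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
encoder = T5EncoderModel.from_pretrained("google/byt5-small")

text = "你好 world"
inputs = tokenizer(text, return_tensors="pt")  # byte-level ids, so no OOV tokens
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # shape: (1, seq_len, d_model)
print(hidden.shape)
```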
Thank you for the help, I will give it a try.
Hi, I see the text tokenizer is just a simple `bytes(text, "UTF-8")`. Will it support multiple languages? E.g., if my input is Chinese-English mixed sentences, each Chinese character will be converted into three tokens valued in [0, 255], while each English (ASCII) character maps to a single token valued in [0, 127]. Can this input be trained, or should we choose another text tokenizer in this case?
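For concreteness, a minimal sketch of what this byte-level scheme produces for a mixed sentence (plain Python, no project-specific code):

```python
# Sketch: byte-level "tokenization" of a Chinese-English mixed sentence.
# Each ASCII character becomes one token in [0, 127]; each common Chinese
# character becomes three UTF-8 bytes, each in [0, 255].
text = "你好 world"
tokens = list(bytes(text, "UTF-8"))  # same as list(text.encode("utf-8"))

print(tokens)
# [228, 189, 160, 229, 165, 189, 32, 119, 111, 114, 108, 100]
#  |--- 你 ---|  |--- 好 ---|  ' '  w    o    r    l    d

# The vocabulary is fixed at 256 byte values, so mixed-language input
# never produces out-of-vocabulary tokens; non-ASCII scripts simply
# yield longer token sequences.
```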