openlm-research / open_llama

OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA 7B trained on the RedPajama dataset
Apache License 2.0

Add larger-than-character-level subword vocab for non-latin languages? #2

Open cstorm125 opened 1 year ago

cstorm125 commented 1 year ago

Hi OpenLM team,

Thank you for such a great contribution to the open source community. At PyThaiNLP we've been trying to replicate Alpaca-like instruction followers for non-Latin languages (Thai, Japanese, Vietnamese and so on) with XGLM. We found that if the pretrained model's vocabulary contains anything better than character-level subwords for these languages, we can make Alpacas out of them (more details in our blog); if not, the model will understand the language but will output gibberish, as (we speculate) the character-level subwords are too noisy to predict.

While I'm very excited about OpenLLaMA, from what I've read so far the only difference from the original LLaMA is the dataset, which means the model will still understand but not be able to generate these non-Latin languages after finetuning. Would it be possible for you to use a subword vocabulary in which the non-Latin languages get larger-than-character-level subwords, such as XGLM's 250k-entry vocabulary?
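To make the concern concrete, here is a minimal sketch (not part of the original thread) that compares how an OpenLLaMA-style tokenizer and XGLM's larger multilingual tokenizer split a Thai sentence. The specific checkpoint names and the example sentence are assumptions for illustration; substitute whichever checkpoints you actually use.

```python
# Illustrative sketch: compare tokenizer granularity on Thai text.
# Checkpoint names below are assumptions, not endorsed by the thread.
from transformers import AutoTokenizer

llama_tok = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b")
xglm_tok = AutoTokenizer.from_pretrained("facebook/xglm-564M")

text = "สวัสดีครับ"  # a short Thai greeting

for name, tok in [("OpenLLaMA", llama_tok), ("XGLM", xglm_tok)]:
    pieces = tok.tokenize(text)
    # A much longer piece list for the same text suggests character/byte-level
    # fallback, which is what (per the issue) makes generation noisy after
    # instruction finetuning.
    print(name, len(pieces), pieces)
```

If the first tokenizer produces several times more pieces than the second for the same sentence, that is the character-level fallback the issue describes.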

young-geng commented 1 year ago

Thanks for the suggestions! At the moment, we want to stick to the original LLaMA configurations as much as possible and don't have the resources to retrain our model with a different tokenizer. We will look into this in the future for the next model we train.

cstorm125 commented 1 year ago

@young-geng Thank you! I read that you might be training a 3B version. It would be fantastic if you could consider this proposal for that model.