What is the purpose of expanding the Chinese vocabulary?

git-cloner / llama2-lora-fine-tuning

llama2 finetuning with deepspeed and lora

https://gitclone.com/aiit/chat/

MIT License

162 stars 14 forks

What is the purpose of expanding the Chinese vocabulary? #1

Closed: goog closed this issue 1 year ago

little51 commented 1 year ago

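My personal view is that the sentences generated at inference time are permutations and combinations of entries in the vocabulary; at least, that is how I understand GPT2 to work, and a word with no vocabulary entry comes out as UNK. Here I followed Chinese-LLaMA-Alpaca, and in the end it did not improve the results much. When I worked on https://github.com/git-cloner/Llama2-chinese I did not expand the vocabulary and the results were still decent, so I have not looked into this in depth.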

goog commented 1 year ago

Got it. Word segmentation is too much trouble; GPT tokens are there to compress information.
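To make the compression point concrete, here is a minimal sketch of how extra vocabulary entries shorten the token sequence for Chinese text. It is a toy greedy longest-match tokenizer with a UTF-8 byte fallback, not the actual LLaMA tokenizer or the Chinese-LLaMA-Alpaca merge procedure; the function `tokenize` and both vocabularies are hypothetical and chosen only for illustration.

```python
def tokenize(text, vocab):
    """Toy greedy longest-match tokenizer with UTF-8 byte fallback (illustrative only)."""
    max_len = max((len(entry) for entry in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # No vocabulary entry matched: emit one token per UTF-8 byte.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


# Hypothetical vocabularies: the base one has no Chinese entries at all.
base_vocab = {"l", "a", "m", "llama"}
extended_vocab = base_vocab | {"中文", "词表"}

text = "中文词表"
base_tokens = tokenize(text, base_vocab)
extended_tokens = tokenize(text, extended_vocab)

print(len(base_tokens), base_tokens)          # 12 byte-level tokens
print(len(extended_tokens), extended_tokens)  # 2 tokens: ['中文', '词表']
```

In this toy setup the same four characters cost 12 tokens without Chinese entries and 2 with them, so an expanded vocabulary mainly buys shorter sequences. Whether that also improves output quality is a separate question, and the experience reported above suggests the gain can be small.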