tinkerhub / malayalam-llm

A Malayalam large language model fine-tuned on top of open-source models
2 stars, 1 fork

Train sentencepiece tokenizer on larger text #1

Open gksoriginals opened 10 months ago

gksoriginals commented 10 months ago

Refer to this Colab notebook for building a SentencePiece tokenizer.

  1. Evaluate the tokenizer using token fertility based on this paper
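
A minimal sketch of the fertility check in step 1, assuming fertility means the average number of tokens produced per whitespace-separated word (the linked paper may define words or normalisation differently). The names `token_fertility` and `char_pair_tokenize` are hypothetical and only for illustration; the toy tokenizer stands in for a trained SentencePiece model.

```python
from typing import Callable, Iterable, List


def token_fertility(texts: Iterable[str],
                    tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word.

    Assumption: a "word" is anything separated by whitespace.
    Lower values mean the tokenizer splits words into fewer pieces.
    """
    total_words = 0
    total_tokens = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words found in the input texts")
    return total_tokens / total_words


def char_pair_tokenize(text: str) -> List[str]:
    """Toy stand-in tokenizer: cut each word into chunks of up to 2 characters."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + 2] for i in range(0, len(word), 2))
    return pieces


if __name__ == "__main__":
    sample = ["abcd ef", "ghi"]
    print(token_fertility(sample, char_pair_tokenize))
```

With a trained model, the `tokenize` argument could probably be something like `lambda t: sp.encode(t, out_type=str)`, where `sp` is a `sentencepiece.SentencePieceProcessor` loaded from the model file produced by the Colab. Running the same held-out Malayalam text through tokenizers trained on different corpus sizes would then give directly comparable fertility numbers.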