ljvmiranda921 / ljvmiranda921.github.io

✨ Github repository for my website
https://ljvmiranda921.github.io/
Creative Commons Attribution 4.0 International

Cebuano language model #355

Closed ljvmiranda921 closed 7 months ago

ljvmiranda921 commented 1 year ago

Maybe the question here is less "can we build a performant Cebuano LM?" and more "what happens if you train an LM on highly synthetic data?" Cebuano Wikipedia is 99% made by bots, so it would be interesting to see the effect of that on the corresponding language model.

Potential impact: the training corpora of most LMs today (The Pile, CommonCrawl, C4, etc.) aren't really free of bot text, and the amount of synthetic data in our training pool might increase in the future. Cebuano might be an interesting "patient zero" for this "synthetic proliferation."

Fun titles: