Integrate datasets used for LLM training as monolingual datasets

mozilla / translations

The code, training pipeline, and models that power Firefox Translations

https://mozilla.github.io/translations/

Mozilla Public License 2.0


Integrate datasets used for LLM training as monolingual datasets #766

Open marco-c opened 3 months ago

marco-c commented 3 months ago

For example: https://huggingface.co/datasets/ontocord/CulturaY

eu9ene commented 2 months ago

Based on the description, it's not clear how this differs from HPLT 1.2, which we have already integrated. They say the cleaning procedures are different. We use HPLT fluency scores to discard noisier data.
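
To make the score-based filtering idea concrete, here is a minimal sketch, not the pipeline's actual code. It assumes each document arrives as a JSON line with a text field and a numeric score field; the field names `text` and `score`, the function name `filter_by_score`, and the 0.8 threshold are all hypothetical and chosen only for illustration.

```python
import json
from typing import Iterable, Iterator


def filter_by_score(
    lines: Iterable[str],
    min_score: float,
    score_key: str = "score",  # hypothetical field name
    text_key: str = "text",    # hypothetical field name
) -> Iterator[str]:
    """Yield the text of each JSON-lines record whose score is at least min_score."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        score = record.get(score_key)
        # Records without a score are treated as noisy and dropped.
        if score is None or score < min_score:
            continue
        yield record[text_key]


# Made-up sample records, for illustration only.
sample = [
    '{"text": "A clean, fluent sentence.", "score": 0.92}',
    '{"text": "buy cheap click here !!!", "score": 0.31}',
    '{"text": "No score attached."}',
]

kept = list(filter_by_score(sample, min_score=0.8))
print(kept)
```

Raising the threshold trades corpus size for cleanliness, which is why a second dataset with a different cleaning procedure only helps if it contributes text that survives a comparable filter.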

gregtatum commented 2 months ago

It may also be the case that we have enough monolingual data for now from integrating the HPLT and NLLB monolingual datasets.