Open marco-c opened 3 months ago
Based on the dataset's description, it's not clear how it differs from HPLT 1.2, which we have already integrated. They say the cleaning procedures are different. We use HPLT fluency scores to discard noisier data.
It may also be that we already have enough monolingual data for now, having integrated the HPLT and NLLB monolingual data.
For example https://huggingface.co/datasets/ontocord/CulturaY.
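
For context, score-based filtering of the kind mentioned above could look roughly like the sketch below. This is only an illustration, not our pipeline code: the JSONL layout, the `score` and `text` field names, and the `0.5` threshold are all assumptions made for the example.

```python
import json
from typing import Iterable, Iterator


def filter_by_score(
    lines: Iterable[str],
    min_score: float,
    score_key: str = "score",  # hypothetical field name
    text_key: str = "text",    # hypothetical field name
) -> Iterator[str]:
    """Yield the text of each JSONL record whose score is at least min_score."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        score = record.get(score_key)
        # Records without a score are treated as noisy and dropped.
        if score is not None and score >= min_score:
            yield record[text_key]


# Made-up sample records, only to show the call.
sample = [
    '{"text": "A clean sentence.", "score": 0.9}',
    '{"text": "n0isy te xt", "score": 0.2}',
    '{"text": "No score here."}',
]
kept = list(filter_by_score(sample, min_score=0.5))
```

A new dataset would only be comparable under this kind of filter if it ships a similar per-document or per-sentence quality score.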