We currently process HPLT manually with https://github.com/mozilla/firefox-translations-training/blob/main/utils/download_hplt.py and upload it to GCP to use later as a custom dataset. We could integrate this script into the pipeline as a regular dataset importer.
The data was produced from web crawls and is also provided in a cleaned version. It includes language detection via FastSpell (a combination of FastText and Hunspell) as well as fluency scoring (a 7-gram modified Kneser-Ney character language model).
https://arxiv.org/abs/2403.14009
The data comes as JSONL. Each line is a document, and the document's text is newline-delimited (one paragraph per line).
Example line:
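The sketch below is an illustrative record, not a verbatim sample from the dataset; the field names are assumptions based on the v1.2 cleaned format, with per-paragraph `langs` and `scores` aligned to the newline-delimited `text`:

```json
{
  "id": "example-0001",
  "document_lang": "en",
  "langs": ["en", "en"],
  "scores": [0.93, 0.78],
  "text": "First paragraph of the document.\nSecond paragraph of the document.",
  "url": "https://example.com/page",
  "collection": "example-crawl"
}
```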
In order to integrate this data source we would need to locate and download the files. These are structured logically and documented here: https://hplt-project.org/datasets/v1.2
We would want to use the cleaned data.
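As a rough illustration, the importer could stream and decompress the compressed JSONL shards directly. The URL below is a placeholder rather than the real file layout, which is documented at the link above, and the `.jsonl.zst` extension is an assumption:

```python
import io
import json

import requests
import zstandard  # pip install zstandard

# Placeholder URL: the real shard paths differ per language and are
# documented at https://hplt-project.org/datasets/v1.2
SHARD_URL = "https://example.org/hplt/cleaned/en/en_1.jsonl.zst"


def iter_hplt_documents(url: str):
    """Stream a compressed HPLT shard and yield one parsed document per line."""
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        decompressor = zstandard.ZstdDecompressor()
        with decompressor.stream_reader(response.raw) as reader:
            for line in io.TextIOWrapper(reader, encoding="utf-8"):
                yield json.loads(line)
```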
Then, for each document, we would need to split at the paragraph level (paragraphs are newline-delimited). Optionally, we could add a hyperparameter to combine multiple consecutive paragraphs into a single line.
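A minimal sketch of that splitting step, assuming the document text lives in a `text` field and using a hypothetical `max_paragraphs` hyperparameter for the optional combining:

```python
def split_document(document: dict, max_paragraphs: int = 1) -> list[str]:
    """Split one HPLT document into output lines of up to `max_paragraphs` paragraphs."""
    # The `text` field name is an assumption; paragraphs are newline-delimited.
    paragraphs = [p.strip() for p in document["text"].split("\n") if p.strip()]
    lines = []
    for i in range(0, len(paragraphs), max_paragraphs):
        # Join a window of consecutive paragraphs into a single output line.
        lines.append(" ".join(paragraphs[i : i + max_paragraphs]))
    return lines
```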
Then we would need to decide on a score threshold, which is another hyperparameter.
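For the threshold, a sketch assuming the per-paragraph fluency scores are stored in a `scores` list aligned with the paragraphs (both the field name and the alignment are assumptions); `min_score` would be the exposed hyperparameter:

```python
def filter_paragraphs(document: dict, min_score: float = 0.8) -> list[str]:
    """Keep only paragraphs whose fluency score meets the threshold."""
    paragraphs = document["text"].split("\n")
    scores = document.get("scores", [])
    return [
        paragraph
        for paragraph, score in zip(paragraphs, scores)
        if score >= min_score
    ]
```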
I think with this we would be good to use the data in the pipeline.