Integrate HPLT Datasets v1.2 as a monolingual dataset

The data was produced from web crawls and is also available in a cleaned version. It includes language detection via FastSpell (a combination of FastText and Hunspell) and fluency scoring (a 7-gram modified Kneser-Ney character language model).

This fluency score can be used to estimate the ‘quality’ of each paragraph in a document, which lets us filter out noise that may be detrimental for training language models.

https://arxiv.org/abs/2403.14009

The data comes as JSONL. Each line is a document, and the paragraphs within its "text" field are newline delimited.

Example line:

{
  "id": 65,
  "document_lang": "fi",
  "scores": [
    0.826,
    0.386,
    0.789,
    ...
  ],
  "langs": [
    "fi",
    "en",
    "fi",
    ...
  ],
  "text": "Tulevaisuuden työelämä vaatii uudenlaista osaamista - DigiMaMa\nSkip to content\nLiity jäseneksi\n...",
  "url": "https://www.digimama.fi/artikkelit/tulevaisuuden-tyoelama-vaatii-uudenlaista-osaamista/",
  "collection": "cc40"
}
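A minimal sketch of reading one such line, to make the structure concrete. It assumes that "scores" and "langs" line up one-to-one with the newline-delimited paragraphs in "text", which the example suggests but which should be verified against the real files. The names `Paragraph` and `iter_paragraphs` are hypothetical.

```python
import json
from typing import Iterator, NamedTuple


class Paragraph(NamedTuple):
    text: str
    score: float
    lang: str


def iter_paragraphs(jsonl_line: str) -> Iterator[Paragraph]:
    """Yield one Paragraph per newline-delimited chunk of a document."""
    doc = json.loads(jsonl_line)
    paragraphs = doc["text"].split("\n")
    scores = doc["scores"]
    langs = doc["langs"]

    # Assumption: one score and one language tag per paragraph.
    # Fail loudly if a document does not follow that layout.
    if not (len(paragraphs) == len(scores) == len(langs)):
        raise ValueError(
            f"Misaligned document {doc.get('id')}: "
            f"{len(paragraphs)} paragraphs, {len(scores)} scores, {len(langs)} langs"
        )

    for text, score, lang in zip(paragraphs, scores, langs):
        yield Paragraph(text, score, lang)
```

Raising on misaligned documents is a deliberately strict choice for a first pass; a production importer might prefer to skip and count them instead.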

In order to integrate this data source we would need to locate and download the files. These are structured logically and documented here: https://hplt-project.org/datasets/v1.2

We would want to use the clean data.

Then for each document, we would need to split at the "paragraph" level, which is newline delimited. Optionally we could include a hyperparameter to combine multiple paragraphs into one.
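The optional paragraph-combining step could look roughly like the sketch below. The function name `combine_paragraphs`, the `max_paragraphs` hyperparameter, and the choice to join with a single space are all assumptions made for illustration, not an agreed design.

```python
from typing import Iterable, Iterator, List


def combine_paragraphs(
    paragraphs: Iterable[str],
    max_paragraphs: int = 1,
    separator: str = " ",
) -> Iterator[str]:
    """Group consecutive paragraphs into chunks of up to max_paragraphs."""
    if max_paragraphs < 1:
        raise ValueError("max_paragraphs must be at least 1")

    chunk: List[str] = []
    for paragraph in paragraphs:
        paragraph = paragraph.strip()
        if not paragraph:
            # Skip empty paragraphs so they do not count toward the chunk size.
            continue
        chunk.append(paragraph)
        if len(chunk) == max_paragraphs:
            yield separator.join(chunk)
            chunk = []

    # Emit whatever is left over at the end of the document.
    if chunk:
        yield separator.join(chunk)
```

With `max_paragraphs=1` this reduces to plain paragraph-level splitting, so the default keeps the simplest behavior.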

Then we would need to decide on a score threshold, which is another hyperparameter.
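As a rough illustration of the thresholding step, the sketch below keeps only paragraphs whose score meets a minimum. It assumes that a higher score means more fluent text and that scores align one-to-one with paragraphs. The extra check that a paragraph's language matches "document_lang" is an optional idea, not something settled here. `filter_document`, `min_score` and `require_document_lang` are hypothetical names, and the default of 0.5 is a placeholder rather than a recommended value.

```python
import json
from typing import List


def filter_document(
    jsonl_line: str,
    min_score: float = 0.5,
    require_document_lang: bool = True,
) -> List[str]:
    """Return the paragraphs of one document that pass the filters."""
    doc = json.loads(jsonl_line)
    paragraphs = doc["text"].split("\n")

    kept: List[str] = []
    for text, score, lang in zip(paragraphs, doc["scores"], doc["langs"]):
        # Assumption: higher fluency scores indicate cleaner text.
        if score < min_score:
            continue
        # Optional: drop paragraphs detected as a different language
        # than the document as a whole (e.g. English navigation text).
        if require_document_lang and lang != doc["document_lang"]:
            continue
        kept.append(text)
    return kept
```

On the example document above, a threshold of 0.5 would drop the second paragraph ("Skip to content", scored 0.386 and tagged "en") and keep the first and third.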

I think with this we would be good to use the data in the pipeline.

Language	Code	Docs	Words
Afrikaans	af	747.23K	829.49M
Arabic	ar	26.80M	31.85B
Azerbaijani	az	1.10M	1.13B
Belarusian	be	356.53K	394.19M
Bulgarian	bg	6.50M	8.76B
Bangla	bn	2.88M	2.77B
Catalan	ca	4.54M	5.76B
Czech	cs	16.99M	19.11B
Welsh	cy	111.25K	124.06M
Danish	da	8.18M	9.37B
German	de	101.41M	110.98B
Greek	el	15.83M	33.76B
English	en	1.02B	2.31T
Esperanto	eo	67.81K	101.70M
Spanish	es	129.29M	181.23B
Estonian	et	1.48M	1.74B
Basque	eu	343.95K	324.64M
Persian	fa	30.90M	47.58B
Finnish	fi	7.15M	9.04B
French	fr	99.59M	122.88B
Irish	ga	115.53K	130.68M
Galician	gl	731.36K	847.40M
Gujarati	gu	264.82K	303.63M
Serbo-Croatian	hbs	8.68M	10.03B
Hebrew	he	4.98M	7.49B
Hindi	hi	5.77M	7.54B
Hungarian	hu	11.71M	14.39B
Armenian	hy	621.47K	589.95M
Indonesian	id	31.42M	42.08B
Icelandic	is	481.33K	562.01M
Italian	it	53.53M	74.45B
Japanese	ja	190.41M	63.23B
Georgian	ka	533.07K	573.88M
Kazakh	kk	406.35K	471.76M
Kannada	kn	228.22K	235.58M
Korean	ko	31.85M	25.52B
Kyrgyz	ky	88.32K	101.62M
Latin	la	301.70K	294.13M
Lithuanian	lt	2.72M	2.95B
Latvian	lv	1.54M	1.59B
Macedonian	mk	734.69K	736.55M
Malayalam	ml	469.98K	517.83M
Mongolian	mn	594.90K	803.21M
Marathi	mr	453.69K	519.55M
Malay	ms	4.87M	9.03B
Maltese	mt	111.12K	102.42M
Burmese	my	239.47K	357.11M
Norwegian Bokmål	nb	6.12M	8.30B
Nepali	ne	863.35K	694.40M
Dutch	nl	31.75M	33.30B
Norwegian Nynorsk	nn	228.48K	298.57M
Punjabi	pa	152.78K	184.77M
Polish	pl	39.38M	44.17B
Pashto	ps	88.21K	113.19M
Portuguese	pt	58.24M	81.41B
Romanian	ro	14.47M	19.49B
Russian	ru	224.20M	284.58B
Sinhala	si	322.51K	568.03M
Slovak	sk	4.62M	4.98B
Slovenian	sl	2.20M	2.51B
Somali	so	283.71K	211.80M
Albanian	sq	1.24M	1.34B
Swedish	sv	13.67M	16.91B
Swahili	sw	698.57K	668.17M
Tamil	ta	1.24M	1.91B
Telugu	te	415.60K	437.74M
Thai	th	8.19M	4.33B
Filipino	tl	585.24K	911.06M
Turkish	tr	27.05M	42.65B
Tatar	tt	65.15K	74.86M
Ukrainian	uk	9.31M	10.57B
Urdu	ur	1.44M	1.42B
Uzbek	uz	290.29K	367.25M
Vietnamese	vi	31.50M	49.36B
Chinese	zh	1.08B	432.88B
