huggingface / olm-datasets

Pipeline for pulling and processing online language model pretraining data from the web
Apache License 2.0

CC data Language Splits #6

Open KeremTurgutlu opened 1 year ago

KeremTurgutlu commented 1 year ago

Thanks a lot for putting this repo together and providing the fresh CC dumps on HF. I was looking for a way to find dataset splits for other languages but couldn't find one. Are the olm/olm-CC-MAIN-* datasets monolingual by any chance?
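For context, this is roughly how I was checking (the dataset id below is a placeholder, not a real olm snapshot name):

```python
# Illustrative check of what a snapshot exposes on the Hub; the dataset id
# is a placeholder, not an actual olm snapshot name.
from datasets import get_dataset_config_names, get_dataset_split_names

dataset_id = "olm/olm-CC-MAIN-2022-49"  # placeholder snapshot id
print(get_dataset_config_names(dataset_id))  # no per-language configs listed
print(get_dataset_split_names(dataset_id))   # typically just ["train"]
```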

spate141 commented 1 year ago

Only the English (en) text from the WET files has been processed and uploaded, via the get_text_dataset_from_wet_downloads.py script. That script uses the fastText language identification model to detect each document's language and saves the text extracted from the WET files into per-language directories.
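For illustration, the routing idea looks roughly like this. This is a minimal sketch, not the repo's actual code; the output paths and per-language layout are my assumptions, and it expects fastText's lid.176.bin language-ID model to be downloaded locally:

```python
# Minimal sketch of the language-routing step (not the repo's actual code):
# classify each document with fastText's language-ID model and append it to
# a per-language directory. Paths and layout are assumptions.
import os
import fasttext

model = fasttext.load_model("lid.176.bin")  # fastText language-ID model

def route_by_language(texts, out_dir="wet_text"):
    for text in texts:
        # predict() rejects strings containing newlines, so classify a
        # flattened copy of the document.
        flat = text.replace("\n", " ")
        labels, _probs = model.predict(flat)
        lang = labels[0].removeprefix("__label__")  # e.g. "en", "de", "tr"
        lang_dir = os.path.join(out_dir, lang)
        os.makedirs(lang_dir, exist_ok=True)
        with open(os.path.join(lang_dir, "part.txt"), "a", encoding="utf-8") as f:
            f.write(flat + "\n")
```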

KeremTurgutlu commented 1 year ago

Thanks for the reply. I was just surprised not to see other languages, since they are already processed in the code as you mentioned. I couldn't find any language-specific filter in this code. Also, it looks like all language directories are uploaded here: https://github.com/huggingface/olm-datasets/blob/535c2c9250539cf3277d74e2ff664ba98c1ca033/pipeline_scripts/common_crawl/get_text_dataset_from_wet_downloads.py#L96. Maybe they later decided to manually upload only en?

It actually gets filtered at the bloom-filter stage, where there is a lang-id argument. So I assume the earlier all-language uploads from the first stage were simply never made public.
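To illustrate what I mean by the lang-id gate, here is a hypothetical sketch; the `--lang_id` flag name and the `wet_text/<lang>` layout are my assumptions, not the script's actual interface:

```python
# Hypothetical sketch: only the chosen language's first-stage directory
# feeds the dedup step. Flag name and directory layout are assumptions.
import argparse
import pathlib

parser = argparse.ArgumentParser()
parser.add_argument("--lang_id", default="en", help="fastText language code")
args = parser.parse_args()

lang_dir = pathlib.Path("wet_text") / args.lang_id
shards = sorted(lang_dir.glob("*.txt"))
print(f"feeding {len(shards)} {args.lang_id} shards to the bloom filter stage")
```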

@TristanThrush would it be possible to make the other language splits public if they are readily available? If not, could they be included in future snapshot jobs? Thanks ✌️

spate141 commented 1 year ago

Yes, there are no filters: files for 120+ other languages are generated by that script into their respective directories. I'm not sure why they were not made available. Perhaps because they would still need the bloom-filter stage applied, which is resource-intensive.
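To give a sense of the cost, here is a back-of-the-envelope pure-Python sketch; the pipeline's real implementation and parameters may differ:

```python
# Why the bloom-filter stage is expensive: at a 1% false-positive rate a
# bloom filter needs ~9.6 bits per item, so on the order of billions of CC
# documents means a multi-gigabyte bit array held in memory for the whole
# dedup pass. Pure-Python illustration only.
import hashlib
import math

class BloomFilter:
    def __init__(self, n_items: int, fp_rate: float = 0.01):
        # Standard sizing: m bits and k hash functions for n items.
        self.m = math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray(self.m // 8 + 1)

    def _positions(self, item: str):
        # Double hashing: derive k indices from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step size
        return ((h1 + i * h2) % self.m for i in range(self.k))

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(n_items=1_000_000)      # ~1.2 MB of bits at this size
bf.add("some document text")
print("some document text" in bf)        # True
print("something else entirely" in bf)   # False (up to the ~1% FP rate)
```

Scaled to a full snapshot, the bit array alone runs to several gigabytes resident in memory for the entire pass, which may be why only en was pushed through that stage.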