togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

Other language data #93

Open Dzg0309 opened 6 months ago

Dzg0309 commented 6 months ago

Thank you very much for your work in providing such rich data to the open source community. I was wondering if there are any plans to release data in other languages, such as Chinese? I think many people also need Chinese data.

mauriceweber commented 5 months ago

Hi @Dzg0309 -- currently we don't have plans to release data in other languages. However, if you want to create such a dataset (e.g. in Chinese), you can use the CCNet pipeline and the scripts in this repo to compute quality signals and deduplicate the corpus. Note that in other languages you will likely have to adapt the quality signals.
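
As a rough illustration of that last point (a sketch only, not code from this repo or from CCNet): a signal based on whitespace-separated word counts is uninformative for Chinese, because Chinese text is not delimited by spaces. The function names below are hypothetical, and the check covers only the main CJK Unified Ideographs block, which is an assumption made to keep the example short.

```python
def is_cjk(ch: str) -> bool:
    """True for characters in the main CJK Unified Ideographs block.

    Assumption: U+4E00..U+9FFF is enough for this sketch; extension
    blocks and other scripts are deliberately ignored.
    """
    return "\u4e00" <= ch <= "\u9fff"


def whitespace_word_count(text: str) -> int:
    """Word count as it is often computed for English: split on whitespace."""
    return len(text.split())


def cjk_char_count(text: str) -> int:
    """Length signal better suited to Chinese: count CJK characters."""
    return sum(1 for ch in text if is_cjk(ch))


def cjk_fraction(text: str) -> float:
    """Share of non-whitespace characters that are CJK ideographs."""
    non_space = [ch for ch in text if not ch.isspace()]
    if not non_space:
        return 0.0
    return sum(1 for ch in non_space if is_cjk(ch)) / len(non_space)


if __name__ == "__main__":
    sample = "今天天气很好，我们去公园散步。"
    # The whitespace-based count treats the whole sentence as one "word".
    print(whitespace_word_count(sample))
    print(cjk_char_count(sample))
    print(round(cjk_fraction(sample), 3))
```

Any thresholds built on such signals would still have to be tuned on real Chinese data.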

Dzg0309 commented 5 months ago

Thank you very much for your reply. It is very difficult for us to filter Chinese data out of the original large-scale CommonCrawl because we cannot handle such a large CC dump. Is there a way to obtain data that is already separated by language, for example raw Chinese data? That way, we could process it and generate a Chinese dataset using CCNet and the library you provided.

davidrpugh commented 1 month ago

@mauriceweber I am a faculty member at King Abdullah University of Science and Technology (KAUST) in Saudi Arabia. I am about to kick off a project to apply these workflows to prepare an Arabic language subset, with the goal of contributing it to the next version of this dataset. Would there be interest in collaborating on this project? We have technical skills and plenty of compute, so what we really need is general guidance if we get stuck.

@Dzg0309 depending on how many resources we need to prepare the Arabic data, we may also be able to prepare data for other languages.

mauriceweber commented 1 month ago

Hi @davidrpugh, awesome to hear that! I'm happy to provide any guidance you need and am open to collaborating on this! :)