togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

What does "default" do in `load_dataset('togethercomputer/RedPajama-Data-1T', "default")`? #70

Open brando90 opened 10 months ago

brando90 commented 10 months ago

The code asks me for a `name` argument, e.g.:

`load_dataset('togethercomputer/RedPajama-Data-1T', "default")`

I want to use all of the datasets. Is "default" the right argument?

brando90 commented 10 months ago

https://discord.com/channels/1082503318624022589/1097534874719625236/1143268159256789032

mauriceweber commented 10 months ago

Hi @brando90, if you set the `name` argument to `default`, the entire RedPajama-1T dataset gets loaded (note that this requires ~3 TB of disk space).

If you are interested in only one specific subset of the dataset, you can choose among `arxiv`, `book`, `c4`, `common_crawl`, `github`, `stackexchange`, and `wikipedia`.
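A rough sketch of how the `name` argument could be picked, based only on the description above (not an official snippet from the repository). The helper names `resolve_name` and `load_redpajama` are hypothetical, and `streaming=True` is an assumption made here to avoid downloading everything up front. Depending on the installed `datasets` version, extra arguments may be needed for script-based datasets.

```python
# Subset names as listed in this thread.
SUBSETS = (
    "arxiv", "book", "c4", "common_crawl",
    "github", "stackexchange", "wikipedia",
)


def resolve_name(subset=None):
    """Return the `name` to pass to load_dataset (hypothetical helper).

    None means "everything", which maps to "default".
    """
    if subset is None:
        return "default"
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset {subset!r}; expected one of {SUBSETS}")
    return subset


def load_redpajama(subset=None, streaming=True):
    """Load the full dataset or a single subset (hypothetical helper)."""
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "togethercomputer/RedPajama-Data-1T",
        resolve_name(subset),
        streaming=streaming,  # assumption: iterate lazily instead of a full download
    )
```

For example, `load_redpajama("arxiv")` would request only the arXiv subset, while `load_redpajama()` would request everything via `default`.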

brando90 commented 10 months ago

like this?

    # See https://github.com/togethercomputer/RedPajama-Data/issues/70 and https://github.com/togethercomputer/RedPajama-Data
    path, name = 'togethercomputer/RedPajama-Data-1T', 'default'
    path, name = 'cerebras/SlimPajama-627B', 'default'  # overrides the line above
Note: I'm after the SlimPajama one, @mauriceweber.