Open brando90 opened 10 months ago
Hi @brando90, if you set the name argument to default, the entire RedPajama-1T dataset gets loaded (note that this requires ~3 TB of disk space). If you are interested in only one specific split of the dataset, you can choose among arxiv, book, c4, common_crawl, github, stackexchange, and wikipedia.
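For illustration, here is a minimal sketch of how the name argument could be chosen and passed to the Hugging Face `datasets` library. The helper names `redpajama_config_name` and `load_redpajama` are hypothetical and exist only for this example; the subset names are the ones listed in this reply. Depending on the installed `datasets` version, loading this repository may need extra arguments (for example `trust_remote_code=True`), so treat the call as an assumption to check against your setup rather than a confirmed recipe.

```python
REDPAJAMA_PATH = "togethercomputer/RedPajama-Data-1T"

# Subset names as listed in the reply; "default" selects the whole dataset.
REDPAJAMA_SUBSETS = (
    "arxiv",
    "book",
    "c4",
    "common_crawl",
    "github",
    "stackexchange",
    "wikipedia",
)


def redpajama_config_name(subset=None):
    """Return the value to pass as `name`: 'default' for everything, else one subset."""
    if subset is None:
        return "default"
    if subset not in REDPAJAMA_SUBSETS:
        raise ValueError(
            f"Unknown subset {subset!r}; expected one of {REDPAJAMA_SUBSETS}"
        )
    return subset


def load_redpajama(subset=None, streaming=True):
    """Hypothetical wrapper around datasets.load_dataset for RedPajama-1T.

    streaming=True avoids downloading the full dataset to disk up front.
    """
    # Imported here so the helper above can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        REDPAJAMA_PATH,
        redpajama_config_name(subset),
        streaming=streaming,
    )
```

Calling `load_redpajama("arxiv")` would request only the arxiv subset, while `load_redpajama()` passes "default" and requests everything.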
Like this?
path, name = 'togethercomputer/RedPajama-Data-1T', 'default' # https://github.com/togethercomputer/RedPajama-Data/issues/70, https://github.com/togethercomputer/RedPajama-Data
path, name = 'cerebras/SlimPajama-627B', 'default' # https://github.com/togethercomputer/RedPajama-Data/issues/70, https://github.com/togethercomputer/RedPajama-Data
Note that I'm after the SlimPajama one, @mauriceweber.
Code is asking me for a name e.g.,
I want to use all the datasets. Is "default" the right argument?