togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

Recommended way to load wget-downloaded data using HF datasets API? #100

Open zijwang opened 8 months ago

zijwang commented 8 months ago

I downloaded the data with wget, following the instructions here. Is there a recommended way to load it via the HF datasets API, similar to this?

mauriceweber commented 8 months ago

Hi @zijwang, my guess is that you can take the RPv2 data loader script here and change its _URL_BASE variable to point to the base directory of the downloaded data on your filesystem. You should then be able to pass the path of your modified loading script to datasets.load_dataset (here is an explanation of how this works).
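
A minimal sketch of that suggestion, in case it helps. It is untested against the actual loader and rests on several assumptions: that the loader script defines `_URL_BASE` as a single-line assignment, that your local directory mirrors the remote layout the script expects, and that your installed `datasets` version still supports script-based loading (some recent versions require `trust_remote_code=True` or no longer support loading scripts). The helper names and file paths here are hypothetical.

```python
import re
from pathlib import Path


def point_loader_at_local_dir(script_text: str, local_base: str) -> str:
    """Return the loader source with its _URL_BASE assignment replaced.

    Assumes _URL_BASE is assigned on a single line in the script.
    """
    pattern = re.compile(r"^_URL_BASE\s*=.*$", flags=re.MULTILINE)
    # A callable replacement avoids backslash-escape surprises in paths.
    new_text, n_subs = pattern.subn(
        lambda _match: f"_URL_BASE = {local_base!r}", script_text, count=1
    )
    if n_subs == 0:
        raise ValueError("no _URL_BASE assignment found in loader script")
    return new_text


def write_local_loader(src: Path, dst: Path, local_base: str) -> Path:
    """Write a copy of the loader script that points at local_base."""
    dst.write_text(point_loader_at_local_dir(src.read_text(), local_base))
    return dst


def load_local(loader_path: Path, **kwargs):
    """Pass the modified script to datasets.load_dataset.

    Imported lazily so the helpers above work without `datasets` installed.
    Extra keyword arguments (e.g. a config name) are forwarded unchanged.
    """
    import datasets  # pip install datasets

    return datasets.load_dataset(str(loader_path), **kwargs)
```

Usage would look roughly like `write_local_loader(Path("rpv2_loader.py"), Path("rpv2_loader_local.py"), "/path/to/downloaded/data")` followed by `load_local(Path("rpv2_loader_local.py"))`, where both file names and the data path are placeholders for your own.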