togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

Running the pipeline on cloud or a big data platform #104

Open zllai opened 7 months ago

zllai commented 7 months ago

Dear RedPajama team,

I apologize if this is not the right place to ask questions, but I was curious about several aspects of your project and couldn't find a better way to reach out.

I'm a PhD student at CUHK and recently came across your amazing project. I was impressed by the size of the dataset and by the fact that only a few Python scripts are needed to prepare data at such a great volume.

I wonder how many CPUs you used and how much time that took. Have you explored using big data platforms like Spark, Flink, or Hadoop to facilitate distributed data processing and storage, or did you consider using cloud services to reduce cost? When developing the pipeline, how did you manage its evolving versions and evaluate the quality of the dataset that each version generates?

My current research is on designing a low-cost platform for LLM data preparation in the cloud. Your insights would greatly assist researchers like me.

Best, Bruce

mauriceweber commented 7 months ago

Hi @zllai, thanks for your questions!

I wonder how many CPUs you used and how much time that took

We used 16 AWS nodes with 64 CPU cores and 500 GB of RAM for the largest part of the pipeline; the total processing took around two months with that setup.

Have you explored using big data platforms like Spark, Flink, or Hadoop to facilitate distributed data processing and storage?

We have not explored that, but using a framework for scheduling jobs on the different nodes would definitely be useful.
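As a rough illustration of what that could look like (not part of the RedPajama codebase), here is a minimal PySpark sketch that distributes per-file processing across a cluster. The `process_wet_file` function and the input path list are hypothetical placeholders.

```python
# Minimal sketch: distribute per-file CommonCrawl processing with PySpark.
# process_wet_file and wet_paths are hypothetical placeholders, not the
# actual RedPajama pipeline code.
from pyspark.sql import SparkSession


def process_wet_file(path: str) -> dict:
    # Placeholder: fetch the WET file, apply quality filters,
    # and write the cleaned documents back to shared storage.
    ...
    return {"path": path, "status": "ok"}


spark = SparkSession.builder.appName("cc-processing").getOrCreate()

# Hypothetical list of CommonCrawl WET paths to process.
wet_paths = ["crawl-data/CC-MAIN-2023-06/segments/.../file1.warc.wet.gz"]

# Each path becomes one task; Spark schedules the tasks across the cluster nodes.
results = (
    spark.sparkContext
    .parallelize(wet_paths, numSlices=len(wet_paths))
    .map(process_wet_file)
    .collect()
)
```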

When developing the pipeline, how did you manage its evolving versions and evaluate the quality of the dataset that each version generates?

We used the same version of the pipeline for the entire run, so there was no risk of conflicting versions when processing different parts of CommonCrawl. Typically, you would include unit tests to ensure (at least to some degree) the consistency of different versions of the pipeline.
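For example, here is a hedged sketch of the kind of unit test that could pin down filter behavior across pipeline versions; the `keep_document` filter and its threshold are purely illustrative, not the actual RedPajama rules.

```python
# Hypothetical example: a unit test that pins down the behavior of a quality
# filter, so a pipeline change that alters its output is caught before a full run.

def keep_document(text: str, min_words: int = 10) -> bool:
    # Toy filter: keep documents with at least `min_words` whitespace-separated tokens.
    return len(text.split()) >= min_words


def test_short_documents_are_dropped():
    assert not keep_document("too short")


def test_long_documents_are_kept():
    assert keep_document("word " * 20)
```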

For your research, I would also recommend looking into other tools such as datatrove by HuggingFace and the Dolma toolkit by Allen AI.