togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

Inconsistent IDs lead to distributed computing woes. #111

Open axelmagn opened 3 months ago

axelmagn commented 3 months ago

When trying to work with these data via Dataflow, I noticed a few things:

This creates a lot of unnecessary friction when working with big data pipelines, since line numbers are not usually available once files are read in parallel. I'm finding myself writing a custom reader (sort of a bummer if you've ever had to do it).
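For readers unfamiliar with the workaround, here is a minimal sketch of what such a custom reader might look like. It assumes line-delimited JSON shards; the function name `read_with_line_keys` and the shard name are hypothetical and not taken from the repository. It attaches a `(shard_name, line_index)` key to every record, so that the position survives after the data is distributed.

```python
import io
import json
from typing import Iterator, TextIO, Tuple

Key = Tuple[str, int]


def read_with_line_keys(shard_name: str, handle: TextIO) -> Iterator[Tuple[Key, dict]]:
    """Yield ((shard_name, line_index), record) for each non-empty JSONL line.

    Hypothetical helper: the shard must be read sequentially by one worker,
    which is exactly the constraint that makes line-number keys awkward.
    """
    for line_index, line in enumerate(handle):
        line = line.strip()
        if not line:
            # Blank lines are skipped, but enumerate() keeps counting,
            # so indices still match positions in the original file.
            continue
        yield (shard_name, line_index), json.loads(line)


# Tiny in-memory stand-in for a real shard file.
sample = io.StringIO('{"text": "a"}\n{"text": "b"}\n')
keyed = list(read_with_line_keys("shard-0000", sample))
```

The cost of this approach is that a shard can no longer be split across workers, because the line index only makes sense when the file is read from the start.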

For future data releases, please consider embedding a consistent key across all file groups for easier joining at scale. Just a UUID would be fine.
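To illustrate the request, here is a rough sketch of how an embedded key would simplify the join. The field name `doc_id`, the helper `join_on_key`, and the sample records are assumptions made for illustration and do not reflect the actual dataset schema. The point is that records from two file groups can be matched regardless of file order or sharding.

```python
import uuid
from typing import Iterable, Iterator, Tuple


def join_on_key(
    left: Iterable[dict], right: Iterable[dict], key: str = "doc_id"
) -> Iterator[Tuple[dict, dict]]:
    """Inner-join two record streams on a shared key (hypothetical field name)."""
    index = {rec[key]: rec for rec in right}  # in-memory index of the right side
    for rec in left:
        match = index.get(rec[key])
        if match is not None:
            yield rec, match


# One UUID per document, written into every file group at release time.
ids = [str(uuid.uuid4()) for _ in range(3)]
documents = [{"doc_id": i, "text": f"doc {n}"} for n, i in enumerate(ids)]
# The second group arrives in a different order; the key still lines records up.
signals = [{"doc_id": i, "score": n * 0.5} for n, i in reversed(list(enumerate(ids)))]

joined = list(join_on_key(documents, signals))
```

In a distributed pipeline the in-memory dictionary would be replaced by the framework's keyed join (for example, a `CoGroupByKey` step in Apache Beam), but the shape of the data is the same: every record carries its own key.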

mauriceweber commented 2 months ago

Hi @axelmagn, thanks for your feedback. These are very good points, and this is something we will definitely do in future releases.