togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0
4.43k stars 335 forks

Difference between RedPajama-Data-1T, RedPajama-Data-V2, RedPajama-Data-V1 #112

Open konradipipan opened 3 months ago

konradipipan commented 3 months ago

Is the 1T version basically V1? If so, is the HF version of V1 (1T) already deduplicated and ready to be used?

mauriceweber commented 3 months ago

Hi @konradipipan -- You are correct, RedPajama-Data-1T on HF corresponds to v1. This dataset is not deduplicated. If you want a deduplicated version, you can check out SlimPajama, a version of RPv1 that has been cleaned and deduplicated across dataset slices with MinHashLSH.
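
For readers unfamiliar with MinHashLSH, here is a rough, standard-library-only sketch of the general idea: build a MinHash signature per document, then use LSH banding to find near-duplicate candidates. This is not SlimPajama's actual pipeline. The function names, the word 3-gram shingling, and the `NUM_PERM` / `BANDS` values are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 64  # signature length (illustrative value, not from SlimPajama)
BANDS = 16     # number of LSH bands; rows per band = NUM_PERM // BANDS


def shingles(text, n=3):
    """Return the set of word n-grams of a document (hypothetical helper)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(items):
    """For each seeded hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def candidate_pairs(signatures):
    """Documents sharing an identical band of their signature become candidates."""
    rows = NUM_PERM // BANDS
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(BANDS):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick brown fox jumps over the lazy dog near the river bank",
    "c": "large language models are trained on very big text corpora today",
}
signatures = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
pairs = candidate_pairs(signatures)
```

In a real deduplication run you would verify each candidate pair against a similarity threshold and keep one document per duplicate cluster. More bands with fewer rows each makes the candidate step more permissive.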