Open konradipipan opened 3 months ago
Hi @konradipipan -- You are correct, RedPajama-Data-1T on HF corresponds to v1. This dataset is not deduplicated. If you want a deduplicated version, you can check out SlimPajama, a cleaned version of RPv1 that is deduplicated across dataset slices with MinHashLSH.
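In case it helps to see what MinHashLSH deduplication does, here is a minimal standard-library sketch of the general technique. It is not the SlimPajama pipeline: the function names, the word 3-gram shingling, `NUM_PERM`, `BANDS`, and the 0.8 threshold are all assumptions picked for illustration.

```python
import hashlib

NUM_PERM = 64  # hash functions per signature (assumed value)
BANDS = 16     # LSH bands; rows per band = NUM_PERM // BANDS (assumed value)


def shingles(text, n=3):
    """Return the set of lowercase word n-grams of a document."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash(text):
    """MinHash signature: for each seed, the minimum hash over all shingles."""
    grams = [g.encode("utf-8") for g in shingles(text)]
    signature = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "little"
            )
            for g in grams
        ))
    return signature


def deduplicate(docs, threshold=0.8):
    """Keep the first document of every group of near-duplicates."""
    rows = NUM_PERM // BANDS
    buckets = {}     # (band index, band values) -> indices of kept documents
    signatures = {}  # index of kept document -> its signature
    kept = []
    for idx, doc in enumerate(docs):
        sig = minhash(doc)
        keys = [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(BANDS)]
        # LSH step: only documents sharing at least one band are compared.
        candidates = {j for key in keys for j in buckets.get(key, ())}
        # The fraction of matching signature slots estimates Jaccard similarity.
        is_duplicate = any(
            sum(a == b for a, b in zip(sig, signatures[j])) / NUM_PERM >= threshold
            for j in candidates
        )
        if is_duplicate:
            continue
        signatures[idx] = sig
        kept.append(idx)
        for key in keys:
            buckets.setdefault(key, []).append(idx)
    return [docs[i] for i in kept]
```

Calling `deduplicate(list_of_texts)` returns the texts with later near-duplicates dropped. The banding step is what keeps this cheap at scale: documents are only compared when they collide in at least one band, rather than all pairs being checked.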
Is the 1T version basically V1? If so, is the HF version of V1 (1T) already deduplicated and ready to be used?