togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0
4.43k stars, 335 forks

quality_signals, minhash and duplicates missing for tail #77

Closed by Sheshansh 7 months ago

Sheshansh commented 8 months ago

The quality_signals, minhash and duplicates files are missing from BASE_URL for the tail partition of the dataset.

For example, this command fails with Error 404:

wget https://data.together.xyz/redpajama-data-v2/v1.0.0/quality_signals/2023-06/0001/en_tail.signals.json.gz

Is this intended?

mauriceweber commented 8 months ago

See the Hugging Face issue:

Hi @sheshanshag,

The link you provided points to the tail partition of the dataset, which only includes the text data (i.e. the documents/ paths). We computed the quality signals, minhashes and duplicates only for the head_middle partition of the dataset.

I updated the readme to make this clearer. Thanks for letting us know!
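For anyone scripting downloads, the rule described above can be checked before a request is made. The following is a minimal sketch, not code from the repository. The base URL and the `signals.json.gz` suffix are taken from the failing command in this issue. The `documents` suffix, the bucket names `head` and `middle` (assumed here to make up the head_middle partition), and the helper names `is_available` and `build_url` are assumptions for illustration.

```python
# Base URL taken from the wget command quoted in this issue.
BASE_URL = "https://data.together.xyz/redpajama-data-v2/v1.0.0"

# Components that, per the reply above, exist only for head_middle.
HEAD_MIDDLE_ONLY = {"quality_signals", "minhash", "duplicates"}

# Assumption: the head_middle partition is stored as "head" and "middle" files.
HEAD_MIDDLE_BUCKETS = {"head", "middle"}

# File suffixes. "signals.json.gz" comes from the URL in this issue;
# the "documents" suffix is an assumption.
SUFFIXES = {
    "documents": "json.gz",
    "quality_signals": "signals.json.gz",
}


def is_available(component, bucket):
    """Return True if the component is expected to exist for this bucket."""
    if component in HEAD_MIDDLE_ONLY:
        return bucket in HEAD_MIDDLE_BUCKETS
    return True


def build_url(component, snapshot, shard, lang, bucket):
    """Build a download URL, or return None if the file is not expected to exist."""
    if not is_available(component, bucket):
        return None
    suffix = SUFFIXES[component]
    return f"{BASE_URL}/{component}/{snapshot}/{shard}/{lang}_{bucket}.{suffix}"


if __name__ == "__main__":
    # The request from this issue: no quality signals for the tail partition.
    print(build_url("quality_signals", "2023-06", "0001", "en", "tail"))
    # The text data is still expected for the tail partition.
    print(build_url("documents", "2023-06", "0001", "en", "tail"))
```

Returning `None` for tail quality signals lets a download loop skip those files instead of hitting a 404.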