Closed Sheshansh closed 7 months ago
See the Hugging Face issue:
Hi @sheshanshag ,
The link you provided points to the tail partition of the dataset, which only includes the text data (i.e. the documents/ paths). We computed the quality signals, minhashes, and duplicates only for the head_middle partition of the dataset.
I updated the README to make this clearer. Thanks for letting us know!
The quality_signals, minhash and duplicates files are missing from BASE_URL for the tail partition of the dataset.
For example, this request fails with a 404 error:
wget https://data.together.xyz/redpajama-data-v2/v1.0.0/quality_signals/2023-06/0001/en_tail.signals.json.gz
Is this intended?
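For anyone scripting downloads, here is a minimal sketch of a guard that skips the components not published for the tail partition, so the 404 never comes up. It follows the reply above: only documents exist for tail. The function names (`component_available`, `signals_url`, `fetch_signals`) are made up for this example, and the `en_head` / `en_middle` file names are an assumption modelled on the `en_tail` URL in this issue.

```shell
#!/bin/sh
# Value taken from the failing wget URL in this issue.
BASE_URL="https://data.together.xyz/redpajama-data-v2/v1.0.0"

# component_available COMPONENT PARTITION  (hypothetical helper)
# Returns 0 if the component is expected to exist for the partition.
# Per the maintainer's reply, tail only has documents.
component_available() {
    case "$1" in
        documents)
            return 0 ;;
        quality_signals|minhash|duplicates)
            if [ "$2" = "tail" ]; then
                return 1
            fi
            return 0 ;;
        *)
            return 1 ;;
    esac
}

# signals_url SNAPSHOT SHARD LANGUAGE PARTITION  (hypothetical helper)
# Builds a quality_signals URL in the same shape as the one in this issue.
# Partition values "head" and "middle" are assumed, not confirmed here.
signals_url() {
    printf '%s/quality_signals/%s/%s/%s_%s.signals.json.gz\n' \
        "$BASE_URL" "$1" "$2" "$3" "$4"
}

# fetch_signals SNAPSHOT SHARD LANGUAGE PARTITION  (hypothetical helper)
# Downloads the signals file only when it is expected to exist.
fetch_signals() {
    if component_available quality_signals "$4"; then
        wget -q "$(signals_url "$1" "$2" "$3" "$4")"
    else
        echo "skipping quality_signals for partition '$4' (not published)" >&2
    fi
}
```

Calling `fetch_signals 2023-06 0001 en tail` prints a skip message instead of issuing the request, while `fetch_signals 2023-06 0001 en head` would attempt the download, assuming that file name is correct.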