Hi @sagnak, thanks for flagging that. There are indeed some broken files in the dataset (a few hundred in total, mostly resulting from failed S3 queries). You can safely skip these.
See also the discussion on Hugging Face: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2/discussions/20
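
For anyone who wants to skip the broken files locally, here is a minimal sketch. It assumes the shards have already been downloaded as `.json.gz` files and that a broken shard shows up as an empty, truncated, or non-gzip file; other kinds of corruption would need a different check. The helper names `is_readable_gzip` and `readable_shards` are made up for this illustration and are not part of any RedPajama tooling.

```python
import gzip
import zlib
from pathlib import Path


def is_readable_gzip(path, chunk_size=1 << 20):
    """Return True if the file is non-empty and decompresses end to end."""
    try:
        # Treat a zero-byte download as broken (assumption: a valid shard
        # is never empty).
        if Path(path).stat().st_size == 0:
            return False
        with gzip.open(path, "rb") as fh:
            # Read the whole stream in chunks so that truncation or a bad
            # checksum near the end of the file is also detected.
            while fh.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers missing files and gzip.BadGzipFile (not gzip data);
        # EOFError is raised for truncated streams.
        return False


def readable_shards(paths):
    """Yield only the paths that pass the check, reporting the rest."""
    for path in paths:
        if is_readable_gzip(path):
            yield path
        else:
            print(f"skipping unreadable shard: {path}")
```

Reading each file to the end costs one full decompression pass per shard, but it avoids failing halfway through a shard later in the pipeline.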
Thanks for the Hugging Face pointer! Indeed, all of the shards mentioned there are broken for me as well.
I cannot get a valid shard from https://data.together.xyz/redpajama-data-v2/v1.0.0/documents/2014-52/1940/es_middle.json.gz
Can anyone else replicate this, or is the issue on my end?