togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

possibly missing shard from host #97

Closed. sagnak closed this issue 5 months ago

sagnak commented 5 months ago

I cannot get a valid shard from https://data.together.xyz/redpajama-data-v2/v1.0.0/documents/2014-52/1940/es_middle.json.gz

Can anyone else replicate this, or is the issue on my end?

mauriceweber commented 5 months ago

Hi @sagnak, thanks for flagging that -- there are indeed a few files in the dataset that are broken (a few hundred in total, mostly resulting from failed S3 queries). You can safely skip these.

See also the discussion on Hugging Face: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2/discussions/20
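For readers who want to act on the "safely skip these" advice, here is a minimal sketch of how a loader might tolerate broken shards. It is not code from the RedPajama-Data repository: the function name `load_shard_or_skip` is hypothetical, and it assumes each shard is a gzip-compressed file with one JSON document per line and that the raw bytes have already been downloaded.

```python
import gzip
import json
import zlib


def load_shard_or_skip(raw: bytes):
    """Return the parsed records of one shard, or None if it is unreadable.

    Assumption: a shard is gzip-compressed JSON Lines (one JSON object per line).
    """
    if not raw:
        # An empty payload is treated as a broken shard.
        return None
    try:
        text = gzip.decompress(raw).decode("utf-8")
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    except (OSError, EOFError, zlib.error, ValueError):
        # gzip.BadGzipFile is an OSError (payload is not gzip, or CRC mismatch),
        # EOFError signals a truncated stream, zlib.error a corrupt stream, and
        # JSON or UTF-8 decoding errors are subclasses of ValueError.
        return None
```

A caller would then drop the `None` results, for example `records = [r for r in map(load_shard_or_skip, payloads) if r is not None]`, where `payloads` is a hypothetical list of downloaded byte strings. Logging the URL of every skipped shard makes it easy to compare against the list in the Hugging Face discussion.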

sagnak commented 5 months ago

Thanks for the Hugging Face pointer! Indeed, all of the shards mentioned there are also broken for me.