togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0
4.43k stars 335 forks

Impossible to unpack tail data: it took time to download, but the dataset cannot be unpacked without quality signals, and their link is broken. #94

Closed RuslanKovalyov closed 5 months ago

RuslanKovalyov commented 6 months ago

An error occurred while processing the 'common_crawl' configuration: Couldn't find file at https://data.together.xyz/redpajama-data-v2/v1.0.0/quality_signals/2023-14/0000/en_tail.signals.json.gz


Opening the link in a browser gives: Error 404 This object could not be viewed

lipingtang17 commented 5 months ago

It seems that the authors did not generate quality signals for tail data.
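
One way to confirm which quality-signal files are missing before starting a long download is to probe the URL first. The following is a minimal sketch, not part of the RedPajama-Data code: the helper names `quality_signals_url` and `url_exists` are hypothetical, the URL pattern is inferred only from the single URL in the error message above, and it assumes the server answers HEAD requests.

```python
import urllib.error
import urllib.request

# Base URL copied from the error message in this issue.
BASE_URL = "https://data.together.xyz/redpajama-data-v2/v1.0.0"


def quality_signals_url(snapshot: str, shard: str, lang: str, partition: str) -> str:
    """Build a quality-signals URL (hypothetical helper).

    The layout <snapshot>/<shard>/<lang>_<partition>.signals.json.gz is an
    assumption inferred from the one URL reported in the error message.
    """
    return (
        f"{BASE_URL}/quality_signals/{snapshot}/{shard}/"
        f"{lang}_{partition}.signals.json.gz"
    )


def url_exists(url: str, opener=urllib.request.urlopen, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request succeeds, False if the server answers 404.

    `opener` is injectable so the function can be exercised without network access.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        # Other HTTP errors do not prove that the file is missing.
        raise
```

Calling `url_exists(quality_signals_url("2023-14", "0000", "en", "tail"))` would then probe the file named in the error, and the same call with `"head"` or `"middle"` as the partition (assuming those partition names) would show whether only the tail signals are absent.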