Open shizhediao opened 2 months ago
Hi @shizhediao,
This error occurs because the metadata in some of the jsonl files contains the WARC-Truncated field, an optional field that is present in only some WARCs. I will look into how to resolve this so that the dataset can be loaded with load_dataset. In the meantime, I would recommend downloading the dataset from HF and using the jsonl.zst files separately.
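As a rough illustration of working with the files separately, here is a minimal sketch of one possible workaround, not a confirmed fix. It assumes each JSON line is an object whose WARC headers live under a "metadata" key (an assumption about the file layout), and it fills in WARC-Truncated as None wherever it is missing so that every record exposes the same fields. The function name normalize_records and the OPTIONAL_METADATA_KEYS tuple are hypothetical names used only for this example. The input is any iterable of text lines, for instance a jsonl.zst file that has already been decompressed (e.g. with the third-party zstandard package).

```python
import json
from typing import Iterable, Iterator

# Assumption: only this optional field varies between records.
OPTIONAL_METADATA_KEYS = ("WARC-Truncated",)


def normalize_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed records with optional metadata keys always present.

    Assumes each non-empty line is a JSON object and that WARC headers
    are stored under a "metadata" key (hypothetical layout).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Treat a missing or null metadata entry as an empty dict.
        metadata = record.get("metadata") or {}
        for key in OPTIONAL_METADATA_KEYS:
            metadata.setdefault(key, None)  # fill the gap, keep existing values
        record["metadata"] = metadata
        yield record


if __name__ == "__main__":
    # Tiny in-memory sample standing in for a decompressed jsonl file.
    sample = [
        '{"text": "a", "metadata": {"WARC-Type": "response"}}',
        '{"text": "b", "metadata": {"WARC-Type": "response", "WARC-Truncated": "length"}}',
    ]
    for rec in normalize_records(sample):
        print(sorted(rec["metadata"]))
```

Because the function only consumes an iterable of strings, it can be applied to one file at a time without loading the whole dataset into memory.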
However, I also want to point out that the dclm-pool-400m-1x dataset (and the other pool datasets) are not intended to be used directly for training - they contain only very minimal processing and are meant to be processed further. As such, I would recommend processing each jsonl file individually first (with our pipeline and/or your own implementations).
Thank you for your explanation! I would like to clean the pool datasets. Looking forward to the solutions! I will use the jsonl.zst files for now. Thanks!
Hi,
When I ran the following command to download the dataset from the Hugging Face Hub, I encountered an error:
My command:
The error:
Could you help take a look? Thanks!