mlfoundations / dclm

DataComp for Language Models
MIT License
1.16k stars 108 forks

TypeError: Couldn't cast array of type #66

Open shizhediao opened 2 months ago

shizhediao commented 2 months ago

Hi,

When I ran the following command to download the dataset from the Hugging Face Hub, I encountered an error:

My command:

from datasets import load_dataset

ds = load_dataset("mlfoundations/dclm-pool-400m-1x")

The error:

File /lustre/fsw/portfolios/table.py:2122, in cast_array_to_feature(array, feature, allow_primitive_to_str, allow_decimal_to_str)
   2116     return array_cast(
   2117         array,
   2118         feature(),
   2119         allow_primitive_to_str=allow_primitive_to_str,
   2120         allow_decimal_to_str=allow_decimal_to_str,
   2121     )
-> 2122 raise TypeError(f"Couldn't cast array of type\n{_short_str(array.type)}\nto\n{_short_str(feature)}")

TypeError: Couldn't cast array of type
struct<WARC-Type: string, WARC-Date: timestamp[s], WARC-Record-ID: string, Content-Length: string, Content-Type: string, WARC-Warcinfo-ID: string, WARC-Concurrent-To: string, WARC-IP-Address: string, WARC-Target-URI: string, WARC-Payload-Digest: string, WARC-Block-Digest: string, WARC-Identified-Payload-Type: string>
to
{'WARC-Type': Value(dtype='string', id=None), 'WARC-Date': Value(dtype='timestamp[s]', id=None), 'WARC-Record-ID': Value(dtype='string', id=None), 'Content-Length': Value(dtype='string', id=None), 'Content-Type': Value(dtype='string', id=None), 'WARC-Warcinfo-ID': Value(dtype='string', id=None), 'WARC-Concurrent-To': Value(dtype='string', id=None), 'WARC-IP-Address': Value(dtype='string', id=None), 'WARC-Target-URI': Value(dtype='string', id=None), 'WARC-Payload-Digest': Value(dtype='string', id=None), 'WARC-Block-Digest': Value(dtype='string', id=None), 'WARC-Identified-Payload-Type': Value(dtype='string', id=None), 'WARC-Truncated': Value(dtype='string', id=None)}

The above exception was the direct cause of the following exception:

Could you help take a look? Thanks!

GeorgiosSmyrnis commented 2 months ago

Hi @shizhediao ,

This error is caused by the fact that some of the metadata fields in the jsonl files contain the WARC-Truncated field, an optional field that appears in only some WARC records. I will look into how this can be resolved so that the dataset can be loaded with load_dataset. In the meantime, I would recommend simply downloading the dataset from HF and working with the jsonl.zst files directly for now.

However, I also want to point out that the dclm-pool-400m-1x dataset (like the other pool datasets) is not intended to be used directly for training. It has only very minimal processing applied and is meant to be processed further. As such, I would recommend doing that first, processing each jsonl file individually (with our pipeline and/or your own implementations).
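A minimal per-file sketch of what such processing could look like. The keep_long_enough filter below is a toy placeholder for illustration only, not one of DCLM's actual pipeline filters:

```python
import json


def process_jsonl(in_path, out_path, keep):
    """Stream records from one jsonl file and write only those passing `keep`."""
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            if keep(rec):
                fout.write(json.dumps(rec) + "\n")


def keep_long_enough(rec, min_chars=200):
    """Toy quality filter: drop records with very little text."""
    return len(rec.get("text", "")) >= min_chars
```

Processing shard by shard like this also parallelizes trivially (one worker per jsonl file), which is how large pools are usually handled.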

shizhediao commented 2 months ago

Thank you for your explanation! I would like to clean the pool datasets, so I'm looking forward to the fix. I will use the jsonl.zst files for now. Thanks!