huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
19.29k stars 2.7k forks

TypeError: Couldn't cast array of type string to null in long json #7222

Open nokados opened 1 month ago

nokados commented 1 month ago

Describe the bug

In general, changing the type from string to null is allowed within a dataset, and there are even examples of this in the documentation.

However, if the dataset is large and unevenly distributed, this allowance stops working: the schema gets locked in after the first chunk is read.

Consequently, if all values in the first chunk of a field are, for example, null, the field will be locked as type null, and if a string appears in that field in the second chunk, it will trigger this error:

Traceback

```
TypeError                                 Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/datasets/builder.py](https://localhost:8080/#) in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1868 try:
-> 1869 writer.write_table(table)
   1870 except CastError as cast_error:

14 frames
[/usr/local/lib/python3.10/dist-packages/datasets/arrow_writer.py](https://localhost:8080/#) in write_table(self, pa_table, writer_batch_size)
    579 pa_table = pa_table.combine_chunks()
--> 580 pa_table = table_cast(pa_table, self._schema)
    581 if self.embed_local_files:

[/usr/local/lib/python3.10/dist-packages/datasets/table.py](https://localhost:8080/#) in table_cast(table, schema)
   2291 if table.schema != schema:
-> 2292 return cast_table_to_schema(table, schema)
   2293 elif table.schema.metadata != schema.metadata:

[/usr/local/lib/python3.10/dist-packages/datasets/table.py](https://localhost:8080/#) in cast_table_to_schema(table, schema)
   2244 )
-> 2245 arrays = [
   2246 cast_array_to_feature(

[/usr/local/lib/python3.10/dist-packages/datasets/table.py](https://localhost:8080/#) in (.0)
   2245 arrays = [
-> 2246 cast_array_to_feature(
   2247 table[name] if name in table_column_names else pa.array([None] * len(table), type=schema.field(name).type),

[/usr/local/lib/python3.10/dist-packages/datasets/table.py](https://localhost:8080/#) in wrapper(array, *args, **kwargs)
   1794 if isinstance(array, pa.ChunkedArray):
-> 1795 return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
   1796 else:

[/usr/local/lib/python3.10/dist-packages/datasets/table.py](https://localhost:8080/#) in (.0)
   1794 if isinstance(array, pa.ChunkedArray):
-> 1795 return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
   1796 else:

[/usr/local/lib/python3.10/dist-packages/datasets/table.py](https://localhost:8080/#) in cast_array_to_feature(array, feature, allow_primitive_to_str, allow_decimal_to_str)
   2101 elif not isinstance(feature, (Sequence, dict, list, tuple)):
-> 2102 return array_cast(
   2103 array,

[/usr/local/lib/python3.10/dist-packages/datasets/table.py](https://localhost:8080/#) in wrapper(array, *args, **kwargs)
   1796 else:
-> 1797 return func(array, *args, **kwargs)
   1798

[/usr/local/lib/python3.10/dist-packages/datasets/table.py](https://localhost:8080/#) in array_cast(array, pa_type, allow_primitive_to_str, allow_decimal_to_str)
   1947 if pa.types.is_null(pa_type) and not pa.types.is_null(array.type):
-> 1948 raise TypeError(f"Couldn't cast array of type {_short_str(array.type)} to {_short_str(pa_type)}")
   1949 return array.cast(pa_type)

TypeError: Couldn't cast array of type string to null

The above exception was the direct cause of the following exception:

DatasetGenerationError                    Traceback (most recent call last)
[](https://localhost:8080/#) in ()
----> 1 dd = load_dataset("json", data_files=["TEST.json"])

[/usr/local/lib/python3.10/dist-packages/datasets/load.py](https://localhost:8080/#) in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, keep_in_memory, save_infos, revision, token, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
   2094
   2095 # Download and prepare data
-> 2096 builder_instance.download_and_prepare(
   2097 download_config=download_config,
   2098 download_mode=download_mode,

[/usr/local/lib/python3.10/dist-packages/datasets/builder.py](https://localhost:8080/#) in download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, dl_manager, base_path, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
    922 if num_proc is not None:
    923 prepare_split_kwargs["num_proc"] = num_proc
--> 924 self._download_and_prepare(
    925 dl_manager=dl_manager,
    926 verification_mode=verification_mode,

[/usr/local/lib/python3.10/dist-packages/datasets/builder.py](https://localhost:8080/#) in _download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
    997 try:
    998 # Prepare split will record examples associated to the split
--> 999 self._prepare_split(split_generator, **prepare_split_kwargs)
   1000 except OSError as e:
   1001 raise OSError(

[/usr/local/lib/python3.10/dist-packages/datasets/builder.py](https://localhost:8080/#) in _prepare_split(self, split_generator, file_format, num_proc, max_shard_size)
   1738 job_id = 0
   1739 with pbar:
-> 1740 for job_id, done, content in self._prepare_split_single(
   1741 gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
   1742 ):

[/usr/local/lib/python3.10/dist-packages/datasets/builder.py](https://localhost:8080/#) in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1894 if isinstance(e, DatasetGenerationError):
   1895 raise
-> 1896 raise DatasetGenerationError("An error occurred while generating the dataset") from e
   1897
   1898 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset
```
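The failure mode in the traceback above can be modeled in a few lines of plain Python. This is only a toy sketch of the behavior described in this report, not the library's implementation: `infer_type` and `write_chunks` are hypothetical names, and the only types modeled are "null" and "string".

```python
from typing import List, Optional


def infer_type(values: List[Optional[str]]) -> str:
    """Toy inference: a column holding only None is 'null', otherwise 'string'."""
    return "null" if all(v is None for v in values) else "string"


def write_chunks(chunks: List[List[Optional[str]]]) -> str:
    """Lock the column type from the first chunk, then check later chunks against it."""
    locked = infer_type(chunks[0])
    for chunk in chunks[1:]:
        found = infer_type(chunk)
        # An all-null chunk fits a string column, because None is a valid string value.
        # A string chunk does not fit a null column, because real values would be lost.
        if locked == "null" and found != "null":
            raise TypeError(f"Couldn't cast array of type {found} to {locked}")
    return locked


# A string is seen in the first chunk: later all-null chunks are accepted.
print(write_chunks([["Not Null"], [None, None]]))  # string

# The first chunk is all null: the string in the second chunk cannot be cast.
try:
    write_chunks([[None, None], ["Not Null"]])
except TypeError as err:
    print(err)  # Couldn't cast array of type string to null
```

The asymmetry is the point: the same values load or fail depending only on which chunk they fall in.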

Steps to reproduce the bug

import json
from datasets import load_dataset

with open("TEST.json", "w") as f:
    # Write just over 10 MiB of rows in which "b" is always null.
    row = {"ballast": "qwerty" * 1000, "b": None}
    row_str = json.dumps(row) + "\n"
    line_size = len(row_str)
    chunk_size = 10 << 20  # 10 MiB
    lines_in_chunk = chunk_size // line_size + 1
    print(f"Writing {lines_in_chunk} lines")
    for _ in range(lines_in_chunk):
        f.write(row_str)
    # Final row, placed after the null-only rows, in which "b" is a string.
    non_null_row = {"ballast": "Gotcha", "b": "Not Null"}
    f.write(json.dumps(non_null_row) + "\n")

# Fails with the DatasetGenerationError shown in the traceback.
load_dataset("json", data_files=["TEST.json"])
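One possible way around the problem is to avoid inference altogether by finding each column's type before loading. The sketch below scans a JSON Lines file once using only the standard library. `scan_column_types` is a hypothetical helper, not part of `datasets`, and the small sample file stands in for `TEST.json`.

```python
import json
import os
import tempfile
from typing import Dict


def scan_column_types(path: str) -> Dict[str, str]:
    """Hypothetical helper: read every row and record, per key, the Python type
    name of its non-null values ('null' if no non-null value is ever seen)."""
    types: Dict[str, str] = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            for key, value in json.loads(line).items():
                name = type(value).__name__
                if value is None:
                    # Remember the key, but let any later real value override it.
                    types.setdefault(key, "null")
                elif types.get(key, "null") == "null":
                    types[key] = name
                elif types[key] != name:
                    raise ValueError(f"Column {key!r} mixes {types[key]} and {name}")
    return types


# Small stand-in for TEST.json: many null rows, then one string at the very end.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(sample_path, "w", encoding="utf-8") as f:
    for _ in range(1000):
        f.write(json.dumps({"ballast": "qwerty", "b": None}) + "\n")
    f.write(json.dumps({"ballast": "Gotcha", "b": "Not Null"}) + "\n")

column_types = scan_column_types(sample_path)
print(column_types)  # {'ballast': 'str', 'b': 'str'}
```

With the types known, explicit features could be passed to the loader, for example `features=Features({"ballast": Value("string"), "b": Value("string")})` in the `load_dataset` call, using `Features` and `Value` from `datasets`. Whether this avoids the error on a given `datasets` version is an assumption to verify, not something confirmed in this report.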

Expected behavior

Concatenation of the chunks without errors

Environment info