huggingface / datasets

šŸ¤— The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Strange bug in loading local JSON files, using load_dataset #5955

Closed: Night-Quiet closed this issue 1 year ago

Night-Quiet commented 1 year ago

Describe the bug

I am using load_dataset to load a local JSON file, and I found a strange bug: an error is raised when the length of the JSON file exceeds roughly 160000 records (the exact threshold is uncertain). I checked the data with the code below and found no issues, so I cannot determine the true cause of this error.

The data is a list of dictionaries, as follows:

[ {'input': 'something...', 'target': 'something...', 'type': 'something...', 'history': ['something...', ...]}, ... ]

Steps to reproduce the bug

import json
from datasets import load_dataset

path = "target.json"
temp_path = "temp.json"

# The original file loads fine with the json module.
with open(path, "r") as f:
    data = json.load(f)
    print(f"\n-------the JSON file length is: {len(data)}-------\n")

# The first 160000 records load without error.
with open(temp_path, "w") as f:
    json.dump(data[:160000], f)
dataset = load_dataset("json", data_files=temp_path)
print("\n-------This works when the JSON file length is 160000-------\n")

# The remaining records also load without error, ruling out malformed data.
with open(temp_path, "w") as f:
    json.dump(data[160000:], f)
dataset = load_dataset("json", data_files=temp_path)
print("\n-------This works, ruling out data issues-------\n")

# But loading the first 170000 records fails.
with open(temp_path, "w") as f:
    json.dump(data[:170000], f)
dataset = load_dataset("json", data_files=temp_path)

Expected behavior

-------the JSON file length is: 173049-------

Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-acf3c7f418c5f4b4/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...
Downloading data files: 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 1/1 [00:00<00:00, 3328.81it/s]
Extracting data files: 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 1/1 [00:00<00:00, 639.47it/s]
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-acf3c7f418c5f4b4/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.
100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 1/1 [00:00<00:00, 265.85it/s]

-------This works when the JSON file length is 160000-------

Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-a42f04b263ceea6a/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...
Downloading data files: 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 1/1 [00:00<00:00, 2038.05it/s]
Extracting data files: 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 1/1 [00:00<00:00, 794.83it/s]
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-a42f04b263ceea6a/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.
100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 1/1 [00:00<00:00, 681.00it/s]

-------This works, ruling out data issues-------

Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-63f391c89599c7b0/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...
Downloading data files: 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 1/1 [00:00<00:00, 3682.44it/s]
Extracting data files: 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 1/1 [00:00<00:00, 788.70it/s]
Generating train split: 0 examples [00:00, ? examples/s]Failed to read file '/home/lakala/hjc/code/pycode/glm/temp.json' with error <class 'pyarrow.lib.ArrowInvalid'>: cannot mix list and non-list, non-null values
Traceback (most recent call last):
  File "/home/lakala/conda/envs/glm/lib/python3.8/site-packages/datasets/builder.py", line 1858, in _prepare_split_single
    for _, table in generator:
  File "/home/lakala/conda/envs/glm/lib/python3.8/site-packages/datasets/packaged_modules/json/json.py", line 146, in _generate_tables
    raise ValueError(f"Not able to read records in the JSON file at {file}.") from None
ValueError: Not able to read records in the JSON file at /home/lakala/hjc/code/pycode/glm/temp.json.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/lakala/hjc/code/pycode/glm/test.py", line 22, in <module>
    dataset = load_dataset("json", data_files=temp_path)
  File "/home/lakala/conda/envs/glm/lib/python3.8/site-packages/datasets/load.py", line 1797, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/lakala/conda/envs/glm/lib/python3.8/site-packages/datasets/builder.py", line 890, in download_and_prepare
    self._download_and_prepare(
  File "/home/lakala/conda/envs/glm/lib/python3.8/site-packages/datasets/builder.py", line 985, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/lakala/conda/envs/glm/lib/python3.8/site-packages/datasets/builder.py", line 1746, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/home/lakala/conda/envs/glm/lib/python3.8/site-packages/datasets/builder.py", line 1891, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

Environment info

Ubuntu==22.04
python==3.8
pytorch-transformers==1.2.0
transformers==4.27.1
datasets==2.12.0
numpy==1.24.3
pandas==1.5.3
mariosasko commented 1 year ago

This is the actual error:

Failed to read file '/home/lakala/hjc/code/pycode/glm/temp.json' with error <class 'pyarrow.lib.ArrowInvalid'>: cannot mix list and non-list, non-null values

Which means some samples are incorrectly formatted.

PyArrow, a storage backend that we use under the hood, requires that all the list elements have the same level of nesting (same number of dimensions) or are None.

import pyarrow as pa
pa.array([[1, 2, 3], 2]) # ArrowInvalid: cannot mix list and non-list, non-null values
pa.array([[1, 2, 3], [2]]) # works
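
If some rows really are malformed, a minimal sketch like this could locate them (not part of datasets; it assumes the whole file fits in memory and mirrors the column layout above):

import json

# Sketch: scan each column for a mix of list and non-list, non-null values,
# which is exactly what triggers the ArrowInvalid error.
with open("temp.json", encoding="utf-8") as f:
    rows = json.load(f)

keys = set().union(*(row.keys() for row in rows))
for col in keys:
    non_null = [row.get(col) for row in rows if row.get(col) is not None]
    kinds = {isinstance(value, list) for value in non_null}
    if len(kinds) > 1:
        bad = [i for i, row in enumerate(rows)
               if row.get(col) is not None and not isinstance(row.get(col), list)]
        print(f"column {col!r} mixes list and non-list values, e.g. at rows {bad[:10]}")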
Night-Quiet commented 1 year ago

@mariosasko I used the same operation to check the original data before and after slicing, as reflected in my code. 160000 is not a precise threshold; I can also get output using 150000. This doesn't seem to align with your explanation: if only some samples were incorrectly formatted, then one of the two slices (front or back) should also raise an error. Thank you for your reply.

mariosasko commented 1 year ago

Our JSON loader does the following in your case:

import json
import pyarrow as pa

file = "temp.json"  # path to the local JSON file

with open(file, encoding="utf-8") as f:
    dataset = json.load(f)
keys = set().union(*[row.keys() for row in dataset])
mapping = {col: [row.get(col) for row in dataset] for col in keys}
pa_table = pa.Table.from_pydict(mapping)  # the ArrowInvalid error comes from here

So if this code throws an error with correctly-formatted JSON, then this is an Arrow bug and should be reported in their repo.
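
For illustration, a single inconsistent row reproduces the error through this same path (toy data; the 'history' column name is taken from the data format in the issue):

import pyarrow as pa

# Toy example: one row stores 'history' as a plain string instead of a list,
# which is enough to trigger the error shown in the traceback above.
mapping = {"history": [["a", "b"], "c"]}
pa.Table.from_pydict(mapping)  # ArrowInvalid: cannot mix list and non-list, non-null values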

I used the same operation to check the original data before and after slicing, as reflected in my code. 160000 is not a precise threshold; I can also get output using 150000. This doesn't seem to align with your explanation: if only some samples were incorrectly formatted, then one of the two slices (front or back) should also raise an error.

You should shuffle the data before slicing to make sure that's not the case.
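
A minimal sketch of that check, reusing the file names from the reproduction script above:

import json
import random

# Shuffle before slicing so a handful of malformed rows can't end up
# entirely inside one slice and escape detection.
with open("target.json", encoding="utf-8") as f:
    data = json.load(f)

random.seed(0)  # fixed seed so the split is reproducible
random.shuffle(data)

with open("temp.json", "w", encoding="utf-8") as f:
    json.dump(data[:160000], f)
# then run load_dataset("json", data_files="temp.json") on each shuffled slice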

Night-Quiet commented 1 year ago

@mariosasko Thank you. I will try again.