huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

A bug in the Dataset.to_json() function #7037

Open LinglingGreat opened 4 months ago

LinglingGreat commented 4 months ago

Describe the bug

When using the Dataset.to_json() function with lines=False, an unexpected error occurs. The stored data should be a single JSON list, but it actually ends up as multiple lists, which causes an error when the data is read back. The reason is that to_json() writes the file in several segments based on the batch size. This is fine when lines=True, but it is incorrect when lines=False, because writing in several passes produces multiple JSON arrays (whenever len(dataset) > batch_size).
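
A smaller, self-contained sketch of the same problem (the toy dataset and the toy.json file name are just for illustration): any dataset longer than batch_size written with lines=False gets one JSON array per batch, so the file is no longer a single valid JSON document.

from datasets import Dataset
import json

# Toy dataset of 10 rows; any dataset longer than batch_size shows the same behavior.
ds = Dataset.from_dict({"text": [str(i) for i in range(10)]})

# With lines=False and batch_size=4, each batch is written as its own JSON array,
# so the file contains several back-to-back arrays instead of one list.
ds.to_json("toy.json", lines=False, batch_size=4)

with open("toy.json", encoding="utf-8") as f:
    json.loads(f.read())  # json.decoder.JSONDecodeError: Extra data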

Steps to reproduce the bug

Try this code:

from datasets import load_dataset
import json

train_dataset = load_dataset("Anthropic/hh-rlhf", data_dir="harmless-base")["train"]
output_path = "./harmless-base_hftojs.json"
print(len(train_dataset))
train_dataset.to_json(output_path, lines=False, force_ascii=False, indent=2)

with open(output_path, encoding="utf-8") as f:
    data = json.loads(f.read())

It raises: json.decoder.JSONDecodeError: Extra data: line 4003 column 1 (char 1373709)

Extra square brackets have appeared here:

[screenshot of the output file showing the extra square brackets]

Expected behavior

The code runs without error: with lines=False, to_json() should write a single JSON list that json.loads() can read back.

Environment info

datasets==2.20.0
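
A possible workaround sketch, assuming the batch-wise writing described above is the cause: pass a batch_size that covers the whole dataset so to_json() writes a single array, or bypass to_json() and serialize the records with the json module (both snippets reuse train_dataset and output_path from the reproduction).

# Workaround 1: force a single write pass so only one JSON array is emitted.
train_dataset.to_json(
    output_path,
    lines=False,
    force_ascii=False,
    indent=2,
    batch_size=len(train_dataset),
)

# Workaround 2: dump the records directly with the json module.
with open(output_path, "w", encoding="utf-8") as f:
    json.dump(train_dataset.to_list(), f, ensure_ascii=False, indent=2)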

albertvillanova commented 4 months ago

Thanks for reporting, @LinglingGreat.

I confirm this is a bug.

varadhbhatnagar commented 2 months ago

@albertvillanova I would like to take a shot at this if you aren't working on it currently. Let me know!