Describe the bug
When using the Dataset.to_json() function with lines=False, an unexpected error occurs. The stored data should be a single JSON list, but the file actually ends up containing multiple lists, which causes an error when the data is read back.
The reason is that to_json() writes to the file in several segments based on the batch size. This is not a problem when lines=True, but it is incorrect when lines=False, because writing in several passes produces multiple top-level lists (when len(dataset) > batch_size).
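To make the failure mode concrete, here is a minimal sketch with toy data (not the library's internal code) showing what appending independently serialized batches to one file produces:

import json

# Toy sketch: each batch is serialized as its own JSON list and appended to
# the same file, which is effectively what happens per write when lines=False.
batches = [[{"id": 0}, {"id": 1}], [{"id": 2}, {"id": 3}]]
with open("broken.json", "w", encoding="utf-8") as f:
    for batch in batches:
        f.write(json.dumps(batch, indent=2))

with open("broken.json", encoding="utf-8") as f:
    json.loads(f.read())  # json.decoder.JSONDecodeError: Extra data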
Steps to reproduce the bug
Try this code:
from datasets import load_dataset
import json
train_dataset = load_dataset("Anthropic/hh-rlhf", data_dir="harmless-base")["train"]
output_path = "./harmless-base_hftojs.json"
print(len(train_dataset))
train_dataset.to_json(output_path, lines=False, force_ascii=False, indent=2)
with open(output_path, encoding="utf-8") as f:
    data = json.loads(f.read())
It raises an error: json.decoder.JSONDecodeError: Extra data: line 4003 column 1 (char 1373709)
Extra square brackets appear in the output file at the batch boundaries, so the file contains several consecutive JSON lists instead of a single one:
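A small sketch (assuming the output path from the reproduction above) that decodes the file piece by piece shows it holds more than one top-level list:

import json

# Decode the file value by value to count how many top-level JSON lists it
# actually contains (path taken from the reproduction above).
decoder = json.JSONDecoder()
with open("./harmless-base_hftojs.json", encoding="utf-8") as f:
    text = f.read()

values, pos = [], 0
while pos < len(text):
    value, pos = decoder.raw_decode(text, pos)
    values.append(value)
    # skip any whitespace between the concatenated lists
    while pos < len(text) and text[pos].isspace():
        pos += 1

print(len(values))  # expected 1, but > 1 when len(dataset) > batch_size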
Expected behavior
The output file contains a single JSON list, and the code above runs without error.
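As a possible workaround until this is fixed, here is a sketch that bypasses to_json() entirely (it assumes the dataset fits in memory) and writes the expected single list:

import json
from datasets import load_dataset

train_dataset = load_dataset("Anthropic/hh-rlhf", data_dir="harmless-base")["train"]

# Iterating a Dataset yields one dict per row; dumping the list in a single
# json.dump call guarantees exactly one top-level JSON list in the file.
with open("./harmless-base_hftojs.json", "w", encoding="utf-8") as f:
    json.dump(list(train_dataset), f, ensure_ascii=False, indent=2)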
Environment info
datasets==2.20.0