Dataset size problem - Githubissues

Describe the bug

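I am fine-tuning a large language model. When the dataset contains 718 records, fine-tuning runs normally. As soon as I add one more record, either a duplicate of one of the first 718 or a completely new one, an error is raised.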
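Because the failure only shows up after a record is added, one thing worth ruling out is how the extra line was written into the file. The following is a minimal sketch, assuming the training file is JSON Lines with one `{"messages": [...]}` object per line; `append_record` is a hypothetical helper written for illustration and is not part of GLM-4 or `datasets`.

```python
import json
import os
from pathlib import Path


def append_record(path, record):
    """Append one record to a JSONL file as a single line (hypothetical helper)."""
    path = Path(path)
    # If the existing file does not end with a newline, add one first so the
    # new record is not glued onto the previous record.
    needs_newline = False
    if path.exists() and path.stat().st_size > 0:
        with path.open("rb") as f:
            f.seek(-1, os.SEEK_END)
            needs_newline = f.read(1) != b"\n"
    with path.open("a", encoding="utf-8") as f:
        if needs_newline:
            f.write("\n")
        # json.dumps escapes embedded newlines, so the record stays on one line;
        # ensure_ascii=False keeps Chinese text readable in the file.
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Writing the record through `json.dumps` rather than pasting it by hand avoids stray commas, unescaped quotes, and missing line breaks, any of which could plausibly produce a parse error at the appended row.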

Steps to reproduce the bug

1. The last two records of the dataset that fine-tunes successfully are:

{ "messages": [ { "role": "user", "content": "完成校正装置设计后需要进行哪些工作？" }, { "role": "assistant", "content": "一旦完成校正装置设计后，需要进行系统实际调校工作，或者进行MATLAB仿真以检查系统的时间响应特性。这时，需将系统建模时省略的部分尽可能加入系统，以保证仿真结果的逼真度。如果由于系统各种固有非线性因素影响，或者由于系统噪声和负载效应等因素的影响，使已校正系统不能满足全部性能指标要求，则需要适当调整校正装置的形式或参数，直到已校正系统满足全部性能指标为止。" } ] }

{ "messages": [ { "role": "user", "content": "为什么闭环带宽要求会影响串联超前校正的效果？" }, { "role": "assistant", "content": "如果待校正系统不稳定，为了满足相角裕度的要求，超前网络需要提供很大的相角超前量。这会导致超前网络的a值必须选得很大，从而使已校正系统的带宽过大。带宽过大会使通过系统的高频噪声电平变高，可能导致系统失控。因此，闭环带宽的要求会影响串联超前校正的效果。" } ] }

2. When I append the second-to-last record to the dataset again, the last three records become:

{ "messages": [ { "role": "user", "content": "完成校正装置设计后需要进行哪些工作？" }, { "role": "assistant", "content": "一旦完成校正装置设计后，需要进行系统实际调校工作，或者进行MATLAB仿真以检查系统的时间响应特性。这时，需将系统建模时省略的部分尽可能加入系统，以保证仿真结果的逼真度。如果由于系统各种固有非线性因素影响，或者由于系统噪声和负载效应等因素的影响，使已校正系统不能满足全部性能指标要求，则需要适当调整校正装置的形式或参数，直到已校正系统满足全部性能指标为止。" } ] }

{ "messages": [ { "role": "user", "content": "完成校正装置设计后需要进行哪些工作？" }, { "role": "assistant", "content": "一旦完成校正装置设计后，需要进行系统实际调校工作，或者进行MATLAB仿真以检查系统的时间响应特性。这时，需将系统建模时省略的部分尽可能加入系统，以保证仿真结果的逼真度。如果由于系统各种固有非线性因素影响，或者由于系统噪声和负载效应等因素的影响，使已校正系统不能满足全部性能指标要求，则需要适当调整校正装置的形式或参数，直到已校正系统满足全部性能指标为止。" } ] }

At this point the following error is reported:

root@autodl-container-027f4cad3d-6baf4e64:~/autodl-tmp# python GLM-4/finetune_demo/finetune.py datasets/ ZhipuAI/glm-4-9b-chat GLM-4/finetune_demo/configs/lora.yaml
Loading checkpoint shards: 100%|██████████| 10/10 [00:02<00:00, 4.04it/s]
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
trainable params: 2,785,280 || all params: 9,402,736,640 || trainable%: 0.0296
Generating train split: 0 examples [00:00, ? examples/s]
Failed to load JSON from file '/root/autodl-tmp/datasets/train.jsonl' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Missing a name for object member. in row 718
Generating train split: 0 examples [00:00, ? examples/s]

Traceback (most recent call last):

/root/miniconda3/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py:153 in _generate_tables

  150 │ │ │ │ │ │ │ │ with open(
  151 │ │ │ │ │ │ │ │ │ file, encoding=self.config.encoding, errors=self.con
  152 │ │ │ │ │ │ │ │ ) as f:
❱ 153 │ │ │ │ │ │ │ │ │ df = pd.read_json(f, dtype_backend="pyarrow")
  154 │ │ │ │ │ │ │ except ValueError:
  155 │ │ │ │ │ │ │ │ logger.error(f"Failed to load JSON from file '{file}' wi
  156 │ │ │ │ │ │ │ │ raise e

/root/miniconda3/lib/python3.10/site-packages/pandas/io/json/_json.py:815 in read_json

  812 │ if chunksize:
  813 │ │ return json_reader
  814 │ else:
❱ 815 │ │ return json_reader.read()
  816
  817
  818 class JsonReader(abc.Iterator, Generic[FrameSeriesStrT]):

/root/miniconda3/lib/python3.10/site-packages/pandas/io/json/_json.py:1025 in read

  1022 │ │ │ │ │ │ data_lines = data.split("\n")
  1023 │ │ │ │ │ │ obj = self._get_object_parser(self._combine_lines(data_lines))
  1024 │ │ │ │ else:
❱ 1025 │ │ │ │ │ obj = self._get_object_parser(self.data)
  1026 │ │ │ │ if self.dtype_backend is not lib.no_default:
  1027 │ │ │ │ │ return obj.convert_dtypes(
  1028 │ │ │ │ │ │ infer_objects=False, dtype_backend=self.dtype_backend

/root/miniconda3/lib/python3.10/site-packages/pandas/io/json/_json.py:1051 in _get_object_parser

  1048 │ │ }
  1049 │ │ obj = None
  1050 │ │ if typ == "frame":
❱ 1051 │ │ │ obj = FrameParser(json, **kwargs).parse()
  1052
  1053 │ │ if typ == "series" or obj is None:
  1054 │ │ │ if not isinstance(dtype, bool):

/root/miniconda3/lib/python3.10/site-packages/pandas/io/json/_json.py:1187 in parse

  1184
  1185 │ @final
  1186 │ def parse(self):
❱ 1187 │ │ self._parse()
  1188
  1189 │ │ if self.obj is None:
  1190 │ │ │ return None

/root/miniconda3/lib/python3.10/site-packages/pandas/io/json/_json.py:1403 in _parse

  1400
  1401 │ │ if orient == "columns":
  1402 │ │ │ self.obj = DataFrame(
❱ 1403 │ │ │ │ ujson_loads(json, precise_float=self.precise_float), dtype=None
  1404 │ │ │ )
  1405 │ │ elif orient == "split":
  1406 │ │ │ decoded = {

ValueError: Trailing data

During handling of the above exception, another exception occurred:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

/root/autodl-tmp/GLM-4/finetune_demo/finetune.py:406 in main

  403 ):
  404 │ ft_config = FinetuningConfig.from_file(config_file)
  405 │ tokenizer, model = load_tokenizer_and_model(model_dir, peft_config=ft_config.peft_co
❱ 406 │ data_manager = DataManager(data_dir, ft_config.data_config)
  407
  408 │ train_dataset = data_manager.get_dataset(
  409 │ │ Split.TRAIN,

/root/autodl-tmp/GLM-4/finetune_demo/finetune.py:204 in __init__

  201 │ def __init__(self, data_dir: str, data_config: DataConfig):
  202 │ │ self._num_proc = data_config.num_proc
  203
❱ 204 │ │ self._dataset_dct = _load_datasets(
  205 │ │ │ data_dir,
  206 │ │ │ data_config.data_format,
  207 │ │ │ data_config.data_files,

/root/autodl-tmp/GLM-4/finetune_demo/finetune.py:189 in _load_datasets

  186 │ │ num_proc: Optional[int],
  187 ) -> DatasetDict:
  188 │ if data_format == '.jsonl':
❱ 189 │ │ dataset_dct = load_dataset(
  190 │ │ │ data_dir,
  191 │ │ │ data_files=data_files,
  192 │ │ │ split=None,

/root/miniconda3/lib/python3.10/site-packages/datasets/load.py:2616 in load_dataset

  2613 │ │ return builder_instance.as_streaming_dataset(split=split)
  2614
  2615 │ # Download and prepare data
❱ 2616 │ builder_instance.download_and_prepare(
  2617 │ │ download_config=download_config,
  2618 │ │ download_mode=download_mode,
  2619 │ │ verification_mode=verification_mode,

/root/miniconda3/lib/python3.10/site-packages/datasets/builder.py:1029 in download_and_prepare

  1026 │ │ │ │ │ │ │ prepare_split_kwargs["max_shard_size"] = max_shard_size
  1027 │ │ │ │ │ │ if num_proc is not None:
  1028 │ │ │ │ │ │ │ prepare_split_kwargs["num_proc"] = num_proc
❱ 1029 │ │ │ │ │ │ self._download_and_prepare(
  1030 │ │ │ │ │ │ │ dl_manager=dl_manager,
  1031 │ │ │ │ │ │ │ verification_mode=verification_mode,
  1032 │ │ │ │ │ │ │ **prepare_split_kwargs,

/root/miniconda3/lib/python3.10/site-packages/datasets/builder.py:1124 in _download_and_prepare

  1121
  1122 │ │ │ try:
  1123 │ │ │ │ # Prepare split will record examples associated to the split
❱ 1124 │ │ │ │ self._prepare_split(split_generator, **prepare_split_kwargs)
  1125 │ │ │ except OSError as e:
  1126 │ │ │ │ raise OSError(
  1127 │ │ │ │ │ "Cannot find data file. "

/root/miniconda3/lib/python3.10/site-packages/datasets/builder.py:1884 in _prepare_split

  1881 │ │ │ gen_kwargs = split_generator.gen_kwargs
  1882 │ │ │ job_id = 0
  1883 │ │ │ with pbar:
❱ 1884 │ │ │ │ for job_id, done, content in self._prepare_split_single(
  1885 │ │ │ │ │ gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
  1886 │ │ │ │ ):
  1887 │ │ │ │ │ if done:

/root/miniconda3/lib/python3.10/site-packages/datasets/builder.py:2040 in _prepare_split_single

  2037 │ │ │ │ e = e.__context__
  2038 │ │ │ if isinstance(e, DatasetGenerationError):
  2039 │ │ │ │ raise
❱ 2040 │ │ │ raise DatasetGenerationError("An error occurred while generating the dataset
  2041
  2042 │ │ yield job_id, True, (total_num_examples, total_num_bytes, writer.features, num
  2043

DatasetGenerationError: An error occurred while generating the dataset

3. Could you please help me resolve this?
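The log above points at a JSON parse error near row 718 of `train.jsonl`. A quick way to narrow this down is to check every line of the file with the standard-library `json` module before handing it to `load_dataset`. This is only a sketch, assuming the file is meant to be JSON Lines (one complete object per line); `find_bad_lines` is a hypothetical helper name, and the row number in the log may be zero-based, so it is worth looking at both line 718 and line 719.

```python
import json


def find_bad_lines(path):
    """Return (line_number, message) pairs for lines that are not a standalone JSON object."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                # Flagged for inspection only; a blank line is not necessarily the cause.
                problems.append((lineno, "blank line"))
                continue
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as exc:
                # exc.msg and exc.colno locate the syntax problem inside the line.
                problems.append((lineno, f"{exc.msg} (column {exc.colno})"))
                continue
            if not isinstance(obj, dict):
                problems.append((lineno, "not a JSON object"))
    return problems
```

Calling `find_bad_lines("/root/autodl-tmp/datasets/train.jsonl")` and printing the result would show whether the appended record, or the line just before it, is malformed, for example because of a trailing comma or a record split across lines.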

Expected behavior

I hope this problem can be resolved.

Environment info


datasets version: 2.20.0
Platform: Linux-4.19.90-2107.6.0.0192.8.oe1.bclinux.x86_64-x86_64-with-glibc2.35
Python version: 3.10.8
huggingface_hub version: 0.24.6
PyArrow version: 16.1.0
Pandas version: 2.2.2
fsspec version: 2023.12.2

huggingface / datasets

Dataset size problem #7233
