modelscope / data-juicer

A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Apache License 2.0

[Bug]: Running tools/analyze_data.py fails with KeyError: 'text' #296

Closed promisecc closed 5 months ago

promisecc commented 5 months ago

Before Reporting

Search before reporting

OS

macOS

Installation Method

from source

Data-Juicer Version

latest version

Python Version

3.9

Describe the bug

2024-04-15 16:15:39 | ERROR | __main__:13 - An error has been caught in function '<module>', process 'MainProcess' (83335), thread 'MainThread' (8040734208):
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1353, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3358, in _map_single
    example = apply_function_on_filtered_inputs(example, i, offset=offset)
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3261, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
    return f(*args, **kargs)
  File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
    return f(*args, **kargs)
  File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/ops/filter/language_id_score_filter.py", line 56, in compute_stats
    text = sample[self.text_key].lower().replace('\n', ' ')
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 280, in __getitem__
    value = self.data[key]
KeyError: 'text'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 13, in <module>
    main()
    └ <function main at 0x122843dc0>

  File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 9, in main
    analyser.run()
    │        └ <function Analyser.run at 0x17edb4f70>
    └ <data_juicer.core.analyser.Analyser object at 0x101d91250>

  File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/analyser.py", line 100, in run
    dataset = dataset.map(op.compute_stats,
              │       │   │  └ <function LanguageIDScoreFilter.compute_stats at 0x16f77a280>
              │       │   └ <data_juicer.ops.filter.language_id_score_filter.LanguageIDScoreFilter object at 0x16f77fac0>
              │       └ <function NestedDataset.map at 0x17edb44c0>
              └ Dataset({
                    features: ['instruction', 'input', 'output', '__dj__stats__'],
                    num_rows: 1000
                })

  File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 180, in map
    new_ds = NestedDataset(super().map(*args, **kargs))
             │                          │       └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '8d414177a08f8240'}
             │                          └ [<function LanguageIDScoreFilter.compute_stats at 0x16f77a1f0>]
             └ <class 'data_juicer.core.data.NestedDataset'>

  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
         │                                 │    │      │       └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '8d414177a08f8240'}
         │                                 │    │      └ (<function LanguageIDScoreFilter.compute_stats at 0x16f77a1f0>,)
         │                                 │    └ Dataset({
         │                                 │          features: ['instruction', 'input', 'output', '__dj__stats__'],
         │                                 │          num_rows: 1000
         │                                 │      })
         │                                 └ <function Dataset.map at 0x139ca51f0>
         └ typing.Union
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
         │                                 │    │      │       └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '8d414177a08f8240'}
         │                                 │    │      └ (<function LanguageIDScoreFilter.compute_stats at 0x16f77a1f0>,)
         │                                 │    └ Dataset({
         │                                 │          features: ['instruction', 'input', 'output', '__dj__stats__'],
         │                                 │          num_rows: 1000
         │                                 │      })
         │                                 └ <function Dataset.map at 0x139ca5160>
         └ typing.Union
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3097, in map
    for rank, done, content in iflatmap_unordered(
        │     │     │          └ <function iflatmap_unordered at 0x1394380d0>
        │     │     └ 0
        │     └ False
        └ 3
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in iflatmap_unordered
    [async_result.get() for async_result in async_results]
                                            └ [<multiprocess.pool.ApplyResult object at 0x16f839e50>, <multiprocess.pool.ApplyResult object at 0x16f850790>, <multiprocess....
(data_juicer) guangshengliu@MacBook-Air data % cd data-juicer
(data_juicer) guangshengliu@MacBook-Air data-juicer % python tools/analyze_data.py --config configs/demo/analyser.yaml
2024-04-15 16:20:47.909 | DEBUG | data_juicer.utils.availability_utils:_is_package_available:116 - Detected torch version 2.2.2
2024-04-15 16:20:48.682 | INFO | data_juicer:setup_mp:58 - Setting multiprocess start method to 'fork'.
2024-04-15 16:20:48.682 | DEBUG | data_juicer:setup_cuda:72 - _USE_CUDA: False | MP: fork (MainProcess)
2024-04-15 16:20:51 | INFO | data_juicer.config.config:533 - Back up the input config file [/Users/guangshengliu/LLM/data/data-juicer/configs/demo/analyser.yaml] into the work_dir [/Users/guangshengliu/LLM/data/data-juicer/outputs/demo-analyser]
2024-04-15 16:20:51 | INFO | data_juicer.config.config:554 - Configuration table:
key                    │ values
═══════════════════════╪════════════════════════════════════════════════════════════
config                 │ [Path_fr(configs/demo/analyser.yaml, cwd=/Users/guangshengliu/LLM/data/data-juicer)]
hpo_config             │ None
path_3sigma_recipe     │ None
project_name           │ 'demo-analyser'
executor_type          │ 'default'
dataset_path           │ '/Users/guangshengliu/LLM/data/data-juicer/raw_data/aqua_train.json'
export_path            │ '/Users/guangshengliu/LLM/data/data-juicer/outputs/demo-analyser/demo-analyser-result.jsonl'
export_shard_size      │ 0
export_in_parallel     │ False
keep_stats_in_res_ds   │ False
keep_hashes_in_res_ds  │ False
np                     │ 4
text_keys              │ ['instruction', 'output']
image_key              │ 'images'
image_special_token    │ '<__dj__image>'
audio_key              │ 'audios'
audio_special_token    │ '<__dj__audio>'
video_key              │ 'videos'
video_special_token    │ '<__dj__video>'
eoc_special_token      │ '<|__dj__eoc|>'
suffixes               │ []
use_cache              │ True
ds_cache_dir           │ '/Users/guangshengliu/.cache/huggingface/datasets'
cache_compress         │ None
use_checkpoint         │ False
temp_dir               │ None
open_tracer            │ False
op_list_to_trace       │ []
trace_num              │ 10
op_fusion              │ False
process                │ [{'language_id_score_filter': {'accelerator': 'cpu',
                       │                                'audio_key': 'audios',
                       │                                'cpu_required': 1,
                       │                                'image_key': 'images',
                       │                                'lang': 'zh',
                       │                                'mem_required': 0,
                       │                                'min_score': 0.8,
                       │                                'spec_numprocs': 0,
                       │                                'text_key': 'text',
                       │                                'use_actor': False,
                       │                                'video_key': 'videos'}},
                       │  {'perplexity_filter': {'accelerator': 'cpu',
                       │                         'audio_key': 'audios',
                       │                         'cpu_required': 1,
                       │                         'image_key': 'images',
                       │                         'lang': 'zh',
                       │                         'max_ppl': 1500,
                       │                         'mem_required': 0,
                       │                         'spec_numprocs': 0,
                       │                         'text_key': 'text',
                       │                         'use_actor': False,
                       │                         'video_key': 'videos'}}]
save_stats_in_one_file │ False
ray_address            │ 'auto'
work_dir               │ '/Users/guangshengliu/LLM/data/data-juicer/outputs/demo-analyser'
timestamp              │ '20240415162051'
dataset_dir            │ '/Users/guangshengliu/LLM/data/data-juicer/raw_data'
add_suffix             │ False
2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:39 - Using cache compression method: [None]
2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:44 - Setting up data formatter...
2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:53 - Preparing exporter...
2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:75 - Loading dataset from data formatter...
2024-04-15 16:20:53 | INFO | datasets.load:1791 - Downloading and preparing dataset json/default to /Users/guangshengliu/.cache/huggingface/datasets/json/default-f3511e4c073db59c/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e...
Downloading data files: 100%|##########| 1/1 [00:00<00:00, 5377.31it/s]
Extracting data files: 100%|##########| 1/1 [00:00<00:00, 484.95it/s]
2024-04-15 16:20:53 | INFO | logging:952 - Setting num_proc from 4 back to 1 for the json split to disable multiprocessing as it only contains one shard.
2024-04-15 16:20:53 | INFO | datasets.load:1791 - Dataset json downloaded and prepared to /Users/guangshengliu/.cache/huggingface/datasets/json/default-f3511e4c073db59c/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e. Subsequent calls will reuse this data.
100%|##########| 1/1 [00:00<00:00, 352.61it/s]
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:185 - Unifying the input dataset formats...
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:200 - There are 2728 sample(s) in the original dataset.
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:214 - 2728 samples left after filtering empty text.
2024-04-15 16:20:53 | INFO | data_juicer.format.mixture_formatter:136 - sampled 2728 from 2728
2024-04-15 16:20:53 | INFO | data_juicer.format.mixture_formatter:142 - There are 2728 in final dataset
2024-04-15 16:20:53 | INFO | data_juicer.core.analyser:81 - Preparing process operators...
2024-04-15 16:20:53 | INFO | data_juicer.utils.model_utils:102 - Loading fasttext language identification model...
Warning : load_model does not return WordVectorModel or SupervisedModel any more, but a FastText object which is very similar.
2024-04-15 16:20:53 | WARNING | data_juicer.ops.load:24 - This OP [perplexity_filter] is unavailable due to importing third-party requirements of this OP failure: ['sentencepiece', 'kenlm']. You can either run pip install -v -e .[sci] to install all requirements for all OPs, or run pip install sentencepiece kenlm with library version specified by environments/science_requires.txt to install libraries required by this OP. Data processing will skip this OP later.
2024-04-15 16:20:53 | INFO | data_juicer.core.analyser:86 - Computing the stats of dataset...
2024-04-15 16:20:53 | ERROR | __main__:13 - An error has been caught in function '<module>', process 'MainProcess' (84118), thread 'MainThread' (8040734208):
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1353, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3358, in _map_single
    example = apply_function_on_filtered_inputs(example, i, offset=offset)
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3261, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
    return f(*args, **kargs)
  File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
    return f(*args, **kargs)
  File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/ops/filter/language_id_score_filter.py", line 56, in compute_stats
    text = sample[self.text_key].lower().replace('\n', ' ')
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 280, in __getitem__
    value = self.data[key]
KeyError: 'text'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 13, in <module>
    main()
    └ <function main at 0x124c03dc0>

  File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 9, in main
    analyser.run()
    │        └ <function Analyser.run at 0x291876f70>
    └ <data_juicer.core.analyser.Analyser object at 0x10405c250>

  File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/analyser.py", line 100, in run
    dataset = dataset.map(op.compute_stats,
              │       │   │  └ <function LanguageIDScoreFilter.compute_stats at 0x291ebea60>
              │       │   └ <data_juicer.ops.filter.language_id_score_filter.LanguageIDScoreFilter object at 0x16d487d60>
              │       └ <function NestedDataset.map at 0x2918764c0>
              └ Dataset({
                    features: ['instruction', 'input', 'output', '__dj__stats__'],
                    num_rows: 2728
                })

  File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 180, in map
    new_ds = NestedDataset(super().map(*args, **kargs))
             │                          │       └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
             │                          └ [<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>]
             └ <class 'data_juicer.core.data.NestedDataset'>

  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
         │                                 │    │      │       └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
         │                                 │    │      └ (<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>,)
         │                                 │    └ Dataset({
         │                                 │          features: ['instruction', 'input', 'output', '__dj__stats__'],
         │                                 │          num_rows: 2728
         │                                 │      })
         │                                 └ <function Dataset.map at 0x13fad51f0>
         └ typing.Union
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
         │                                 │    │      │       └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
         │                                 │    │      └ (<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>,)
         │                                 │    └ Dataset({
         │                                 │          features: ['instruction', 'input', 'output', '__dj__stats__'],
         │                                 │          num_rows: 2728
         │                                 │      })
         │                                 └ <function Dataset.map at 0x13fad5160>
         └ typing.Union
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3097, in map
    for rank, done, content in iflatmap_unordered(
        │     │     │          └ <function iflatmap_unordered at 0x13f2780d0>
        │     │     └ 0
        │     └ False
        └ 3
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in iflatmap_unordered
    [async_result.get() for async_result in async_results]
                                            └ [<multiprocess.pool.ApplyResult object at 0x291e2a670>, <multiprocess.pool.ApplyResult object at 0x291e2a790>, <multiprocess....
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in <listcomp>
    [async_result.get() for async_result in async_results]
     │            │         └ <multiprocess.pool.ApplyResult object at 0x291e2a670>
     │            └ <function ApplyResult.get at 0x13f2764c0>
     └ <multiprocess.pool.ApplyResult object at 0x291e2a670>
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/multiprocess/pool.py", line 771, in get
    raise self._value
          │    └ KeyError('text')
          └ <multiprocess.pool.ApplyResult object at 0x291e2a670>

KeyError: 'text'
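
The log above points at a key mismatch: the dataset's features are instruction, input and output, while both configured ops show 'text_key': 'text', so sample['text'] has nothing to look up. The following is a minimal sketch of one possible data-side workaround, not a confirmed fix from the maintainers: it checks a record for the expected key and joins the existing fields into a single text field. The helper names missing_keys and merge_text_fields and the sample record are hypothetical and are not part of Data-Juicer.

```python
import json


# Hypothetical helper (not part of Data-Juicer): report which of the
# required keys a record lacks.
def missing_keys(sample, required_keys):
    return [key for key in required_keys if key not in sample]


# Hypothetical helper (not part of Data-Juicer): copy the record and join
# its non-empty source fields into one target field.
def merge_text_fields(sample,
                      source_keys=("instruction", "input", "output"),
                      target_key="text",
                      sep="\n"):
    parts = [str(sample[key]) for key in source_keys if sample.get(key)]
    merged = dict(sample)
    merged[target_key] = sep.join(parts)
    return merged


# Made-up record using the instruction/input/output layout seen in the log.
sample = {"instruction": "Solve 2 + 2.", "input": "", "output": "4"}

print(missing_keys(sample, ["text"]))  # ['text']
fixed = merge_text_fields(sample)
print(json.dumps(fixed, ensure_ascii=False))
print(missing_keys(fixed, ["text"]))   # []
```

Applying merge_text_fields to every record and writing the result out as a new JSON file would give the ops a text field to read. Whether pointing the ops at an existing field through the config is the better route depends on how this Data-Juicer version resolves text_keys, which the log alone does not settle.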

To Reproduce 如何复现

2024-04-15 16:15:39 | ERROR | main:13 - An error has been caught in function '', process 'MainProcess' (83335), thread 'MainThread' (8040734208): multiprocess.pool.RemoteTraceback: """ Traceback (most recent call last): File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/multiprocess/pool.py", line 125, in worker result = (True, func(args, kwds)) File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1353, in _write_generator_to_queue for i, result in enumerate(func(kwargs)): File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3358, in _map_single example = apply_function_on_filtered_inputs(example, i, offset=offset) File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3261, in apply_function_on_filtered_inputs processed_inputs = function(fn_args, *additional_args, fn_kwargs) File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f return f(*args, *kargs) File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f return f(args, kargs) File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/ops/filter/language_id_score_filter.py", line 56, in compute_stats text = sample[self.text_key].lower().replace('\n', ' ') File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 280, in getitem value = self.data[key] KeyError: 'text' """

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 13, in main() └ <function main at 0x122843dc0>

File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 9, in main analyser.run() │ └ <function Analyser.run at 0x17edb4f70> └ <data_juicer.core.analyser.Analyser object at 0x101d91250>

File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/analyser.py", line 100, in run dataset = dataset.map(op.compute_stats, │ │ │ └ <function LanguageIDScoreFilter.compute_stats at 0x16f77a280> │ │ └ <data_juicer.ops.filter.language_id_score_filter.LanguageIDScoreFilter object at 0x16f77fac0> │ └ <function NestedDataset.map at 0x17edb44c0> └ Dataset({ features: ['instruction', 'input', 'output', 'djstats__'], num_rows: 1000 })

File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 180, in map new_ds = NestedDataset(super().map(*args, **kargs)) │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '8d414177a08f8240'} │ └ [<function LanguageIDScoreFilter.compute_stats at 0x16f77a1f0>] └ <class 'data_juicer.core.data.NestedDataset'>

File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 563, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, *args, kwargs) │ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '8d414177a08f8240'} │ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x16f77a1f0>,) │ │ └ Dataset({ │ │ features: ['instruction', 'input', 'output', 'djstats'], │ │ num_rows: 1000 │ │ }) │ └ <function Dataset.map at 0x139ca51f0> └ typing.Union File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 528, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs) │ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '8d414177a08f8240'} │ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x16f77a1f0>,) │ │ └ Dataset({ │ │ features: ['instruction', 'input', 'output', 'djstats'], │ │ num_rows: 1000 │ │ }) │ └ <function Dataset.map at 0x139ca5160> └ typing.Union File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3097, in map for rank, done, content in iflatmap_unordered( │ │ │ └ <function iflatmap_unordered at 0x1394380d0> │ │ └ 0 │ └ False └ 3 File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in iflatmap_unordered [async_result.get() for async_result in async_results] └ [<multiprocess.pool.ApplyResult object at 0x16f839e50>, <multiprocess.pool.ApplyResult object at 0x16f850790>, <multiprocess.... 
(data_juicer) guangshengliu@MacBook-Air data % cd data-juicer (data_juicer) guangshengliu@MacBook-Air data-juicer % python tools/analyze_data.py --config configs/demo/analyser.yaml 2024-04-15 16:20:47.909 | DEBUG | data_juicer.utils.availability_utils:_is_package_available:116 - Detected torch version 2.2.2 2024-04-15 16:20:48.682 | INFO | data_juicer:setup_mp:58 - Setting multiprocess start method to 'fork'. 2024-04-15 16:20:48.682 | DEBUG | data_juicer:setup_cuda:72 - _USE_CUDA: False | MP: fork (MainProcess) 2024-04-15 16:20:51 | INFO | data_juicer.config.config:533 - Back up the input config file [/Users/guangshengliu/LLM/data/data-juicer/configs/demo/analyser.yaml] into the work_dir [/Users/guangshengliu/LLM/data/data-juicer/outputs/demo-analyser] 2024-04-15 16:20:51 | INFO | data_juicer.config.config:554 - Configuration table: ╒════════════════════════╤══════════════════════════════════════════════════════════════════════════════════════════════╕ │ key │ values │ ╞════════════════════════╪══════════════════════════════════════════════════════════════════════════════════════════════╡ │ config │ [Path_fr(configs/demo/analyser.yaml, cwd=/Users/guangshengliu/LLM/data/data-juicer)] │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ hpo_config │ None │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ path_3sigma_recipe │ None │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ project_name │ 'demo-analyser' │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ executor_type │ 'default' │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ dataset_path │ 
'/Users/guangshengliu/LLM/data/data-juicer/raw_data/aqua_train.json' │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ export_path │ '/Users/guangshengliu/LLM/data/data-juicer/outputs/demo-analyser/demo-analyser-result.jsonl' │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ export_shard_size │ 0 │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ export_in_parallel │ False │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ keep_stats_in_res_ds │ False │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ keep_hashes_in_res_ds │ False │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ np │ 4 │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ text_keys │ ['instruction', 'output'] │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ image_key │ 'images' │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ image_special_token │ '<dj__image>' │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ audio_key │ 'audios' │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ audio_special_token │ '<djaudio>' │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ video_key │ 'videos' │ 
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ video_special_token │ '<__dj__video>' │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ eoc_special_token │ '<|djeoc|>' │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ suffixes │ [] │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ use_cache │ True │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ ds_cache_dir │ '/Users/guangshengliu/.cache/huggingface/datasets' │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ cache_compress │ None │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ use_checkpoint │ False │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ temp_dir │ None │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ open_tracer │ False │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ op_list_to_trace │ [] │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ trace_num │ 10 │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ op_fusion │ False │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ process │ [{'language_id_score_filter': {'accelerator': 
'cpu', │ │ │ 'audio_key': 'audios', │ │ │ 'cpu_required': 1, │ │ │ 'image_key': 'images', │ │ │ 'lang': 'zh', │ │ │ 'mem_required': 0, │ │ │ 'min_score': 0.8, │ │ │ 'spec_numprocs': 0, │ │ │ 'text_key': 'text', │ │ │ 'use_actor': False, │ │ │ 'video_key': 'videos'}}, │ │ │ {'perplexity_filter': {'accelerator': 'cpu', │ │ │ 'audio_key': 'audios', │ │ │ 'cpu_required': 1, │ │ │ 'image_key': 'images', │ │ │ 'lang': 'zh', │ │ │ 'max_ppl': 1500, │ │ │ 'mem_required': 0, │ │ │ 'spec_numprocs': 0, │ │ │ 'text_key': 'text', │ │ │ 'use_actor': False, │ │ │ 'video_key': 'videos'}}] │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ save_stats_in_one_file │ False │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ ray_address │ 'auto' │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ work_dir │ '/Users/guangshengliu/LLM/data/data-juicer/outputs/demo-analyser' │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ timestamp │ '20240415162051' │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ dataset_dir │ '/Users/guangshengliu/LLM/data/data-juicer/raw_data' │ ├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤ │ add_suffix │ False │ ╘════════════════════════╧══════════════════════════════════════════════════════════════════════════════════════════════╛ 2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:39 - Using cache compression method: [None] 2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:44 - Setting up data formatter... 2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:53 - Preparing exporter... 
2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:75 - Loading dataset from data formatter...
2024-04-15 16:20:53 | INFO | datasets.load:1791 - Downloading and preparing dataset json/default to /Users/guangshengliu/.cache/huggingface/datasets/json/default-f3511e4c073db59c/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e...
Downloading data files: 100%|##########| 1/1 [00:00<00:00, 5377.31it/s]
Extracting data files: 100%|##########| 1/1 [00:00<00:00, 484.95it/s]
2024-04-15 16:20:53 | INFO | logging:952 - Setting num_proc from 4 back to 1 for the json split to disable multiprocessing as it only contains one shard.
2024-04-15 16:20:53 | INFO | datasets.load:1791 - Dataset json downloaded and prepared to /Users/guangshengliu/.cache/huggingface/datasets/json/default-f3511e4c073db59c/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e. Subsequent calls will reuse this data.
100%|##########| 1/1 [00:00<00:00, 352.61it/s]
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:185 - Unifying the input dataset formats...
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:200 - There are 2728 sample(s) in the original dataset.
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:214 - 2728 samples left after filtering empty text.
2024-04-15 16:20:53 | INFO | data_juicer.format.mixture_formatter:136 - sampled 2728 from 2728
2024-04-15 16:20:53 | INFO | data_juicer.format.mixture_formatter:142 - There are 2728 in final dataset
2024-04-15 16:20:53 | INFO | data_juicer.core.analyser:81 - Preparing process operators...
2024-04-15 16:20:53 | INFO | data_juicer.utils.model_utils:102 - Loading fasttext language identification model...
Warning : load_model does not return WordVectorModel or SupervisedModel any more, but a FastText object which is very similar.
2024-04-15 16:20:53 | WARNING | data_juicer.ops.load:24 - This OP [perplexity_filter] is unavailable due to importing third-party requirements of this OP failure: ['sentencepiece', 'kenlm']. You can either run pip install -v -e .[sci] to install all requirements for all OPs, or run pip install sentencepiece kenlm with library version specified by environments/science_requires.txt to install libraries required by this OP. Data processing will skip this OP later.
2024-04-15 16:20:53 | INFO | data_juicer.core.analyser:86 - Computing the stats of dataset...
2024-04-15 16:20:53 | ERROR | __main__:13 - An error has been caught in function '<module>', process 'MainProcess' (84118), thread 'MainThread' (8040734208):
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1353, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3358, in _map_single
    example = apply_function_on_filtered_inputs(example, i, offset=offset)
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3261, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
    return f(*args, **kargs)
  File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
    return f(*args, **kargs)
  File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/ops/filter/language_id_score_filter.py", line 56, in compute_stats
    text = sample[self.text_key].lower().replace('\n', ' ')
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 280, in __getitem__
    value = self.data[key]
KeyError: 'text'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 13, in <module>
    main()
    └ <function main at 0x124c03dc0>

  File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 9, in main
    analyser.run()
    │        └ <function Analyser.run at 0x291876f70>
    └ <data_juicer.core.analyser.Analyser object at 0x10405c250>

  File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/analyser.py", line 100, in run
    dataset = dataset.map(op.compute_stats,
              │       │   │  └ <function LanguageIDScoreFilter.compute_stats at 0x291ebea60>
              │       │   └ <data_juicer.ops.filter.language_id_score_filter.LanguageIDScoreFilter object at 0x16d487d60>
              │       └ <function NestedDataset.map at 0x2918764c0>
              └ Dataset({
                    features: ['instruction', 'input', 'output', '__dj__stats__'],
                    num_rows: 2728
                })

  File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 180, in map
    new_ds = NestedDataset(super().map(*args, **kargs))
             │                          │       └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
             │                          └ [<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>]
             └ <class 'data_juicer.core.data.NestedDataset'>

  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
         │                                 │    │      │       └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
         │                                 │    │      └ (<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>,)
         │                                 │    └ Dataset({
         │                                 │          features: ['instruction', 'input', 'output', '__dj__stats__'],
         │                                 │          num_rows: 2728
         │                                 │      })
         │                                 └ <function Dataset.map at 0x13fad51f0>
         └ typing.Union

  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
         │                                 │    │      │       └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
         │                                 │    │      └ (<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>,)
         │                                 │    └ Dataset({
         │                                 │          features: ['instruction', 'input', 'output', '__dj__stats__'],
         │                                 │          num_rows: 2728
         │                                 │      })
         │                                 └ <function Dataset.map at 0x13fad5160>
         └ typing.Union

  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3097, in map
    for rank, done, content in iflatmap_unordered(
        │     │     │          └ <function iflatmap_unordered at 0x13f2780d0>
        │     │     └ 0
        │     └ False
        └ 3

  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in iflatmap_unordered
    [async_result.get() for async_result in async_results]
                                            └ [<multiprocess.pool.ApplyResult object at 0x291e2a670>, <multiprocess.pool.ApplyResult object at 0x291e2a790>, <multiprocess....

  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in <listcomp>
    [async_result.get() for async_result in async_results]
     │            │         └ <multiprocess.pool.ApplyResult object at 0x291e2a670>
     │            └ <function ApplyResult.get at 0x13f2764c0>
     └ <multiprocess.pool.ApplyResult object at 0x291e2a670>

  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/multiprocess/pool.py", line 771, in get
    raise self._value
          │    └ KeyError('text')
          └ <multiprocess.pool.ApplyResult object at 0x291e2a670>

KeyError: 'text'

Configs

No response

Logs

No response

Screenshots

No response

Additional

No response

HYLcool commented 5 months ago

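Hi @promisecc,

Thanks for your interest in data-juicer and for using it!

I noticed that the dataset you are analyzing contains the following three text fields: ['instruction', 'input', 'output']. Although you set text_keys to ['instruction', 'output'], the operator's text_key is still 'text'. Please check whether you set the text_key parameter to 'text' separately for the operator. If so, remove the text_key setting from the operator so that it inherits the global text_keys setting.

Also, if it is convenient, please share the contents of your config file. That would help us locate the problem more precisely.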
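
To make the mismatch concrete, here is a minimal sketch in plain Python of the check described above. It is not data-juicer's actual code: the helpers `resolve_text_key` and `find_missing_text_keys` are hypothetical names, and the rule that an operator without its own text_key falls back to the first entry of the global text_keys is an assumption made for illustration. The column names and operator arguments are taken from the log in this issue.

```python
from typing import Any, Dict, List


def resolve_text_key(op_args: Dict[str, Any], global_text_keys: List[str]) -> str:
    """Hypothetical helper: pick the field an operator would read.

    Assumption: an op-level 'text_key' overrides the global setting;
    otherwise the first entry of the global text_keys is used.
    """
    if op_args.get("text_key") is not None:
        return op_args["text_key"]
    return global_text_keys[0]


def find_missing_text_keys(
    columns: List[str],
    process: List[Dict[str, Dict[str, Any]]],
    global_text_keys: List[str],
) -> Dict[str, str]:
    """Return {op_name: text_key} for ops whose key is not a dataset column."""
    missing = {}
    for op in process:
        for op_name, op_args in op.items():
            key = resolve_text_key(op_args, global_text_keys)
            if key not in columns:
                missing[op_name] = key
    return missing


# Columns reported in the traceback (the internal stats column is omitted).
columns = ["instruction", "input", "output"]
global_text_keys = ["instruction", "output"]

# Op-level text_key pinned to 'text', as shown in the printed config table.
broken = [{"language_id_score_filter": {"lang": "zh", "min_score": 0.8, "text_key": "text"}}]
# Same op with the op-level text_key removed, so it inherits the global setting.
fixed = [{"language_id_score_filter": {"lang": "zh", "min_score": 0.8}}]

print(find_missing_text_keys(columns, broken, global_text_keys))
print(find_missing_text_keys(columns, fixed, global_text_keys))
```

Under these assumptions, the first call reports `language_id_score_filter` reading the nonexistent `text` column, which is the lookup that raises `KeyError: 'text'` in `compute_stats`; the second call reports nothing missing.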

HYLcool commented 5 months ago

Closed by PR #300, fixed by @shiweijiezero. Thanks! 👍🏻