modelscope / data-juicer

A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Apache License 2.0

[Bug]: #441

Open FailedNamed opened 1 week ago

FailedNamed commented 1 week ago

Before Reporting

Search before reporting

OS

ubuntu

Installation Method

source

Data-Juicer Version

v0.2.0

Python Version

3.9.19

Describe the bug

Running `python -m tests.core.test_adapter` raises errors.

To Reproduce

  1. Run `python -m tests.core.test_adapter` from the project root.
  2. An error occurs. After some debugging, the problem appears to be in this code in `Filter.run`: `dataset = dataset.map(add_same_content_to_new_column, fn_kwargs={'new_column_name': Fields.stats, 'initial_value': {}}, num_proc=self.runtime_np(), batch_size=self.batch_size, desc='Adding new column for stats')`. Here `initial_value` is an empty dict, so in `PerplexityFilter.compute_stats` the loop `for idx, stat in enumerate(samples_stats)` is never entered and no stats are computed, which later raises `KeyError: 'perplexity'`. Following the other test examples, I replaced `'initial_value': {}` with `'initial_value': [{}] * dataset.num_rows` (ps: not sure whether multiplying by `num_rows` is needed; see the first sketch after this list), and the `PerplexityFilter` op stopped erroring.
  3. Continuing, `PerplexityFilter` no longer errors, but the `DocumentDeduplicator` op fails, roughly with: `File "/root/data-juicer/data-juicer/data_juicer/ops/deduplicator/document_deduplicator.py", line 63, in _get_hash return hashlib.md5(txt.strip().encode('utf-8')).hexdigest() AttributeError: 'list' object has no attribute 'strip'`. Looking at the code, this is because the preceding `FixUnicodeMapper` op leaves `samples[self.text_key]` as a list after running `samples[self.text_key] = list(map(lambda text: ftfy.fix_text(text, normalization=self.normalization), samples[self.text_key]))`, which then breaks `DocumentDeduplicator` when it calls `_get_hash` (see the second sketch after this list). Looking at the other mapper ops, the `samples[self.text_key]` they output comes in many shapes: lists, dicts, and strings, but `strip` only supports strings. Is the compatibility between these ops handled well enough, and do other ops have similar problems?
  4. I'd appreciate it if you could take a look when you have time, thanks!
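To illustrate item 2: a minimal sketch of the suspected failure mode, using Hugging Face `datasets` directly rather than Data-Juicer's actual `Filter.run`. The column name `'stats'` and the helper `add_stats_column` are stand-ins for `Fields.stats` and `add_same_content_to_new_column`, and this is one plausible reading of the bug, not the project's confirmed fix. In a batched `map`, the function receives each column as a list with one entry per sample, so the new column's value must also be a per-sample list; a bare `{}` leaves nothing to iterate over in `compute_stats`. Sizing the list per batch also sidesteps the `num_rows` question from item 2, since `[{}] * dataset.num_rows` only matches when the whole dataset fits in a single batch.

```python
# Illustrative sketch, not Data-Juicer code. 'stats' stands in for
# Fields.stats; add_stats_column stands in for add_same_content_to_new_column.
from datasets import Dataset

ds = Dataset.from_dict({'text': ['hello', 'world']})

def add_stats_column(batch, initial_value=None):
    # In batched mode, `batch` maps column names to lists (one entry per
    # sample). Assigning a bare {} makes the whole column a single empty
    # dict, so `for idx, stat in enumerate(batch['stats'])` yields nothing.
    if initial_value is None:
        # Robust variant: one fresh dict per sample in *this batch*.
        initial_value = [{} for _ in range(len(batch['text']))]
    batch['stats'] = initial_value
    return batch

# Suspected broken shape (commented out; the stats column is not a
# per-sample list, so downstream per-sample iteration finds nothing):
# ds.map(add_stats_column, batched=True, fn_kwargs={'initial_value': {}})

fixed = ds.map(add_stats_column, batched=True)
print(fixed[0])  # {'text': 'hello', 'stats': {}}
```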
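And for item 3: a small self-contained repro of the type mismatch, again outside Data-Juicer. The `_get_hash` below only mirrors the per-sample hashing the traceback points at, and `batch_text` stands in for what a batched mapper leaves in `samples[self.text_key]`.

```python
# Illustrative repro of the AttributeError from item 3, not Data-Juicer code.
import hashlib

def _get_hash(txt):
    # Per-sample hashing: expects `txt` to be a single string.
    return hashlib.md5(txt.strip().encode('utf-8')).hexdigest()

batch_text = ['foo', 'bar']  # a batched mapper's per-batch list of texts
# _get_hash(batch_text)      # AttributeError: 'list' object has no attribute 'strip'
for text in batch_text:      # per-sample calls are what the deduplicator expects
    print(_get_hash(text))
```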

Configs

No response

Logs

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/data-juicer/data-juicer/data_juicer/core/monitor.py", line 17, in resource_monitor
    if mdict['stop']:
  File "<string>", line 2, in __getitem__
  File "/usr/local/lib/python3.9/multiprocessing/managers.py", line 809, in _callmethod
    conn.send((self._id, methodname, args, kwds))
  File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

ERROR: test_execute_and_probe (__main__.AdapterTest)

Traceback (most recent call last):
  File "/root/data-juicer/data-juicer/tests/core/test_adapter.py", line 126, in test_execute_and_probe
    resource_util_list = Adapter.execute_and_probe(ds, ops)
  File "/root/data-juicer/data-juicer/data_juicer/core/adapter.py", line 42, in execute_and_probe
    dataset, resource_util_per_op = Monitor.monitor_func(
  File "/root/data-juicer/data-juicer/data_juicer/core/monitor.py", line 201, in monitor_func
    ret = func()
  File "/root/data-juicer/data-juicer/data_juicer/ops/base_op.py", line 318, in run
    new_dataset = dataset.filter(self.process,
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/datasets/fingerprint.py", line 482, in wrapper
    out = func(dataset, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3709, in filter
    indices = self.map(
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 602, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3156, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3547, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3416, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 6477, in get_indices_from_mask_function
    mask.append(function(example, *additional_args, **fn_kwargs))
  File "/root/data-juicer/data-juicer/data_juicer/core/data.py", line 72, in wrapped_f
    return f(*args, **kargs)
  File "/root/data-juicer/data-juicer/data_juicer/ops/filter/perplexity_filter.py", line 87, in process
    return samples[Fields.stats][StatsKeys.perplexity] <= self.max_ppl
KeyError: 'perplexity'

Screenshots

No response

Additional

No response