A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Before Reporting
[X] I have pulled the latest code of main branch to run again and the bug still existed.
[X] I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you ask a question using the Question template.)
Search before reporting
[X] I have searched the Data-Juicer issues and found no similar bugs.
OS
ubuntu
Installation Method
source
Data-Juicer Version
v0.2.0
Python Version
3.9.19
Describe the bug
Running python -m tests.core.test_adapter raises errors.
To Reproduce
Run python -m tests.core.test_adapter from the project root.
An error occurs; after some debugging it seems to come from this code in Filter.run:
dataset = dataset.map(
    add_same_content_to_new_column,
    fn_kwargs={
        'new_column_name': Fields.stats,
        'initial_value': {}
    },
    num_proc=self.runtime_np(),
    batch_size=self.batch_size,
    desc='Adding new column for stats')
The initial_value here is the problem: it is a single empty dict, so in the PerplexityFilter op's compute_stats the loop for idx, stat in enumerate(samples_stats) is never entered and no computation happens, which later fails with KeyError: 'perplexity'. Following other test examples, I replaced 'initial_value': {} with 'initial_value': [{}] * dataset.num_rows (note: I am not sure whether multiplying by num_rows is needed). After that, the PerplexityFilter op no longer fails.
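The failure mode described above can be sketched without the datasets library. Here compute_stats is a hypothetical stand-in for the stats loop in PerplexityFilter, not the actual Data-Juicer code; the point is that iterating over a single empty dict yields nothing, while a list with one dict per row enters the loop once per sample:

```python
# Hypothetical stand-in for the stats loop in PerplexityFilter.compute_stats:
# it writes a perplexity value into each per-sample stats dict.
def compute_stats(samples_stats):
    for idx, stat in enumerate(samples_stats):
        stat['perplexity'] = 42.0  # placeholder for the real computation
    return samples_stats

# With 'initial_value': {} the batch-level stats column is one empty dict;
# enumerate({}) yields nothing, so no perplexity is ever written.
broken = compute_stats({})
assert 'perplexity' not in broken

# With one dict per row (the fix described above) the loop runs per sample.
# Note: [{}] * n would share a single dict object across all rows; a list
# comprehension creates an independent dict for each row.
num_rows = 3
fixed = compute_stats([{} for _ in range(num_rows)])
assert all('perplexity' in s for s in fixed)
```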
Continuing the run, the PerplexityFilter op no longer fails, but the DocumentDeduplicator op does, with roughly:
File "/root/data-juicer/data-juicer/data_juicer/ops/deduplicator/document_deduplicator.py", line 63, in _get_hash
return hashlib.md5(txt.strip().encode('utf-8')).hexdigest()
AttributeError: 'list' object has no attribute 'strip'
Looking at the code, this happens because the preceding FixUnicodeMapper op writes its output as a list:
samples[self.text_key] = list(
    map(lambda text: ftfy.fix_text(text, normalization=self.normalization),
        samples[self.text_key]))
Since samples[self.text_key] is now a list, DocumentDeduplicator fails when _get_hash processes it.
Looking at the other mapper ops, the samples[self.text_key] they produce seems to take many forms: lists, dicts, and strings all appear, yet strip only works on strings. Is the compatibility between these ops handled well enough? Do other ops have similar problems?
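A minimal sketch of a defensive hash helper illustrating the mismatch above. This is a hypothetical variant, not the actual DocumentDeduplicator._get_hash: it accepts either a single string or a batched list of strings, which is one way such ops could tolerate both shapes:

```python
import hashlib

# Hypothetical defensive variant of a _get_hash-style helper: handles both
# a single text string and a batched list of strings.
def get_hash(text):
    if isinstance(text, list):  # batched mapper output, one string per sample
        return [hashlib.md5(t.strip().encode('utf-8')).hexdigest()
                for t in text]
    return hashlib.md5(text.strip().encode('utf-8')).hexdigest()

print(get_hash(' hello '))                 # single-sample form
print(get_hash([' hello ', ' world ']))    # batched form, one hash per sample
```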
Could you please help look into this when you have time? Thanks!
Configs
No response
Logs
Traceback (most recent call last):
File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/data-juicer/data-juicer/data_juicer/core/monitor.py", line 17, in resource_monitor
if mdict['stop']:
File "", line 2, in getitem
File "/usr/local/lib/python3.9/multiprocessing/managers.py", line 809, in _callmethod
conn.send((self._id, methodname, args, kwds))
File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 411, in _send_bytes
self._send(header + buf)
File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
ERROR: test_execute_and_probe (__main__.AdapterTest)
Traceback (most recent call last):
File "/root/data-juicer/data-juicer/tests/core/test_adapter.py", line 126, in test_execute_and_probe
resource_util_list = Adapter.execute_and_probe(ds, ops)
File "/root/data-juicer/data-juicer/data_juicer/core/adapter.py", line 42, in execute_and_probe
dataset, resource_util_per_op = Monitor.monitor_func(
File "/root/data-juicer/data-juicer/data_juicer/core/monitor.py", line 201, in monitor_func
ret = func()
File "/root/data-juicer/data-juicer/data_juicer/ops/base_op.py", line 318, in run
new_dataset = dataset.filter(self.process,
File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/datasets/fingerprint.py", line 482, in wrapper
out = func(dataset, *args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3709, in filter
indices = self.map(
File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 602, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3156, in map
for rank, done, content in Dataset._map_single(**dataset_kwargs):
File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3547, in _map_single
batch = apply_function_on_filtered_inputs(
File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3416, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 6477, in get_indices_from_mask_function
mask.append(function(example, *additional_args, **fn_kwargs))
File "/root/data-juicer/data-juicer/data_juicer/core/data.py", line 72, in wrapped_f
return f(*args, **kargs)
File "/root/data-juicer/data-juicer/data_juicer/ops/filter/perplexity_filter.py", line 87, in process
return samples[Fields.stats][StatsKeys.perplexity] <= self.max_ppl
KeyError: 'perplexity'
Screenshots
No response
Additional
No response