A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Before Reporting
[X] I have pulled the latest code of the main branch and run it again, and the bug still exists.
[X] I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you ask a question using the Question template)
Search before reporting
[X] I have searched the Data-Juicer issues and found no similar bugs.
OS
Ubuntu 20
Installation Method
source
Data-Juicer Version
latest
Python Version
3.10
Describe the bug
Installing with pip install -v -e .[sci] on Python 3.10 completed without problems, but running python tools/process_data.py --config configs/demo/process.yaml produces the error below. Uninstalling NumPy 2.0 and switching to 1.26 did not fix it.
A module that was compiled using NumPy 1.x cannot be run in NumPy 2.0.2 as it may crash. To support both 1.x and 2.x versions of NumPy, modules must be compiled with NumPy 2.0. Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.
If you are a user of the module, the easiest solution will be to downgrade to 'numpy<2' or try to upgrade the affected module. We expect that some modules will need time to support NumPy 2.
Traceback (most recent call last):
  File "/home/likang/angang_data_clean/data-juicer-main/tools/process_data.py", line 3, in <module>
    from data_juicer.config import init_configs
  File "/home/likang/angang_data_clean/data-juicer-main/data_juicer/config/__init__.py", line 1, in <module>
    from .config import (export_config, get_init_configs, init_configs,
  File "/home/likang/angang_data_clean/data-juicer-main/data_juicer/config/config.py", line 17, in <module>
    from data_juicer.ops.base_op import OPERATORS
  File "/home/likang/angang_data_clean/data-juicer-main/data_juicer/ops/__init__.py", line 1, in <module>
    from . import deduplicator, filter, mapper, selector
  File "/home/likang/angang_data_clean/data-juicer-main/data_juicer/ops/deduplicator/__init__.py", line 1, in <module>
    from . import (document_deduplicator, document_minhash_deduplicator,
  File "/home/likang/angang_data_clean/data-juicer-main/data_juicer/ops/deduplicator/document_deduplicator.py", line 14, in <module>
    from ..base_op import OPERATORS, Deduplicator
  File "/home/likang/angang_data_clean/data-juicer-main/data_juicer/ops/base_op.py", line 5, in <module>
    import pyarrow as pa
  File "/home/likang/miniconda3/envs/datajuicer/lib/python3.10/site-packages/pyarrow/__init__.py", line 65, in <module>
    import pyarrow.lib as _lib
AttributeError: _ARRAY_API not found
Traceback (most recent call last):
  File "/home/likang/angang_data_clean/data-juicer-main/tools/process_data.py", line 3, in <module>
    from data_juicer.config import init_configs
  File "/home/likang/angang_data_clean/data-juicer-main/data_juicer/config/__init__.py", line 1, in <module>
    from .config import (export_config, get_init_configs, init_configs,
  File "/home/likang/angang_data_clean/data-juicer-main/data_juicer/config/config.py", line 17, in <module>
    from data_juicer.ops.base_op import OPERATORS
  File "/home/likang/angang_data_clean/data-juicer-main/data_juicer/ops/__init__.py", line 1, in <module>
    from . import deduplicator, filter, mapper, selector
  File "/home/likang/angang_data_clean/data-juicer-main/data_juicer/ops/deduplicator/__init__.py", line 1, in <module>
    from . import (document_deduplicator, document_minhash_deduplicator,
  File "/home/likang/angang_data_clean/data-juicer-main/data_juicer/ops/deduplicator/document_deduplicator.py", line 14, in <module>
    from ..base_op import OPERATORS, Deduplicator
  File "/home/likang/angang_data_clean/data-juicer-main/data_juicer/ops/base_op.py", line 5, in <module>
    import pyarrow as pa
  File "/home/likang/miniconda3/envs/datajuicer/lib/python3.10/site-packages/pyarrow/__init__.py", line 65, in <module>
    import pyarrow.lib as _lib
  File "pyarrow/lib.pyx", line 36, in init pyarrow.lib
ImportError: numpy.core.multiarray failed to import
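Since importing pyarrow is what crashes here, it may help to read the installed versions from package metadata instead of importing the packages. The following is only a minimal sketch: the package list and the helper name installed_versions are assumptions for illustration, not part of Data-Juicer.

```python
from importlib import metadata

# Packages that appear in the tracebacks of this report (illustrative list).
PACKAGES = ("numpy", "pyarrow", "torch", "vllm")


def installed_versions(packages=PACKAGES):
    """Return {package: version or None} without importing the packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this inside the datajuicer conda environment should show whether NumPy 2.x is still the version the interpreter sees after a downgrade.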
To Reproduce
Installing with pip install -v -e .[sci] on Python 3.10 completed without problems, but running python tools/process_data.py --config configs/demo/process.yaml failed with the NumPy error reported above. After uninstalling NumPy 2.0 and switching to 1.26 it still fails, and the error changes to ImportError: /home/likang/miniconda3/envs/datajuicer/lib/python3.10/site-packages/vllm/_C.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi
Configs
CUDA: print(torch.__version__) reports 2.4.0+cu121; GPUs: 8x A40
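Compiled packages such as vllm are typically built against one specific PyTorch release, so it can be worth comparing the torch version above with what the installed vllm declares. A minimal sketch, assuming the pin is recorded in the package metadata; the helper names filter_requirements and declared_pins are hypothetical.

```python
import re
from importlib import metadata


def filter_requirements(requirements, dependency):
    """Keep only the requirement strings that name `dependency` exactly."""
    # The lookahead rejects longer names such as "torchvision" or "torch-foo".
    pattern = re.compile(
        rf"^{re.escape(dependency)}(?![A-Za-z0-9._-])", re.IGNORECASE
    )
    return [req for req in requirements if pattern.match(req)]


def declared_pins(package, dependency):
    """Return the requirement lines of `package` that mention `dependency`."""
    try:
        requirements = metadata.requires(package) or []
    except metadata.PackageNotFoundError:
        return []
    return filter_requirements(requirements, dependency)


if __name__ == "__main__":
    # Prints an empty list if vllm is not installed or declares no torch pin.
    print(declared_pins("vllm", "torch"))
```

If the printed requirement does not match 2.4.0+cu121, that mismatch would be a candidate explanation for the vllm import failure, though this check alone does not prove it.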
Logs
python tools/process_data.py --config configs/demo/process.yaml
Traceback (most recent call last):
  File "/home/likang/angang_data_clean/data-juicer-main/tools/process_data.py", line 3, in <module>
    from data_juicer.config import init_configs
  File "/home/likang/angang_data_clean/data-juicer-main/data_juicer/config/__init__.py", line 1, in <module>
    from .config import (export_config, get_init_configs, init_configs,
  File "/home/likang/angang_data_clean/data-juicer-main/data_juicer/config/config.py", line 17, in <module>
    from data_juicer.ops.base_op import OPERATORS
  File "/home/likang/angang_data_clean/data-juicer-main/data_juicer/ops/__init__.py", line 1, in <module>
    from . import deduplicator, filter, mapper, selector
  File "/home/likang/angang_data_clean/data-juicer-main/data_juicer/ops/filter/__init__.py", line 2, in <module>
    from . import (alphanumeric_filter, audio_duration_filter,
  File "/home/likang/angang_data_clean/data-juicer-main/data_juicer/ops/filter/video_tagging_from_frames_filter.py", line 8, in <module>
    from ..mapper.video_tagging_from_frames_mapper import \
  File "/home/likang/angang_data_clean/data-juicer-main/data_juicer/ops/mapper/__init__.py", line 2, in <module>
    from . import (audio_ffmpeg_wrapped_mapper, chinese_convert_mapper,
  File "/home/likang/angang_data_clean/data-juicer-main/data_juicer/ops/mapper/extract_qa_mapper.py", line 16, in <module>
    import vllm # noqa: F401
  File "/home/likang/miniconda3/envs/datajuicer/lib/python3.10/site-packages/vllm/__init__.py", line 3, in <module>
    from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
  File "/home/likang/miniconda3/envs/datajuicer/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 6, in <module>
    from vllm.config import (CacheConfig, ModelConfig, ParallelConfig,
  File "/home/likang/miniconda3/envs/datajuicer/lib/python3.10/site-packages/vllm/config.py", line 9, in <module>
    from vllm.utils import get_cpu_memory, is_hip
  File "/home/likang/miniconda3/envs/datajuicer/lib/python3.10/site-packages/vllm/utils.py", line 8, in <module>
    from vllm._C import cuda_utils
ImportError: /home/likang/miniconda3/envs/datajuicer/lib/python3.10/site-packages/vllm/_C.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi
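The undefined symbol is a mangled C++ name. Under the Itanium C++ ABI it reads as c10::cuda::SetDevice(int), and c10 is the namespace of PyTorch's core library, so one plausible reading is that the vllm binary was built against a different PyTorch than the one installed. The sketch below decodes only this simple class of symbols; demangle_simple is a hypothetical helper written for illustration, and a real tool such as c++filt handles the full grammar.

```python
import re

# Subset of Itanium C++ ABI builtin type codes; enough for this symbol.
BUILTIN_TYPES = {
    "v": "void", "b": "bool", "c": "char", "a": "signed char",
    "i": "int", "j": "unsigned int", "l": "long", "m": "unsigned long",
    "f": "float", "d": "double",
}


def demangle_simple(symbol):
    """Demangle names of the form _ZN<len><id>...E<builtin parameter codes>."""
    if not symbol.startswith("_ZN"):
        raise ValueError("this sketch only handles _ZN...E nested names")
    pos = 3
    parts = []
    # Each component is a decimal length followed by that many characters.
    while pos < len(symbol) and symbol[pos] != "E":
        match = re.match(r"\d+", symbol[pos:])
        if match is None:
            raise ValueError(f"unsupported construct at offset {pos}")
        length = int(match.group())
        pos += match.end()
        parts.append(symbol[pos:pos + length])
        pos += length
    if pos >= len(symbol):
        raise ValueError("missing terminating 'E'")
    try:
        params = [BUILTIN_TYPES[code] for code in symbol[pos + 1:]]
    except KeyError as exc:
        raise ValueError(f"unsupported parameter type code {exc}") from exc
    if params == ["void"]:
        params = []  # a lone 'v' means an empty parameter list
    return "::".join(parts) + "(" + ", ".join(params) + ")"


print(demangle_simple("_ZN3c104cuda9SetDeviceEi"))  # c10::cuda::SetDevice(int)
```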
Screenshots
No response
Additional
No response