modelscope / data-juicer

A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Apache License 2.0

[Bug]: OverflowError: Python int too large to convert to C long #168

Closed simplew2011 closed 9 months ago

simplew2011 commented 9 months ago

Before Reporting

Search before reporting

OS

ubuntu

Installation Method

pip

Data-Juicer Version

0.1.2

Python Version

3.8

Describe the bug

Dataset: https://atp-modelzoo.oss-cn-hangzhou.aliyuncs.com/release/datasets/WuDaoCorpus2.0_base_sample.tgz

An error is raised when the document_simhash_deduplicator and nlpcda_zh_mapper operators are used together.

To Reproduce

dj-process --config configs/demo/process.yaml

Configs

# Process config example for dataset

# global parameters
project_name: 'demo-process'
dataset_path: 'temp/WuDaoCorpus2.0_base_sample'  # path to your dataset directory or file
np: 1  # number of subprocesses used to process your dataset
text_keys: 'content'
export_path: './outputs/demo-process/demo-processed.jsonl'

# process schedule
# a list of several process operators with their arguments
process:
  - language_id_score_filter:
      lang: 'zh'
      min_score: 0.8

  - document_simhash_deduplicator:                          # deduplicate text samples using SimHash-LSH method
      tokenization: character                                     # tokenization method for text. One of [space, punctuation, character]
      window_size: 6                                          # window size of shingling
      num_blocks: 10                                           # number of blocks in SimHash computing
      hamming_distance: 8                                     # the max hamming distance to regard 2 samples as similar enough pair. Should be less than num_blocks always

  - nlpcda_zh_mapper:                                       # simply augment texts in Chinese based on the nlpcda library
      sequential: false                                       # whether combine all augmentation methods to a sequence. If it's True, a sample will be augmented by all opened augmentation methods sequentially. If it's False, each opened augmentation method would generate its augmented samples independently.
      aug_num: 1                                              # number of augmented samples to be generated. If `sequential` is True, there will be total aug_num augmented samples generated. If it's False, there will be (aug_num * #opened_aug_method) augmented samples generated.
      swap_random_char: true 

Logs

  File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/executor.py", line 120, in run
    tmp = dataset.map(function=op.process,
  File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/data.py", line 180, in map
    new_ds = NestedDataset(super().map(*args, **kargs))
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3004, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3397, in _map_single
    writer.write_batch(batch)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_writer.py", line 551, in write_batch
    arrays.append(pa.array(typed_sequence))
  File "pyarrow/array.pxi", line 243, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_writer.py", line 189, in __arrow_array__
    out = pa.array(cast_to_python_objects(data, only_1d_for_numpy=True))
  File "pyarrow/array.pxi", line 327, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
OverflowError: Python int too large to convert to C long

Screenshots

No response

Additional

This is probably related to the SimHash value computation and Arrow.
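
To illustrate why a 64-bit SimHash value can trigger this kind of overflow, here is a minimal sketch. It does not call pyarrow or Data-Juicer; it uses the standard-library `struct` module as a stand-in for a signed 64-bit C integer conversion, and the `fingerprint` value and the `fits_signed_64` helper are hypothetical names made up for the example.

```python
import struct

INT64_MAX = 2**63 - 1  # largest value a signed 64-bit C integer can hold


def fits_signed_64(value: int) -> bool:
    """Return True if `value` can be packed as a signed 64-bit integer."""
    try:
        struct.pack("<q", value)  # "q" = signed 64-bit
        return True
    except struct.error:
        return False


# Hypothetical unsigned 64-bit fingerprint with the top bit set.
fingerprint = 0xF000_0000_0000_0001

print(fits_signed_64(INT64_MAX))    # True
print(fits_signed_64(fingerprint))  # False: exceeds the signed range

# The same value is fine as an unsigned 64-bit integer ("Q").
packed = struct.pack("<Q", fingerprint)
print(len(packed))                  # 8
```

Any unsigned 64-bit hash with its top bit set falls outside the signed range, so whether a given run fails would depend on the values present in the data.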

pip list ``` about-time 4.2.1 accelerate 0.25.0 ago 0.0.95 aiofiles 23.2.1 aiohttp 3.8.6 aiosignal 1.3.1 alabaster 0.7.13 albumentations 1.3.1 alive-progress 3.1.4 altair 5.1.2 antlr4-python3-runtime 4.9.3 anyio 3.7.1 appdirs 1.4.4 APScheduler 3.9.1 argcomplete 1.10.3 argos-translate-files 1.1.4 argostranslate 1.9.1 arxiv 2.0.0 arxiv-dl 1.1.5 arXiv-download 0.1 astunparse 1.6.3 async-timeout 4.0.3 attrs 23.1.0 Automat 22.10.0 Babel 2.13.1 backports.zoneinfo 0.2.1 ballpark 1.4.0 beautifulsoup4 4.9.3 bert4torch 0.4.0 blinker 1.6.3 blis 0.7.11 boto3 1.28.73 botocore 1.31.73 bs4 0.0.1 cachelib 0.10.2 cachetools 5.3.2 catalogue 2.0.10 certifi 2022.12.7 cffi 1.16.0 cfgv 3.4.0 chardet 3.0.4 charset-normalizer 2.1.1 click 8.1.7 cloudpickle 3.0.0 cmake 3.25.0 colorama 0.4.6 coloredlogs 15.0.1 colorlog 6.7.0 commonmark 0.9.1 compressed-rtf 1.0.6 confection 0.1.3 constantly 23.10.4 contourpy 1.1.1 courlan 0.9.4 cryptography 41.0.5 cssselect 1.2.0 ctranslate2 3.20.0 cycler 0.12.1 cymem 2.0.8 Cython 3.0.6 dashscope 1.10.0 datasets 2.11.0 datasketch 1.6.4 dateparser 1.1.8 deep-translator 1.11.4 defusedxml 0.7.1 Deprecated 1.2.14 dill 0.3.4 distlib 0.3.7 distro 1.8.0 dl-translate 0.3.0 docker-pycreds 0.4.0 docopt 0.6.2 docstring-parser 0.15 docutils 0.18.1 docx2txt 0.8 dotmap 1.3.30 ebcdic 1.1.1 elastic-transport 8.10.0 elasticsearch 8.10.1 emoji 2.2.0 environs 9.5.0 et-xmlfile 1.1.0 exceptiongroup 1.1.3 expiringdict 1.2.2 extract-msg 0.28.7 fake-useragent 1.3.0 fastapi 0.105.0 fasttext-wheel 0.9.2 faust-cchardet 2.1.19 feedfinder2 0.0.4 feedparser 6.0.10 ffmpy 0.3.1 filelock 3.12.4 fire 0.5.0 flagdata 1.0.0 Flask 2.2.2 flask-babel 3.1.0 Flask-Limiter 2.6.3 Flask-Session 0.4.0 flask-swagger 0.2.14 flask-swagger-ui 4.11.1 flatbuffers 23.5.26 fonttools 4.43.1 frozenlist 1.4.0 fsspec 2023.3.0 ftfy 6.1.1 gdown 4.7.1 gevent 23.9.1 ghp-import 2.1.0 gitdb 4.0.10 GitPython 3.1.40 gne 0.3.0 google-trans-new 1.1.9 googletrans 4.0.0rc1 GPUtil 1.4.0 gradio 3.50.2 gradio_client 0.6.1 grapheme 
0.6.0 greenlet 3.0.1 grpcio 1.59.2 h11 0.9.0 h2 3.2.0 h5py 3.10.0 hanziconv 0.3.2 hjson 3.1.0 hpack 3.0.0 hstspreload 2023.1.1 html5tagger 1.3.0 htmldate 1.5.2 httpcore 0.9.1 httptools 0.6.1 httpx 0.13.3 huggingface-hub 0.17.3 humanfriendly 10.0 hurry.filesize 0.9 hydra-core 1.3.2 hyperframe 5.2.0 hyperlink 21.0.0 identify 2.5.30 idna 2.10 imagededup 0.3.2 imageio 2.31.6 imagesize 1.4.1 IMAPClient 2.1.0 importlib-metadata 6.8.0 importlib-resources 6.1.0 incremental 22.10.0 install 1.3.5 itemadapter 0.8.0 itemloaders 1.1.0 itsdangerous 2.1.2 jieba 0.42.1 jieba3k 0.35.1 Jinja2 3.1.2 jiojio 1.2.5 jionlp 1.5.4 jmespath 1.0.1 joblib 1.3.2 jsonargparse 4.27.1 jsonlines 4.0.0 jsonschema 4.19.1 jsonschema-specifications 2023.7.1 jusText 3.0.0 kenlm 0.2.0 Keras 2.3.1 Keras-Applications 1.0.8 Keras-Preprocessing 1.1.2 kiwisolver 1.4.5 langcodes 3.3.0 langdetect 1.0.9 langid 1.1.6 lazy_loader 0.3 Levenshtein 0.23.0 LexiLang 1.0.1 libretranslate 1.5.2 libretranslatepy 2.1.3 lightning 2.1.0 lightning-utilities 0.9.0 limits 3.7.0 lingua-language-detector 2.0.0 linkify-it-py 2.0.2 lit 15.0.7 livereload 2.6.3 llvmlite 0.41.1 loguru 0.7.2 lxml 4.9.3 lz4 4.3.2 Markdown 3.5.1 markdown-it-py 3.0.0 markdown2 2.4.11 MarkupSafe 2.1.2 marshmallow 3.20.1 matplotlib 3.7.3 mdit-py-plugins 0.4.0 mdurl 0.1.2 memray 1.11.0 mergedeep 1.3.4 mkdocs 1.5.3 mkdocs-material-extensions 1.3.1 mlscraper 0.1.2 more-itertools 10.1.0 Morfessor 2.0.6 mpmath 1.3.0 msgpack 1.0.7 multidict 6.0.4 multiprocess 0.70.12 munch 4.0.0 murmurhash 1.0.10 networkx 3.0 news-please 1.5.35 newspaper3k 0.2.8 newspaper3kli 0.1.0 nh3 0.2.15 nicefid 2.1.1 nlpaug 1.1.11 nlpcda 2.5.8 nltk 3.8.1 nodeenv 1.8.0 Nuitka 2.0rc6 numba 0.58.1 numpy 1.24.1 nvidia-cublas-cu11 11.10.3.66 nvidia-cuda-cupti-cu11 11.7.101 nvidia-cuda-nvrtc-cu11 11.7.99 nvidia-cuda-runtime-cu11 11.7.99 nvidia-cudnn-cu11 8.5.0.96 nvidia-cufft-cu11 10.9.0.58 nvidia-curand-cu11 10.2.10.91 nvidia-cusolver-cu11 11.4.0.1 nvidia-cusparse-cu11 11.7.4.91 
nvidia-nccl-cu11 2.14.3 nvidia-nvtx-cu11 11.7.91 olefile 0.46 omegaconf 2.3.0 onnx 1.15.0 onnxruntime 1.16.3 onnxsim 0.4.35 openai 0.28.0 OpenCC 1.1.6 opencv-python-headless 4.8.1.78 openpyxl 3.1.2 ordered-set 4.1.0 orjson 3.9.10 outcome 1.3.0.post0 packaging 23.1 paginate 0.5.6 pandas 2.0.0 parse 1.19.1 parsel 1.8.1 pathos 0.3.1 pathspec 0.12.1 pathtools 0.1.2 pathy 0.10.3 patsy 0.5.3 pdfmajor 1.3.13 pdfminer.six 20221105 pdfplumber 0.10.2 pdfsyntax 0.0.7 pdfx 1.4.1 peft 0.7.0 Pillow 9.3.0 pip 23.3.2 pkgutil_resolve_name 1.3.10 plac 1.4.1 platformdirs 3.11.0 plotly 5.18.0 polib 1.1.1 polyglot 16.7.4 pox 0.3.3 ppft 1.7.6.7 pre-commit 3.5.0 preshed 3.0.9 prettytable 3.9.0 prometheus-client 0.15.0 prompt-toolkit 3.0.41 Protego 0.3.0 protobuf 4.24.4 psutil 5.9.6 psycopg2-binary 2.9.9 py-data-juicer 0.1.2 /home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages py-spy 0.3.14 py4j 0.10.9.7 pyarrow 12.0.0 pyasn1 0.5.0 pyasn1-modules 0.3.0 pybind11 2.11.1 pycld2 0.41 pycorrector 1.0.0 pycparser 2.21 pycryptodome 3.8.2 pydantic 1.10.13 pydeck 0.8.1b0 PyDispatcher 2.0.7 pydub 0.25.1 pyee 8.2.2 PyExecJS 1.5.1 pyfreeproxy 0.1.4 Pygments 2.16.1 pygoogletranslation 2.0.6 pyhostman 0.1.3 PyICU 2.12 pymdown-extensions 10.5 PyMySQL 1.1.0 pynvml 11.4.1 pyOpenSSL 23.3.0 pypandoc 1.12 pyparsing 3.1.1 pypdf 3.17.0 PyPDF2 3.0.1 pypdfium2 4.22.0 pyphen 0.14.0 pypinyin 0.49.0 pyppeteer 1.0.2 pyquery 2.0.0 PySocks 1.7.1 pyspark 3.5.0 python-dateutil 2.8.2 python-docx 1.0.1 python-dotenv 1.0.0 python-hosts 1.0.5 python-Levenshtein 0.23.0 python-multipart 0.0.6 python-pptx 0.6.22 pytorch-lightning 2.0.6 pytz 2023.3.post1 PyWavelets 1.4.1 PyYAML 6.0.1 pyyaml_env_tag 0.1 qudida 0.0.4 queuelib 1.6.2 rapidfuzz 3.4.0 ray 2.9.0 readability 0.3.1 readability-lxml 0.8.1 recommonmark 0.7.1 redis 4.3.4 referencing 0.30.2 regex 2023.10.3 requests 2.28.1 requests-file 1.5.1 requests-html 0.10.0 resize-right 0.0.2 responses 0.18.0 rfc3986 1.5.0 rich 12.6.0 rjieba 0.1.11 roformer 0.4.3 rpds-py 
0.10.6 ruamel.yaml 0.18.3 ruamel.yaml.clib 0.2.8 s3transfer 0.7.0 sacremoses 0.0.53 safetensors 0.4.0 sanic 23.6.0 sanic-routing 23.6.0 scalene 1.5.31.1 schedule 1.2.1 scikit-image 0.21.0 scikit-learn 1.3.2 scipdf 0.1.dev0 scipy 1.10.1 sconf 0.2.5 scrapeasy 0.12 Scrapy 2.11.0 selectolax 0.3.17 selenium 4.14.0 semantic-version 2.10.0 sentencepiece 0.1.99 sentry-sdk 1.32.0 service-identity 23.1.0 setproctitle 1.3.3 setuptools 68.0.0 sgmllib3k 1.0.0 shortuuid 1.0.11 simhash-py 0.4.0 six 1.12.0 smart-open 6.4.0 smmap 5.0.1 sniffio 1.3.0 snowballstemmer 2.2.0 sortedcontainers 2.4.0 soupsieve 2.5 spacy 3.5.0 spacy-legacy 3.0.12 spacy-loggers 1.0.5 spacy-pkuseg 0.0.33 SpeechRecognition 3.8.1 Sphinx 7.1.2 sphinx-autobuild 2021.3.14 sphinx-rtd-theme 1.3.0 sphinxcontrib-applehelp 1.0.4 sphinxcontrib-devhelp 1.0.2 sphinxcontrib-htmlhelp 2.0.1 sphinxcontrib-jquery 4.1 sphinxcontrib-jsmath 1.0.1 sphinxcontrib-qthelp 1.0.3 sphinxcontrib-serializinghtml 1.1.5 SQLAlchemy 2.0.23 srsly 2.4.8 stanza 1.1.1 starlette 0.27.0 statsmodels 0.14.0 streamlit 1.27.2 svgwrite 1.4.3 sympy 1.12 tabulate 0.9.0 tblib 3.0.0 tenacity 8.2.3 termcolor 2.3.0 textstat 0.7.3 textual 0.46.0 thinc 8.1.12 threadpoolctl 3.2.0 tifffile 2023.7.10 tiktoken 0.5.1 timm 0.5.4 tinysegmenter 0.3 tld 0.13 tldextract 5.0.1 tokenizers 0.15.0 toml 0.10.2 toolz 0.12.0 torch 2.0.1 torch-ema 0.3 torch4keras 0.1.5 torchaudio 2.0.1+cu118 torchmetrics 1.2.0 torchvision 0.15.1+cu118 tornado 6.3.3 tqdm 4.66.1 tracerite 1.1.0 trafilatura 1.6.2 transformers 4.35.2 translatehtml 1.5.2 translators 5.8.9 trio 0.22.2 trio-websocket 0.11.1 triton 2.0.0 Twisted 22.10.0 typer 0.7.0 typeshed-client 2.4.0 typing_extensions 4.8.0 tzdata 2023.3 tzlocal 5.2 uc-micro-py 1.0.2 ujson 5.8.0 Unidecode 1.3.7 uritools 4.0.2 urlextract 1.8.0 urllib3 1.26.18 uvicorn 0.24.0.post1 uvloop 0.19.0 validators 0.22.0 virtualenv 20.24.6 w3lib 2.1.2 waitress 2.1.2 wandb 0.15.12 warcio 1.7.4 wasabi 1.1.2 watchdog 3.0.0 wavedrom 2.0.3.post3 wcwidth 0.2.8 
websockets 10.4 Werkzeug 2.2.2 wget 3.2 wheel 0.41.2 wrapt 1.16.0 wsproto 1.2.0 xlrd 1.2.0 XlsxWriter 3.1.9 xorbits 0.7.1 xoscar 0.1.4 xxhash 3.4.1 yarl 1.9.2 zipfile36 0.1.3 zipp 3.17.0 zope.event 5.0 zope.interface 6.1 zstandard 0.22.0 ```
zhijianma commented 9 months ago

Yes. We will change the data type of the simhash value to string, because pyarrow is currently incompatible with uint64.
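
A rough sketch of the string-based approach described above, assuming the hash is an unsigned 64-bit integer. This is not the actual Data-Juicer change; the helper names `simhash_to_str`, `str_to_simhash`, and `hamming_distance` are hypothetical and only show that the value survives a round trip through a string and can still be compared bit by bit.

```python
def simhash_to_str(value: int) -> str:
    """Serialize an unsigned 64-bit hash as a decimal string for storage."""
    if not 0 <= value < 2**64:
        raise ValueError("expected an unsigned 64-bit integer")
    return str(value)


def str_to_simhash(text: str) -> int:
    """Parse the stored string back into a Python int."""
    return int(text)


def hamming_distance(a: int, b: int) -> int:
    """Count the differing bits between two hashes."""
    return bin(a ^ b).count("1")


# Two hypothetical fingerprints that differ in their lowest 8 bits.
a = 0xFFFF_FFFF_FFFF_FFFF
b = 0xFFFF_FFFF_FFFF_FF00

stored_a = simhash_to_str(a)
stored_b = simhash_to_str(b)
print(stored_a)  # 18446744073709551615

# Convert back to int only when the distance is needed.
print(hamming_distance(str_to_simhash(stored_a), str_to_simhash(stored_b)))  # 8
```

Python integers are arbitrary precision, so the overflow only matters at the point where the value is written into a fixed-width column; keeping it as a string in the dataset sidesteps that step.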