huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Segmentation fault #6858

Closed scampion closed 1 week ago

scampion commented 2 weeks ago

Describe the bug

With several versions of datasets, I can no longer load this dataset without a segmentation fault. Several other files are also affected.

Steps to reproduce the bug

Create a new venv

python3 -m venv venv_test
source venv_test/bin/activate

Install the latest version

pip install datasets

Load that dataset

python3 -q -X faulthandler -c "from datasets import load_dataset; load_dataset('EuropeanParliament/Eurovoc', '1998-09')"
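Since a segmentation fault kills the interpreter outright, it can be convenient to run the load in a child process and read its exit status. The following is a minimal sketch using only the standard library. `run_isolated` is a hypothetical helper name, and the negative return code convention it relies on applies to POSIX systems such as the macOS machine used here.

```python
import signal
import subprocess
import sys


def run_isolated(code: str) -> tuple[bool, str]:
    """Run `code` in a fresh interpreter with faulthandler enabled.

    Returns (segfaulted, stderr). On POSIX, a child killed by a signal
    reports a negative return code equal to minus the signal number.
    """
    result = subprocess.run(
        [sys.executable, "-q", "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
    )
    segfaulted = result.returncode == -signal.SIGSEGV
    return segfaulted, result.stderr
```

Passing the one-liner above as `code` should give `True` as the first element on an affected setup, with the faulthandler dump in the second.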

Expected behavior

The dataset should load without errors.

Environment info

datasets==2.19.0
Python 3.11.7
Darwin 22.5.0 Darwin Kernel Version 22.5.0: Mon Apr 24 20:51:50 PDT 2023; root:xnu-8796.121.2~5/RELEASE_X86_64 x86_64

scampion commented 2 weeks ago

I downloaded the jsonl file and extracted it manually. The issue seems to be related to pyarrow.json

python3 -q -X faulthandler -c "from datasets import load_dataset; load_dataset('json', data_files='/Users/scampion/Downloads/1998-09.jsonl')"
Generating train split: 0 examples [00:00, ? examples/s]Fatal Python error: Segmentation fault

Thread 0x00007000000c1000 (most recent call first):

Thread 0x00007000024df000 (most recent call first):
  File "/usr/local/Cellar/python@3.11/3.11.7/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 331 in wait
  File "/usr/local/Cellar/python@3.11/3.11.7/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 629 in wait
  File "/Users/scampion/src/test/venv_test/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/local/Cellar/python@3.11/3.11.7/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/usr/local/Cellar/python@3.11/3.11.7/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00007ff845c66640 (most recent call first):
  File "/Users/scampion/src/test/venv_test/lib/python3.11/site-packages/datasets/packaged_modules/json/json.py", line 122 in _generate_tables
  File "/Users/scampion/src/test/venv_test/lib/python3.11/site-packages/datasets/builder.py", line 1995 in _prepare_split_single
  File "/Users/scampion/src/test/venv_test/lib/python3.11/site-packages/datasets/builder.py", line 1882 in _prepare_split
  File "/Users/scampion/src/test/venv_test/lib/python3.11/site-packages/datasets/builder.py", line 1122 in _download_and_prepare
  File "/Users/scampion/src/test/venv_test/lib/python3.11/site-packages/datasets/builder.py", line 1027 in download_and_prepare
  File "/Users/scampion/src/test/venv_test/lib/python3.11/site-packages/datasets/load.py", line 2609 in load_dataset
  File "<string>", line 1 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pyarrow.lib, pyarrow._hdfsio, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, charset_normalizer.md, yaml._yaml, pyarrow._parquet, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json (total: 72)
[1] 56678 segmentation fault python3 -q -X faulthandler -c
/usr/local/Cellar/python@3.11/3.11.7/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
(venv_test)

scampion commented 1 week ago

The error comes from the data: one line of the file contains "null".
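To locate such lines before handing the file to the loader, each line can be parsed with the standard `json` module. This is a minimal sketch, assuming the file is JSON Lines with one object per line and that the offending line is a bare `null`. `find_non_object_lines` is a hypothetical helper, not part of datasets or pyarrow.

```python
import json
from pathlib import Path


def find_non_object_lines(path: str) -> list[tuple[int, str]]:
    """Return (line_number, reason) for each line that is not a JSON object."""
    problems = []
    with Path(path).open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            text = line.strip()
            if not text:
                problems.append((lineno, "blank line"))
                continue
            try:
                value = json.loads(text)
            except json.JSONDecodeError as exc:
                problems.append((lineno, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(value, dict):
                # Covers a bare `null`, which parses to None.
                problems.append((lineno, f"not an object: {type(value).__name__}"))
    return problems
```

Running it on the extracted 1998-09.jsonl should list the line numbers that need to be dropped or repaired before loading.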