scampion closed this issue 1 week ago
I downloaded the jsonl file and extracted it manually. The issue seems to be related to pyarrow.json.
python3 -q -X faulthandler -c "from datasets import load_dataset; load_dataset('json', data_files='/Users/scampion/Downloads/1998-09.jsonl')"
Generating train split: 0 examples [00:00, ? examples/s]
Fatal Python error: Segmentation fault
Thread 0x00007000000c1000 (most recent call first):
The error comes from data where one line contains "null"
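One way to locate such lines is to scan the file with the standard library before handing it to `datasets`. This is a minimal sketch, assuming that "contains null" means a line whose entire JSON value is `null` (rather than a record with a null field); the function names `find_null_lines` and `write_without_null_lines` are hypothetical helpers, not part of `datasets` or `pyarrow`.

```python
import json


def find_null_lines(path):
    """Return the 1-based numbers of lines whose whole JSON value is null."""
    null_lines = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            text = line.strip()
            if not text:
                continue  # blank lines are skipped, not reported
            try:
                value = json.loads(text)
            except json.JSONDecodeError:
                continue  # malformed lines are a separate problem; ignored here
            if value is None:  # assumption: the offending line is a bare `null`
                null_lines.append(lineno)
    return null_lines


def write_without_null_lines(src, dst):
    """Copy src to dst, dropping bare-null lines. Returns how many were dropped."""
    skip = set(find_null_lines(src))
    with open(src, encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for lineno, line in enumerate(fin, start=1):
            if lineno not in skip:
                fout.write(line)
    return len(skip)
```

The cleaned copy could then be passed to `load_dataset('json', data_files=...)` in the same way as the local file in the command above. Whether this avoids the segmentation fault has not been verified here.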
Describe the bug
With various versions of datasets, I am no longer able to load this dataset without a segmentation fault. Several other files are also affected.
Steps to reproduce the bug
Create a new venv
python3 -m venv venv_test
source venv_test/bin/activate
Install the latest version
pip install datasets
Load that dataset
python3 -q -X faulthandler -c "from datasets import load_dataset; load_dataset('EuropeanParliament/Eurovoc', '1998-09')"
Expected behavior
The dataset should load without a segmentation fault.
Environment info
datasets==2.19.0
Python 3.11.7
Darwin 22.5.0 Darwin Kernel Version 22.5.0: Mon Apr 24 20:51:50 PDT 2023; root:xnu-8796.121.2~5/RELEASE_X86_64 x86_64