huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Loading local datasets got strangely stuck #6108

Open LoveCatc opened 1 year ago

LoveCatc commented 1 year ago

Describe the bug

I tried to use load_dataset() to load several local .jsonl files as a dataset. Every line of these files is a JSON object containing only one key, text (yes, it is a dataset for an NLP model). The code snippet is:

from datasets import load_dataset

ds = load_dataset("json", data_files=LIST_OF_FILE_PATHS, num_proc=16)["train"]
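Each line of the .jsonl files looks roughly like this (the content below is just a placeholder, not from the real data):

{"text": "a long document string ..."}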

However, I found that the loading process can get stuck: the progress bar Generating train split stops making progress. While trying to find the cause and a solution, I noticed a really strange behavior. If I load the dataset this way instead:

from datasets import concatenate_datasets, load_dataset

# load each file separately, then concatenate the results
dlist = list()
for path in LIST_OF_FILE_PATHS:
    dlist.append(load_dataset("json", data_files=path)["train"])
ds = concatenate_datasets(dlist)

I can actually load all the files successfully, despite the slow speed. But if I load them in a batch as above, things go wrong. I tried using Ctrl-C to trace the point where it gets stuck, but the program cannot be terminated that way when num_proc is set to None; the only thing I can do is suspend it with Ctrl-Z and then kill it. If I use more than 2 CPUs, a Ctrl-C simply causes the following error:

^C
Process ForkPoolWorker-1:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.10/dist-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/multiprocess/pool.py", line 114, in worker
    task = get()
  File "/usr/local/lib/python3.10/dist-packages/multiprocess/queues.py", line 368, in get
    res = self._reader.recv_bytes()
  File "/usr/local/lib/python3.10/dist-packages/multiprocess/connection.py", line 224, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/usr/local/lib/python3.10/dist-packages/multiprocess/connection.py", line 422, in _recv_bytes
    buf = self._recv(4)
  File "/usr/local/lib/python3.10/dist-packages/multiprocess/connection.py", line 387, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt
Generating train split: 92431 examples [01:23, 1104.25 examples/s]  
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/datasets/utils/py_utils.py", line 1373, in iflatmap_unordered
    yield queue.get(timeout=0.05)
  File "<string>", line 2, in get
  File "/usr/local/lib/python3.10/dist-packages/multiprocess/managers.py", line 818, in _callmethod
    kind, result = conn.recv()
  File "/usr/local/lib/python3.10/dist-packages/multiprocess/connection.py", line 258, in recv
    buf = self._recv_bytes()
  File "/usr/local/lib/python3.10/dist-packages/multiprocess/connection.py", line 422, in _recv_bytes
    buf = self._recv(4)
  File "/usr/local/lib/python3.10/dist-packages/multiprocess/connection.py", line 387, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/data/liyongyuan/source/batch_load.py", line 11, in <module>
    a = load_dataset(
  File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 2133, in load_dataset
    builder_instance.download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 954, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1049, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1842, in _prepare_split
    for job_id, done, content in iflatmap_unordered(
  File "/usr/local/lib/python3.10/dist-packages/datasets/utils/py_utils.py", line 1387, in iflatmap_unordered
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/usr/local/lib/python3.10/dist-packages/datasets/utils/py_utils.py", line 1387, in <listcomp>
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/usr/local/lib/python3.10/dist-packages/multiprocess/pool.py", line 770, in get
    raise TimeoutError
multiprocess.context.TimeoutError

I have validated the basic correctness of these .jsonl files: they are correctly formatted (otherwise they could not be loaded individually by load_dataset), though some of the JSON lines contain very long text (more than 1e7 characters). I do not know whether that could be the problem. There should not be any bottleneck in system resources either: the whole dataset is ~300 GB, and I am using a cloud server with plenty of storage and 1 TB of RAM. Thanks for your efforts and patience! Any suggestion or help would be appreciated.
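A rough sketch of the kind of per-line check I mean (reusing LIST_OF_FILE_PATHS from above; max_len is just an illustrative name):

import json

# Rough sketch: confirm every line parses as JSON with a "text" key,
# and record the longest text value to spot the >1e7-character outliers.
max_len = 0
for path in LIST_OF_FILE_PATHS:
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            obj = json.loads(line)  # raises if the line is malformed
            assert "text" in obj, f"{path}:{lineno} has no 'text' key"
            max_len = max(max_len, len(obj["text"]))
print("longest text field:", max_len)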

Steps to reproduce the bug

  1. Call load_dataset() with data_files=LIST_OF_FILE_PATHS (a list of local .jsonl files), as in the snippet above.

Expected behavior

All the files should load smoothly.

Environment info

LoveCatc commented 1 year ago

Yesterday I waited for more than 12 hours to make sure it was really stuck rather than just proceeding too slowly.

harpone commented 1 year ago

I've had similar weird issues with load_dataset as well. Not with multiple files, but the dataset is quite big, about 50 GB.

mariosasko commented 1 year ago

We use generic multiprocessing code, so there is little we can do about this; unfortunately, turning off multiprocessing seems to be the only solution. Multithreading would make our code easier to maintain and (most likely) avoid issues such as this one, but we cannot use it until the GIL is dropped (no-GIL Python should be released in 2024, so we can start exploring it then).
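For reference, turning multiprocessing off just means dropping num_proc from the original call (a minimal sketch based on the snippet in the issue):

from datasets import load_dataset

# Single-process loading: without num_proc, no worker pool is spawned,
# so the multiprocessing hang described above cannot occur (it may be slow).
ds = load_dataset("json", data_files=LIST_OF_FILE_PATHS)["train"]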

harpone commented 1 year ago

The problem seems to be the Generating train split step. Is it possible to avoid it? I have a dataset saved and just want to load it, but I am somehow running into issues with that again.
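(For what it's worth, a dataset that was saved with Dataset.save_to_disk can be reloaded with load_from_disk, which reads the saved Arrow files directly and does not go through Generating train split; a minimal sketch, with a placeholder path:)

from datasets import load_from_disk

# Reload a dataset previously written with save_to_disk; this reads the
# Arrow files directly and skips the "Generating train split" step.
ds = load_from_disk("/path/to/saved_dataset")  # placeholder path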

LoveCatc commented 1 year ago

Hey guys, I recently ran into this problem again and spent a whole day trying to locate it. I finally found that the problem seems to be with pyarrow's JSON parser, and it appears to be a long-standing one; a similar issue can be found in #2181. Anyway, my solution is to adjust load_dataset's chunksize parameter. You can inspect the default set in datasets/packaged_modules/json/json.py; the actual chunksize is currently quite small, and you can increase it. For me, chunksize=10 << 23 solved the stuck loading. But I also found that a chunksize that is too big, like 10 << 30, also causes a hang, which is rather weird. I may explore this when I am free. I hope this helps those who run into the same problem.
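To make the workaround concrete, a minimal sketch of passing a larger chunksize through load_dataset (10 << 23 is the value reported above; tune it for your own data):

from datasets import load_dataset

# chunksize is forwarded to the JSON builder config
# (datasets/packaged_modules/json/json.py), controlling how many bytes
# pyarrow's JSON parser reads per block.
ds = load_dataset(
    "json",
    data_files=LIST_OF_FILE_PATHS,  # same list of local .jsonl paths as above
    num_proc=16,
    chunksize=10 << 23,  # larger than the default; 10 << 30 reportedly hangs again
)["train"]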

RomanEngeler1805 commented 8 months ago

Experiencing the same issue with the kaist-ai/Feedback-Collection dataset, which is comparatively small, i.e. ~100k rows. Code to reproduce:

from datasets import load_dataset
dataset = load_dataset("kaist-ai/Feedback-Collection")

I have tried setting num_proc=1 as well as chunksize=1024 and 64, but the problem persists. Any pointers?