huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

AssertionError: daemonic processes are not allowed to have children #6089

Open codingl2k1 opened 11 months ago

codingl2k1 commented 11 months ago

Describe the bug

When I call load_dataset with num_proc > 0 in a daemon process, I get an error:

  File "/Users/codingl2k1/Work/datasets/src/datasets/download/download_manager.py", line 564, in download_and_extract
    return self.extract(self.download(url_or_urls))
    ^^^^^^^^^^^^^^^^^
  File "/Users/codingl2k1/Work/datasets/src/datasets/download/download_manager.py", line 427, in download
    downloaded_path_or_paths = map_nested(
    ^^^^^^^^^^^^^^^^^
  File "/Users/codingl2k1/Work/datasets/src/datasets/utils/py_utils.py", line 468, in map_nested
    mapped = parallel_map(function, iterable, num_proc, types, disable_tqdm, desc, _single_map_nested)
    ^^^^^^^^^^^^^^^^^
  File "/Users/codingl2k1/Work/datasets/src/datasets/utils/experimental.py", line 40, in _inner_fn
    return fn(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/Users/codingl2k1/Work/datasets/src/datasets/parallel/parallel.py", line 34, in parallel_map
    return _map_with_multiprocessing_pool(
    ^^^^^^^^^^^^^^^^^
  File "/Users/codingl2k1/Work/datasets/src/datasets/parallel/parallel.py", line 64, in _map_with_multiprocessing_pool
    with Pool(num_proc, initargs=initargs, initializer=initializer) as pool:
      ^^^^^^^^^^^^^^^^^
  File "/Users/codingl2k1/.pyenv/versions/3.11.4/lib/python3.11/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
    ^^^^^^^^^^^^^^^^^
  File "/Users/codingl2k1/.pyenv/versions/3.11.4/lib/python3.11/multiprocessing/pool.py", line 215, in __init__
    self._repopulate_pool()
    ^^^^^^^^^^^^^^^^^
  File "/Users/codingl2k1/.pyenv/versions/3.11.4/lib/python3.11/multiprocessing/pool.py", line 306, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
    ^^^^^^^^^^^^^^^^^
  File "/Users/codingl2k1/.pyenv/versions/3.11.4/lib/python3.11/multiprocessing/pool.py", line 329, in _repopulate_pool_static
    w.start()
  File "/Users/codingl2k1/.pyenv/versions/3.11.4/lib/python3.11/multiprocessing/process.py", line 118, in start
    assert not _current_process._config.get('daemon'),
    ^^^^^^^^^^^^^^^^^
AssertionError: daemonic processes are not allowed to have children

The download is I/O-intensive work, so maybe datasets could replace the multiprocessing pool with a multithreading pool when running in a daemon process.
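A minimal sketch of that idea, assuming a hypothetical helper inside the download path (none of these names are from the library); it falls back to a thread pool when the current process is daemonic:

    import multiprocessing
    from concurrent.futures import ThreadPoolExecutor

    def _map_in_parallel(function, iterable, num_proc):
        # Hypothetical helper: a daemonic process cannot spawn children,
        # so use a thread pool there instead of a multiprocessing Pool.
        if multiprocessing.current_process().daemon:
            with ThreadPoolExecutor(max_workers=num_proc) as pool:
                return list(pool.map(function, iterable))
        with multiprocessing.Pool(num_proc) as pool:
            return pool.map(function, iterable)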

Steps to reproduce the bug

  1. start a daemon process
  2. run load_dataset with num_proc > 0 (see the reproduction sketch below)
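
A minimal reproduction sketch following those steps; "squad" is just an arbitrary example dataset, any dataset with downloadable files should do:

    import multiprocessing
    from datasets import load_dataset

    def worker():
        # num_proc > 0 makes the download manager create a multiprocessing Pool,
        # which raises AssertionError inside a daemonic process.
        load_dataset("squad", num_proc=2)

    if __name__ == "__main__":
        p = multiprocessing.Process(target=worker, daemon=True)
        p.start()
        p.join()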

Expected behavior

No error.

Environment info

Python 3.11.4
datasets: latest master

mariosasko commented 11 months ago

We could add a "threads" parallel backend to datasets.parallel.parallel_backend to support downloading with threads. Note, however, that download_and_extract also decompresses archives, which is a CPU-intensive task and therefore not a good fit for (Python) threads (threads are good for I/O-intensive tasks).
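
For illustration only, usage of such a backend might look like the sketch below. The "threads" backend name is hypothetical (it is only being proposed here); datasets.parallel.parallel_backend itself exists as an experimental API, and "squad" is just an example dataset:

    from datasets import load_dataset
    from datasets.parallel import parallel_backend

    # Hypothetical: "threads" is the proposed backend name, not an existing one.
    # Threads would help the I/O-bound download step, but archive decompression
    # is CPU-bound and would stay serialized by the GIL.
    with parallel_backend("threads"):
        ds = load_dataset("squad", num_proc=4)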

codingl2k1 commented 11 months ago

> We could add a "threads" parallel backend to datasets.parallel.parallel_backend to support downloading with threads. Note, however, that download_and_extract also decompresses archives, which is a CPU-intensive task and therefore not a good fit for (Python) threads (threads are good for I/O-intensive tasks).

Great! Downloading takes more time than extracting, and multiple threads can download in parallel, which speeds things up a lot.