huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
19.31k stars 2.7k forks source link

Cannot get item after loading from disk and then converting to iterable. #7065

Open happyTonakai opened 4 months ago

happyTonakai commented 4 months ago

Describe the bug

The dataset generated from local file works fine.

root = "/home/data/train"
file_list1 = glob(os.path.join(root, "*part1.flac"))
file_list2 = glob(os.path.join(root, "*part2.flac"))

ds = (
    Dataset.from_dict({"part1": file_list1, "part2": file_list2})
    .cast_column("part1", Audio(sampling_rate=None, mono=False))
    .cast_column("part2", Audio(sampling_rate=None, mono=False))
)

ids = ds.to_iterable_dataset(128)
ids = ids.shuffle(buffer_size=10000, seed=42)
dataloader = DataLoader(ids, num_workers=4, batch_size=8, persistent_workers=True)

for batch in dataloader:
    break

But after saving it to disk and then loading it from disk, I cannot get data as expected.

root = "/home/data/train"
file_list1 = glob(os.path.join(root, "*part1.flac"))
file_list2 = glob(os.path.join(root, "*part2.flac"))

ds = (
    Dataset.from_dict({"part1": file_list1, "part2": file_list2})
    .cast_column("part1", Audio(sampling_rate=None, mono=False))
    .cast_column("part2", Audio(sampling_rate=None, mono=False))
)

ds.save_to_disk("./train")

ds = datasets.load_from_disk("./train")

ids = ds.to_iterable_dataset(128)
ids = ids.shuffle(buffer_size=10000, seed=42)
dataloader = DataLoader(ids, num_workers=4, batch_size=8, persistent_workers=True)

for batch in dataloader:
    break

After a long time waiting, an error occurs:

Loading dataset from disk: 100%|█████████████████████████████████████████████████████████████████████████| 165/165 [00:00<00:00, 6422.18it/s]
Traceback (most recent call last):
  File "/home/hanzerui/.conda/envs/mss/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/hanzerui/.conda/envs/mss/lib/python3.10/multiprocessing/queues.py", line 113, in get
    if not self._poll(timeout):
  File "/home/hanzerui/.conda/envs/mss/lib/python3.10/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/home/hanzerui/.conda/envs/mss/lib/python3.10/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/home/hanzerui/.conda/envs/mss/lib/python3.10/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/home/hanzerui/.conda/envs/mss/lib/python3.10/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/hanzerui/.conda/envs/mss/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3490529) is killed by signal: Killed. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/hanzerui/.conda/envs/mss/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/hanzerui/.conda/envs/mss/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/hanzerui/.vscode-server/extensions/ms-python.debugpy-2024.9.12011011/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
    cli.main()
  File "/home/hanzerui/.vscode-server/extensions/ms-python.debugpy-2024.9.12011011/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "/home/hanzerui/.vscode-server/extensions/ms-python.debugpy-2024.9.12011011/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/home/hanzerui/.vscode-server/extensions/ms-python.debugpy-2024.9.12011011/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/hanzerui/.vscode-server/extensions/ms-python.debugpy-2024.9.12011011/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/hanzerui/.vscode-server/extensions/ms-python.debugpy-2024.9.12011011/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
    exec(code, run_globals)
  File "/home/hanzerui/workspace/NetEase/test/test_datasets.py", line 60, in <module>
    for batch in dataloader:
  File "/home/hanzerui/.conda/envs/mss/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home/hanzerui/.conda/envs/mss/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
    idx, data = self._get_data()
  File "/home/hanzerui/.conda/envs/mss/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1295, in _get_data
    success, data = self._try_get_data()
  File "/home/hanzerui/.conda/envs/mss/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1146, in _try_get_data
    raise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e
RuntimeError: DataLoader worker (pid(s) 3490529) exited unexpectedly

It seems that streaming is not supported by laod_from_disk, so does that mean I cannot convert it to iterable?

Steps to reproduce the bug

  1. Create a Dataset from local files with from_dict
  2. Save it to disk with save_to_disk
  3. Load it from disk with load_from_disk
  4. Convert to iterable with to_iterable_dataset
  5. Loop the dataset

Expected behavior

Get items faster than the original dataset generated from dict.

Environment info