Loading dataset from disk: 100%|█████████████████████████████████████████████████████████████████████████| 165/165 [00:00<00:00, 6422.18it/s]
Traceback (most recent call last):
  File "/home/hanzerui/.conda/envs/mss/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/hanzerui/.conda/envs/mss/lib/python3.10/multiprocessing/queues.py", line 113, in get
    if not self._poll(timeout):
  File "/home/hanzerui/.conda/envs/mss/lib/python3.10/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/home/hanzerui/.conda/envs/mss/lib/python3.10/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/home/hanzerui/.conda/envs/mss/lib/python3.10/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/home/hanzerui/.conda/envs/mss/lib/python3.10/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/hanzerui/.conda/envs/mss/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3490529) is killed by signal: Killed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/hanzerui/.conda/envs/mss/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/hanzerui/.conda/envs/mss/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/hanzerui/.vscode-server/extensions/ms-python.debugpy-2024.9.12011011/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
    cli.main()
  File "/home/hanzerui/.vscode-server/extensions/ms-python.debugpy-2024.9.12011011/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "/home/hanzerui/.vscode-server/extensions/ms-python.debugpy-2024.9.12011011/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/home/hanzerui/.vscode-server/extensions/ms-python.debugpy-2024.9.12011011/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/hanzerui/.vscode-server/extensions/ms-python.debugpy-2024.9.12011011/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/hanzerui/.vscode-server/extensions/ms-python.debugpy-2024.9.12011011/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
    exec(code, run_globals)
  File "/home/hanzerui/workspace/NetEase/test/test_datasets.py", line 60, in <module>
    for batch in dataloader:
  File "/home/hanzerui/.conda/envs/mss/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home/hanzerui/.conda/envs/mss/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
    idx, data = self._get_data()
  File "/home/hanzerui/.conda/envs/mss/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1295, in _get_data
    success, data = self._try_get_data()
  File "/home/hanzerui/.conda/envs/mss/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1146, in _try_get_data
    raise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e
RuntimeError: DataLoader worker (pid(s) 3490529) exited unexpectedly
It seems that streaming is not supported by load_from_disk. Does that mean I cannot convert the loaded dataset to an iterable dataset?
Steps to reproduce the bug
Create a Dataset from local files with from_dict
Save it to disk with save_to_disk
Load it from disk with load_from_disk
Convert to iterable with to_iterable_dataset
Loop over the dataset
Expected behavior
Iterating over the reloaded dataset should return items faster than the original dataset generated with from_dict.
Describe the bug
The dataset generated from local files works fine.
But after saving it to disk and then loading it from disk, I cannot get data as expected.
After a long wait, a DataLoader worker is killed and the traceback above is raised.
Environment info
datasets version: 2.20.0
huggingface_hub version: 0.23.2
fsspec version: 2024.5.0