Open codingl2k1 opened 11 months ago
We could add a "threads" parallel backend to datasets.parallel.parallel_backend
to support downloading with threads. Note, however, that download_and_extract
also decompresses archives, which is a CPU-intensive task. Python threads are well suited to I/O-intensive tasks but not to CPU-intensive ones.
Great! Downloading takes more time than extracting, and multiple threads can download in parallel, which could speed things up a lot.
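To illustrate the point about parallel downloads, here is a minimal sketch using the standard library's thread pool. It is not how datasets implements downloading: `fake_download` and `download_all` are hypothetical names, the URLs are placeholders, and the network wait is simulated with `time.sleep`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url: str) -> str:
    """Hypothetical stand-in for a real download (an I/O-bound wait)."""
    time.sleep(0.1)  # simulated network latency; sleeping releases the GIL
    return f"downloaded:{url}"


def download_all(urls, max_workers=4):
    """Run the simulated downloads concurrently in a thread pool."""
    # executor.map returns results in the same order as the input URLs.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fake_download, urls))


# Placeholder URLs, for illustration only.
urls = [f"https://example.com/file{i}" for i in range(4)]
results = download_all(urls)
```

Because each simulated download only waits, the threads overlap their waits instead of running one after another. A CPU-bound step such as decompression would not benefit in the same way.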
Describe the bug
When I call load_dataset with num_proc > 0 in a daemon process, I get an error:
Downloading is I/O-intensive, so maybe datasets could replace the multiprocessing pool with a multithreading pool when running in a daemon process.
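A rough sketch of that suggestion, assuming nothing about datasets internals: Python's multiprocessing does not allow a daemonic process to create child processes, so a caller could check `multiprocessing.current_process().daemon` and fall back to a thread pool. `pick_pool_class` is a hypothetical helper, not an existing datasets function.

```python
import multiprocessing
from multiprocessing.pool import Pool, ThreadPool


def pick_pool_class(daemon=None):
    """Hypothetical helper: choose a pool type that works in this process.

    If `daemon` is None, the daemon flag of the current process is used.
    """
    if daemon is None:
        daemon = multiprocessing.current_process().daemon
    # A daemonic process cannot start child processes, so use threads there.
    return ThreadPool if daemon else Pool


# Pretend we are inside a daemon process: a thread pool is selected.
pool_cls = pick_pool_class(daemon=True)
with pool_cls(2) as pool:
    lengths = pool.map(len, ["a", "bb", "ccc"])
```

Both pool classes expose the same `map` interface, so the calling code would not need to change depending on which one is selected.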
Steps to reproduce the bug
Expected behavior
No error.
Environment info
Python 3.11.4, datasets latest master