LoveCatc opened 1 year ago
Yesterday I waited for more than 12 hours to make sure it was really stuck instead of just proceeding too slowly.
I've had similar weird issues with `load_dataset` as well. Not multiple files in my case, but the dataset is quite big, about 50 GB.
We use generic multiprocessing code, so there is little we can do about this - unfortunately, turning off multiprocessing seems to be the only solution. Multithreading would make our code easier to maintain and (most likely) avoid issues such as this one, but we cannot use it until the GIL is dropped (no-GIL Python should be released in 2024, so we can start exploring this then).
The problem seems to be the `Generating train split` step. Is it possible to avoid it? I have a dataset saved and just want to load it, but I am somehow running into issues with that again.
Hey guys, recently I ran into this problem again and I spent one whole day trying to locate it. I finally found that the problem seems to be with `pyarrow`'s JSON parser, and it looks like a long-standing issue; something similar can be found in #2181. Anyway, my workaround is to adjust `load_dataset`'s `chunksize` parameter. You can inspect the value set in `datasets/packaged_modules/json/json.py`; the current default chunksize is very small, and you can increase it. For me, `chunksize=10<<23` solved the hang. But I also found that a chunksize that is too big, like `10 << 30`, would also cause a hang, which is rather weird. I may explore this when I am free. Hope this can help those who encounter the same problem.
Experiencing the same issue with the `kaist-ai/Feedback-Collection` dataset, which is comparatively small, i.e. ~100k rows.
Code to reproduce:

```python
from datasets import load_dataset

dataset = load_dataset("kaist-ai/Feedback-Collection")
```

I have tried setting `num_proc=1` as well as `chunksize=1024` and `64`, but the problem persists. Any pointers?
Describe the bug

I try to use `load_dataset()` to load several local `.jsonl` files as a dataset. Every line of these files is a JSON structure containing only one key, `text` (yeah, it is a dataset for an NLP model). However, I found that the loading process can get stuck: the `Generating train split` progress bar no longer proceeds. While trying to find the cause, I noticed a really strange behavior: if I load the files one by one, I can successfully load all of them, despite the slow speed. But if I load them in a batch, things go wrong. I tried to use Ctrl-C to trace the stuck point, but the program cannot be terminated that way when `num_proc` is set to `None`; the only thing I can do is suspend it with Ctrl-Z and then kill it. If I use more than 2 CPUs, a Ctrl-C simply causes an error. I have validated the basic correctness of these `.jsonl` files: they are correctly formatted (otherwise they could not be loaded individually by `load_dataset`), though some of the JSON objects contain very long text (more than 1e7 characters). I do not know if this could be the problem. There should not be any bottleneck in system resources: the whole dataset is ~300 GB, and I am using a cloud server with plenty of storage and 1 TB of RAM. Thanks for your efforts and patience! Any suggestion or help would be appreciated.

Steps to reproduce the bug
```python
data_files = LIST_OF_FILES
```
Expected behavior

All the files should load smoothly.
Environment info

`.jsonl` files, ~300 GB in total. Each JSON structure contains only one key: `text`. Format checked.

- `datasets` version: 2.14.2