dakinggg opened this issue 1 year ago
It looks like I may not be the first to encounter this: https://github.com/huggingface/datasets/issues/3172
Adding some more information, it seems to occur more frequently with large (millions of samples) datasets.
Some more information: my code is structured as (1) load, (2) map, (3) filter, (4) filter. It was always the second filter that failed. Combining the two filters into one seems to work reliably.
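For readers hitting the same thing, here is a minimal sketch of the combined-filter workaround; the jsonl path, column names, predicates, and `num_proc` value are placeholders for illustration, not the reporter's actual code:

```python
from datasets import load_dataset

# Hypothetical prompt/response dataset, just to show the shape of the workaround.
dataset = load_dataset("json", data_files="data.jsonl", split="train")

# Pattern that intermittently hung: two separate .filter calls with multiprocessing.
filtered_1 = dataset.filter(lambda ex: len(ex["prompt"]) > 0, num_proc=8)
filtered_2 = filtered_1.filter(lambda ex: len(ex["response"]) > 0, num_proc=8)

# Workaround reported above: a single .filter call with both conditions combined.
filtered = dataset.filter(
    lambda ex: len(ex["prompt"]) > 0 and len(ex["response"]) > 0,
    num_proc=8,
)
```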
@lhoestq it'd be great if someone had a chance to look at this. I suspect it is impacting many users given the other issue that I linked.
Hi ! Sorry for the late response. Was it happening after the first or the second filter ?
It looks like an issue with the garbage collector (which makes it random). Maybe datasets created with `filter` are not always handled properly? cc @mariosasko
It was after the second filter (and combining the two filters into one seemingly resolved it). I obviously haven't tried all settings to know that these details are causal, but it did work for me.
Thanks, that's good to know.
The stacktrace suggests an issue when `del self._indices` is called, which happens when a filtered dataset falls out of scope. The indices are a PyArrow table memory-mapped from disk, so I'm not quite sure how calling `del` on it can cause this issue. We do `del self._indices` to make sure the file on disk is no longer used by the current process and to avoid e.g. permission errors.
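To make the "falls out of scope" part concrete, here is a small illustrative sketch; the comments describe the cleanup only as I understand it from the explanation above, not the actual `datasets` internals:

```python
from datasets import Dataset

# Toy stand-in for the large memory-mapped dataset in the report.
ds = Dataset.from_dict({"x": list(range(10_000))})

filtered_1 = ds.filter(lambda ex: ex["x"] % 2 == 0, num_proc=2)
filtered_2 = filtered_1.filter(lambda ex: ex["x"] % 3 == 0, num_proc=2)

# Once filtered_1 is no longer referenced (an explicit del here, or simply
# leaving the enclosing function), its cleanup releases the memory-mapped
# indices table, i.e. the `del self._indices` step described above. The
# reported hang appears to occur around this cleanup when num_proc > 1 was used.
del filtered_1
```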
Hopefully we can find a way to reproduce this error; otherwise it will be quite hard to understand what happened.
Yeah, I have a reliable repro, but it is not even close to minimal and uses a dataset I can't share. Perhaps you could try getting close to my setting.
(1) make a large (~20GB) jsonl with prompt/response pairs
(2) load it on a linux machine (`dataset = load_dataset(...)`)
(3) map a tokenizer to it, with multiprocessing (`tokenized_dataset = dataset.map(...)`)
(4) filter it once based on something, with multiprocessing (`filtered_1 = tokenized_dataset.filter(...)`)
(5) filter it again based on something, with multiprocessing (`filtered_2 = filtered_1.filter(...)`)
I included the variable names just in case it is relevant that I was creating new datasets each time, not overwriting the same variable.
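A rough, untested sketch of the pipeline described in the steps above; the file path, tokenizer, column names, filter predicates, and `num_proc` values are all placeholders rather than the reporter's actual code:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# (1)/(2) load a large local jsonl of prompt/response pairs
dataset = load_dataset("json", data_files="data/prompt_response.jsonl", split="train")

# (3) tokenize with multiprocessing
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize(example):
    return tokenizer(example["prompt"] + example["response"])

tokenized_dataset = dataset.map(tokenize, num_proc=16)

# (4) first filter, with multiprocessing
filtered_1 = tokenized_dataset.filter(lambda ex: len(ex["input_ids"]) > 0, num_proc=16)

# (5) second filter, with multiprocessing; this is the step that intermittently hung
filtered_2 = filtered_1.filter(lambda ex: len(ex["input_ids"]) <= 2048, num_proc=16)
```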
@lhoestq I have another version of the repro that seems fairly reliable. I have lots of jsonl files, and I iteratively load each one with `load_dataset('json', data_files='path/to/my/file.jsonl', streaming=False, split='train')` and then `dataset.map(..., num_proc=<int>)`. This iteration hangs in a random place each time, so it seems like there is a bug that hits with some frequency. With `num_proc=None` it works fine.
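For concreteness, a hedged sketch of the iterative variant described in this comment; the glob pattern, column name, map function, and `num_proc` value are assumptions, not the reporter's code:

```python
from glob import glob
from datasets import load_dataset

def process(example):
    # Placeholder per-example transformation; assumes a "text" column exists.
    example["n_chars"] = len(example["text"])
    return example

for path in sorted(glob("data/*.jsonl")):
    dataset = load_dataset("json", data_files=path, streaming=False, split="train")
    # With num_proc > 1 this loop reportedly hangs at a random iteration;
    # with num_proc=None it completes fine.
    dataset = dataset.map(process, num_proc=8)
```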
I am also having a similar issue to #3172 when trying to tokenize the data. My dataset contains 10M samples. Is there anything that could be done without having to split up the processing into multiple datasets?
Describe the bug

A call to `.filter` occasionally hangs (after the filter is complete, according to tqdm). There is a trace produced, but I'm not sure if the trace is actually from `datasets`, or from surrounding code that is trying to clean up after `datasets` gets stuck. Unfortunately I can't reproduce this issue anywhere close to reliably. It happens infrequently when using `num_procs > 1`. Anecdotally, I started seeing it when using larger datasets (~10M samples).

Steps to reproduce the bug

N/A, see description.

Expected behavior

map/filter calls always complete successfully.

Environment info

`datasets` version: 2.14.6