huggingface / datasets


Filter occasionally hangs #6393

Open · dakinggg opened this issue 1 year ago

dakinggg commented 1 year ago

Describe the bug

A call to .filter occasionally hangs after the filter itself is complete, according to the tqdm progress bar.

A trace is produced:

Exception ignored in: <function Dataset.__del__ at 0x7efb48130c10>
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/datasets/arrow_dataset.py", line 1366, in __del__
    if hasattr(self, "_indices"):
  File "/usr/lib/python3/dist-packages/composer/core/engine.py", line 123, in sigterm_handler
    sys.exit(128 + signal)
SystemExit: 143

I'm not sure whether the trace actually originates in datasets or in surrounding code that is trying to clean up after datasets gets stuck.

Unfortunately, I can't reproduce this issue anywhere close to reliably. It happens infrequently when using num_proc > 1. Anecdotally, I started seeing it when using larger datasets (~10M samples).

Steps to reproduce the bug

N/A; see the description above.

Expected behavior

map/filter calls always complete successfully.

Environment info

dakinggg commented 1 year ago

It looks like I may not be the first to encounter this: https://github.com/huggingface/datasets/issues/3172

dakinggg commented 1 year ago

To add some more information: it seems to occur more frequently with large datasets (millions of samples).

dakinggg commented 1 year ago

More information: my code is structured as (1) load, (2) map, (3) filter, (4) filter. It was always the second filter that hung. Combining the two filters into one seems to work reliably; see the sketch below.
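
For illustration, a minimal sketch of the workaround, assuming placeholder predicates pred_a and pred_b, a placeholder file path, and a "text" column in place of the original (unshared) setup:

# Hedged sketch of the workaround described above; everything here is
# a placeholder, not the original code.
from datasets import load_dataset

dataset = load_dataset("json", data_files="data.jsonl", split="train")  # placeholder path

def pred_a(example):
    return example["text"] != ""  # placeholder predicate

def pred_b(example):
    return len(example["text"]) < 4096  # placeholder predicate

# Chaining two filter calls: the second call is where the hang occurred.
# filtered = dataset.filter(pred_a, num_proc=8).filter(pred_b, num_proc=8)

# Workaround: apply both predicates in a single filter call.
filtered = dataset.filter(lambda ex: pred_a(ex) and pred_b(ex), num_proc=8)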

dakinggg commented 11 months ago

@lhoestq it'd be great if someone had a chance to look at this. I suspect it is impacting many users given the other issue that I linked.

lhoestq commented 11 months ago

Hi! Sorry for the late response. Was it happening after the first or the second filter?

It looks like an issue with the garbage collector (which would explain the randomness). Maybe datasets created with filter are not always handled properly? cc @mariosasko

dakinggg commented 11 months ago

It was after the second filter (and combining the two filters into one seemingly resolved it). I obviously haven't tried all settings to know that these details are causal, but it did work for me.

lhoestq commented 11 months ago

Thanks, that's good to know.

The stack trace suggests an issue when del self._indices is called, which happens when a filtered dataset goes out of scope. The indices are a PyArrow table memory-mapped from disk, so I'm not quite sure how calling del on it can cause this issue. We del self._indices to make sure the file on disk is no longer used by the current process, and to avoid e.g. permission errors.
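
To make that cleanup path concrete, here is a simplified sketch (not the actual arrow_dataset.py code) of what the traceback's __del__ is doing:

# Simplified sketch of Dataset.__del__ as implied by the traceback;
# the real implementation lives in datasets/arrow_dataset.py.
class Dataset:
    def __del__(self):
        if hasattr(self, "_indices"):
            del self._indices  # drop the memory-mapped PyArrow indices table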

Hopefully we can find a way to reproduce this error; otherwise it will be quite hard to understand what happened.

dakinggg commented 11 months ago

Yeah, I have a reliable repro, but it is not even close to minimal and it uses a dataset I can't share. Perhaps you could try getting close to my setting:

(1) Make a large (~20GB) jsonl with prompt/response pairs.
(2) Load it on a Linux machine (dataset = load_dataset(...)).
(3) Map a tokenizer over it, with multiprocessing (tokenized_dataset = dataset.map(...)).
(4) Filter it once based on something, with multiprocessing (filtered_1 = tokenized_dataset.filter(...)).
(5) Filter it again based on something, with multiprocessing (filtered_2 = filtered_1.filter(...)).

I included the variable names just in case it is relevant that I was creating a new dataset each time rather than overwriting the same variable. The steps are put together as code in the sketch below.
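
For concreteness, a hedged sketch of those steps; the file path, tokenizer choice, and filter predicates are placeholders rather than the originals:

# Hedged sketch of the repro described above.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

# (1)-(2): load a large jsonl of prompt/response pairs
dataset = load_dataset("json", data_files="pairs.jsonl", split="train")

# (3): map a tokenizer over it, with multiprocessing
tokenized_dataset = dataset.map(lambda ex: tokenizer(ex["prompt"]), num_proc=8)

# (4): first filter, with multiprocessing
filtered_1 = tokenized_dataset.filter(lambda ex: len(ex["input_ids"]) > 0, num_proc=8)

# (5): second filter, with multiprocessing; this is where the hang appeared
filtered_2 = filtered_1.filter(lambda ex: len(ex["input_ids"]) <= 2048, num_proc=8)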

dakinggg commented 8 months ago

@lhoestq I have another version of the repro that seems fairly reliable. I have lots of jsonl files, and I iteratively load each one with load_dataset('json', data_files='path/to/my/file.jsonl', streaming=False, split='train') and then call dataset.map(..., num_proc=<int>). This iteration hangs in a random place each time, so it seems like there is a bug that hits with some frequency. A sketch of the loop follows.
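
A hedged sketch of that loop; the shard list, map function, and "text" column are illustrative placeholders:

# Hedged sketch of the iterative repro described above.
from datasets import load_dataset

def process(example):
    return {"n_chars": len(example["text"])}  # placeholder map function

# Placeholder shard list; the original used many jsonl files.
for path in ["shard_00.jsonl", "shard_01.jsonl", "shard_02.jsonl"]:
    dataset = load_dataset("json", data_files=path, streaming=False, split="train")
    dataset = dataset.map(process, num_proc=8)  # hangs at a random iteration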

dakinggg commented 8 months ago

With num_proc=None it works fine.

agokrani commented 8 months ago

I am also having a similar issue to #3172 when trying to tokenize the data. My dataset contains 10M samples. Is there anything that could be done without having to split the processing up into multiple datasets?