huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Datasets.map is severely broken #6319

Open phalexo opened 1 year ago

phalexo commented 1 year ago

Describe the bug

Regardless of how many cores I use (I have 16 or 32 threads), map slows down to a crawl at around 80% done, lingers extremely slowly until maybe 97%, and NEVER finishes the job. It just hangs.

After watching this for 27 hours I Ctrl-C out of it. Until the end one process still appears to be doing something, but it never finishes.

I saw some comments about the fast tokenizers being Rust-based and tried different variations. NOTHING works.

Steps to reproduce the bug

Running it without breaking the dataset into parts results in the same behavior. The loop was an attempt to see if this was a RAM issue.

for idx in range(100):
    dataset = load_dataset("togethercomputer/RedPajama-Data-1T-Sample", cache_dir=cache_dir, split=f'train[{idx}%:{idx+1}%]')
    dataset = dataset.map(partial(tokenize_fn, tokenizer), batched=False, num_proc=1, remove_columns=["text", "meta"])
    dataset.save_to_disk(training_args.cache_dir + f"/trainingdata{idx}")

Expected behavior

I expect map to run at more or less the same speed it starts with and FINISH its processing.

Environment info

Python 3.8 (same behavior with 3.10). Ubuntu 20.04.

mariosasko commented 1 year ago

Hi! Instead of processing a single example at a time, you should use the batched map for the best performance (with num_proc=1) - the fast tokenizers can process a batch's samples in parallel in that scenario.

E.g., the following code in Colab takes an hour to complete:

# !pip install datasets transformers
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
dataset = load_dataset("togethercomputer/RedPajama-Data-1T-Sample", split="train")  # the dataset from the report above
dataset = dataset.map(lambda ex: tokenizer(ex["text"]), batched=True, remove_columns=["text", "meta"])

phalexo commented 1 year ago

Batched is far worse. A single batch of 1000 took hours, and that was only 1% of the data.


mariosasko commented 1 year ago

Can you please provide a self-contained reproducer?

phalexo commented 1 year ago

Which specific version of datasets are you using?

What is the architecture of your Colab setup? RAM? Cores? OS?


phalexo commented 1 year ago

from functools import partial
import transformers
from datasets import load_dataset, concatenate_datasets, load_from_disk

model_name_or_path = "/opt/data/data/daryl149/llama-2-7b-chat-hf"
output_dir = "/opt/data/data/LongLoRA/checkpoints"
cache_dir = "/opt/data/data/LongLoRA/cache"
model_max_length = 16384

IGNORE_INDEX = -100
DEFAULT_PAD_TOKEN = "[PAD]"
DEFAULT_EOS_TOKEN = "</s>"
DEFAULT_BOS_TOKEN = "<s>"
DEFAULT_UNK_TOKEN = "<unk>"

tokenizer = transformers.LlamaTokenizerFast.from_pretrained(
    model_name_or_path,
    cache_dir=cache_dir,
    model_max_length=model_max_length,
    padding_side="right",
    use_fast=True,
    # use_fast=False
)

special_tokens_dict = dict()
if tokenizer.pad_token is None:
    special_tokens_dict["pad_token"] = DEFAULT_PAD_TOKEN
if tokenizer.eos_token is None:
    special_tokens_dict["eos_token"] = DEFAULT_EOS_TOKEN
if tokenizer.bos_token is None:
    special_tokens_dict["bos_token"] = DEFAULT_BOS_TOKEN
if tokenizer.unk_token is None:
    special_tokens_dict["unk_token"] = DEFAULT_UNK_TOKEN

tokenizer.add_special_tokens(special_tokens_dict)

def tokenize_fn(tokenizer, example):
    context_length = tokenizer.model_max_length
    outputs = tokenizer(
        tokenizer.eos_token.join(example["text"]),
        # truncation=False,
        truncation=True,
        return_tensors="pt",
        # return_tensors="np",
        pad_to_multiple_of=context_length,
        padding=True,
    )
    return {"input_ids": outputs["input_ids"].view(-1, context_length)}

for idx in range(100):
    dataset = load_dataset("togethercomputer/RedPajama-Data-1T-Sample", cache_dir=cache_dir, split=f'train[{idx}%:{idx+1}%]')
    dataset = dataset.map(partial(tokenize_fn, tokenizer), batched=False, num_proc=16, remove_columns=["text", "meta"])
    dataset.save_to_disk(training_args.cache_dir + f"/trainingdata{idx}")


phalexo commented 1 year ago

I changed the tokenizer to one without the "Fast" suffix, and something changed: the fraction, although it still slowed down a lot at around 80%, was able to get over the finish line to 100%.

I have to do more testing to see if the whole set can be processed.


phalexo commented 1 year ago

So, using LlamaTokenizerFast was the problem. Changing it to LlamaTokenizer fixed things.
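
For reference, a minimal sketch of that change, assuming the same model_name_or_path, cache_dir, and model_max_length as in the reproducer above:

import transformers

# before: the Rust-backed fast tokenizer, which stalled near the end of map
# tokenizer = transformers.LlamaTokenizerFast.from_pretrained(model_name_or_path, ...)

# after: the slow (pure-Python) tokenizer
tokenizer = transformers.LlamaTokenizer.from_pretrained(
    model_name_or_path,
    cache_dir=cache_dir,
    model_max_length=model_max_length,
    padding_side="right",
)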


mariosasko commented 1 year ago

Indeed, the tokenizer is super slow. Perhaps @ArthurZucker knows the reason why.

(This simplified Colab can be used to reproduce the behavior)

ewfian commented 1 year ago

Same issue here. Sample to reproduce: https://github.com/philschmid/document-ai-transformers/blob/main/training/donut_sroie.ipynb (see the map call in that notebook).

If I directly iterate over the dataset and call the mapping method, it is very fast:

for sample in dataset:
    preprocess_documents_for_donut(sample)

If I remove .convert('RGB'), it can run to completion without getting stuck, so I suspect it has something to do with the Image feature.

If I use batched mode, it's even slower.

mariosasko commented 11 months ago

@ewfian

If I directly iterate over the dataset and call the mapping method, it is very fast

Dataset.map must also convert the images into bytes to write them to an Arrow file (the write itself takes some time, too).

You can make the map faster by manually converting the images into an "arrow-compatible" representation. Otherwise, the Pillow defaults are used when saving an image, which seems particularly slow for the notebook's case.

import io
import json

# task_start_token, json2token, and eos_token come from the linked notebook
def preprocess_documents_for_donut(sample):
    text = json.loads(sample["text"])
    d_doc = task_start_token + json2token(text) + eos_token
    image = sample["image"].convert('RGB')
    # convert the image to PNG bytes ourselves (with fast compression)
    # instead of letting map fall back to the Pillow defaults
    buffer = io.BytesIO()
    image.save(buffer, format="PNG", compress_level=1)
    return {"image": {"bytes": buffer.getvalue()}, "text": d_doc}

proc_dataset = dataset.map(preprocess_documents_for_donut, writer_batch_size=50)

phalexo commented 11 months ago

The problem I had was to do with map using fork and copying locks from the parent process in an acquired state. I ended up changing the multiprocessing context to use forkserver instead.
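
A minimal sketch of that workaround, assuming the tokenize_fn and tokenizer defined in the reproducer above (datasets does its num_proc multiprocessing through the multiprocess package, so the start method is set there):

from functools import partial
import multiprocess

# start map workers from a clean forkserver process instead of forking
# the parent with the tokenizer's locks already held
multiprocess.set_start_method("forkserver", force=True)

dataset = dataset.map(
    partial(tokenize_fn, tokenizer),
    num_proc=16,
    remove_columns=["text", "meta"],
)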


yuji96 commented 3 months ago

I have faced the same issue many times.

It happens not only with the transformers tokenizer, but also when applying nltk's pos_tag (https://www.nltk.org/api/nltk.tag.pos_tag.html) to the entire English Wikipedia, so I suspect the cause is not in the tokenizer but in Dataset.map.

My case: at the beginning of the run the speed was 600 samples/s, but it slowed down to 20 samples/s at around 90% (after 3 hours). I am concerned that CPU usage was only about 5% at the end of the run, even though there was still lots of data left.

Regarding https://github.com/huggingface/datasets/issues/6319#issuecomment-1771629160: it's very nice to hear that the run is complete, but the original issue has not been solved, which is that it gets slower and slower. As it is now, Dataset.map will not be able to handle the large datasets that are getting larger day by day.

phalexo commented 3 months ago

It is the interaction of fork() inside map and the tokenizer's mutexes/locks.

You have to set up your own process pool and use forkserver instead of fork.
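
A rough sketch of what "your own process pool" could look like, as an assumption about the approach rather than code from this thread: shard the dataset and let workers started with the forkserver method each process one shard. tokenize_fn and the tokenizer are assumed to be defined at module level (as in the reproducer above), so each forkserver worker rebuilds its own copies when it imports the script:

import multiprocessing as mp
from functools import partial

from datasets import load_dataset

CACHE_DIR = "/opt/data/data/LongLoRA/cache"  # path from the reproducer above
NUM_SHARDS = 16

def process_shard(idx):
    # each worker loads the split, takes one contiguous shard, tokenizes it
    # single-process (no fork inside map), and writes the result to disk
    dataset = load_dataset("togethercomputer/RedPajama-Data-1T-Sample",
                           cache_dir=CACHE_DIR, split="train")
    shard = dataset.shard(num_shards=NUM_SHARDS, index=idx, contiguous=True)
    shard = shard.map(partial(tokenize_fn, tokenizer), remove_columns=["text", "meta"])
    shard.save_to_disk(f"{CACHE_DIR}/trainingdata{idx}")

if __name__ == "__main__":
    ctx = mp.get_context("forkserver")  # workers are not forked copies of this process
    with ctx.Pool(processes=NUM_SHARDS) as pool:
        pool.map(process_shard, range(NUM_SHARDS))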


yuji96 commented 3 months ago

Thank you for your advice!

I added multiprocess.set_start_method("forkserver"), but the result seemed to be the same. In my case it may come down to the very simple fact that the roughly 10% of examples that contain long text never finish. I'll try sharding by data size.
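
For what it's worth, a minimal sketch of sharding before map; process_fn here is a placeholder for whatever per-example function is applied (e.g. a pos_tag wrapper), and this only shards by example count, so sharding by actual data size would still need a custom split:

num_shards = 64
for i in range(num_shards):
    # contiguous shards keep the long-text region localized to a few shards
    shard = dataset.shard(num_shards=num_shards, index=i, contiguous=True)
    shard = shard.map(process_fn, num_proc=16)
    shard.save_to_disk(f"processed/shard_{i}")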

ArthurZucker commented 3 months ago

I would recommend using LlamaTokenizerFast, not LlamaTokenizer!