phalexo opened this issue 1 year ago
Hi! Instead of processing a single example at a time, you should use batched `map` for the best performance (with `num_proc=1`); the fast tokenizers can process a batch's samples in parallel in that scenario.
E.g., the following code in Colab takes an hour to complete:
```python
# !pip install datasets transformers
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
dataset = load_dataset("togethercomputer/RedPajama-Data-1T-Sample", split="train")  # the dataset discussed in this issue
dataset = dataset.map(lambda ex: tokenizer(ex["text"]), batched=True, remove_columns=["text", "meta"])
```
Batched is far worse. A single batch of 1000 took hours, and that was only 1%.
Can you please provide a self-contained reproducer?
Which specific version of datasets are you using?
What is the architecture of your Colab setup? RAM? Cores? OS?
```python
from functools import partial

import transformers
from datasets import load_dataset, concatenate_datasets, load_from_disk

model_name_or_path = "/opt/data/data/daryl149/llama-2-7b-chat-hf"
output_dir = "/opt/data/data/LongLoRA/checkpoints"
cache_dir = "/opt/data/data/LongLoRA/cache"
model_max_length = 16384

IGNORE_INDEX = -100
DEFAULT_PAD_TOKEN = "[PAD]"
DEFAULT_EOS_TOKEN = "</s>"   # standard Llama special-token strings
DEFAULT_BOS_TOKEN = "<s>"
DEFAULT_UNK_TOKEN = "<unk>"

tokenizer = transformers.LlamaTokenizerFast.from_pretrained(
    model_name_or_path,
    cache_dir=cache_dir,
    model_max_length=model_max_length,
    padding_side="right",
    use_fast=True,
)

special_tokens_dict = dict()
if tokenizer.pad_token is None:
    special_tokens_dict["pad_token"] = DEFAULT_PAD_TOKEN
if tokenizer.eos_token is None:
    special_tokens_dict["eos_token"] = DEFAULT_EOS_TOKEN
if tokenizer.bos_token is None:
    special_tokens_dict["bos_token"] = DEFAULT_BOS_TOKEN
if tokenizer.unk_token is None:
    special_tokens_dict["unk_token"] = DEFAULT_UNK_TOKEN

tokenizer.add_special_tokens(special_tokens_dict)


def tokenize_fn(tokenizer, example):
    context_length = tokenizer.model_max_length
    outputs = tokenizer(
        tokenizer.eos_token.join(example["text"]),
        truncation=True,
        return_tensors="pt",
        # return_tensors="np",
        pad_to_multiple_of=context_length,
        padding=True,
    )
    return {"input_ids": outputs["input_ids"].view(-1, context_length)}


for idx in range(100):
    dataset = load_dataset(
        "togethercomputer/RedPajama-Data-1T-Sample",
        cache_dir=cache_dir,
        split=f"train[{idx}%:{idx+1}%]",
    )
    dataset = dataset.map(
        partial(tokenize_fn, tokenizer),
        batched=False,
        num_proc=16,
        remove_columns=["text", "meta"],
    )
    # `training_args` comes from the full training script; for a standalone
    # run, the `cache_dir` defined above serves the same purpose.
    dataset.save_to_disk(training_args.cache_dir + f"/trainingdata{idx}")
```
I changed the tokenizer to one without the "Fast" suffix, and something changed. The fraction, although it still slowed down a lot at 80%, was able to get over the finish line to 100%.
I have to do more testing to see if the whole set can be processed.
So, using LlamaTokenizerFast was the problem. Changing it to LlamaTokenizer fixed things.
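For reference, a minimal sketch of the slow-tokenizer variant that finished, assuming the same `model_name_or_path`, `cache_dir`, and `model_max_length` as in the reproducer above (everything else unchanged):

```python
import transformers

# SentencePiece-based (non-Rust) tokenizer instead of LlamaTokenizerFast.
tokenizer = transformers.LlamaTokenizer.from_pretrained(
    model_name_or_path,
    cache_dir=cache_dir,
    model_max_length=model_max_length,
    padding_side="right",
)
```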
Indeed, the tokenizer is super slow. Perhaps @ArthurZucker knows the reason why.
(This simplified Colab can be used to reproduce the behavior)
Same issue here. Sample to reproduce: https://github.com/philschmid/document-ai-transformers/blob/main/training/donut_sroie.ipynb, with the following map line: https://github.com/philschmid/document-ai-transformers/blob/main/training/donut_sroie.ipynb
If I directly iterate over the dataset and call the mapping method, it is very fast:

```python
for sample in dataset:
    preprocess_documents_for_donut(sample)
```

If I remove `.convert('RGB')` from `preprocess_documents_for_donut(sample)`, it can run to completion without getting stuck, so I suspect it has something to do with the Image. If I use batched mode, it's even slower.
@ewfian

> If I directly iterate over the dataset and call the mapping method, it is very fast

`Dataset.map` must also convert the images into bytes to write them to an Arrow file (the write itself takes some time, too).

You can make the `map` faster by manually converting the images into an "arrow-compatible" representation. Otherwise, the Pillow defaults are used when saving an image, which seems particularly slow for the notebook's case.
```python
def preprocess_documents_for_donut(sample):
    text = json.loads(sample["text"])
    d_doc = task_start_token + json2token(text) + eos_token
    image = sample["image"].convert('RGB')
    # convert image to bytes
    buffer = io.BytesIO()
    image.save(buffer, format="PNG", compress_level=1)
    return {"image": {"bytes": buffer.getvalue()}, "text": d_doc}

proc_dataset = dataset.map(preprocess_documents_for_donut, writer_batch_size=50)
```
The problem I had was due to `map` using fork and copying locks from the parent process in an acquired state. I ended up changing the context to use forkserver instead.
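For anyone who wants to try the same change, a minimal sketch, assuming the `dataset`, `tokenizer`, and `tokenize_fn` from the reproducer above (`datasets` uses the `multiprocess` package for its `num_proc` workers, so the start method needs to be set before the first `map` call):

```python
from functools import partial

import multiprocess  # the multiprocessing fork used by `datasets` for num_proc workers

if __name__ == "__main__":
    # Forkserver workers start from a clean server process instead of
    # fork()-copying the parent, so they do not inherit locks (e.g. a Rust
    # tokenizer's mutex) that happen to be held at fork time.
    multiprocess.set_start_method("forkserver", force=True)

    dataset = dataset.map(
        partial(tokenize_fn, tokenizer),  # assumed from the reproducer above
        batched=False,
        num_proc=16,
        remove_columns=["text", "meta"],
    )
```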
I face the same issue many times.

Not only when using the transformers tokenizer, but also when applying nltk's pos_tag to the entire English Wikipedia. So I suspect the cause is not in the tokenizer but in `Dataset.map`.

My case: at the beginning of the run the speed was 600 samples/s, but it slowed down to 20 samples/s at around 90% (after 3 hours). I am concerned that CPU usage was only about 5% at the end of the run, even though there was still lots of data left.

Regarding https://github.com/huggingface/datasets/issues/6319#issuecomment-1771629160: it's very nice to hear that the run completed, but the original issue has not been solved, which is that it gets slower and slower. As it is now, `Dataset.map` will not be able to handle the large datasets that are getting larger day by day.
It is the interaction of fork() inside `map` and the tokenizer's mutexes/locks. You have to set up your own process pool and use forkserver instead of fork.
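As a concrete illustration (not the exact code used), a sketch of setting up your own forkserver-based pool and bypassing `Dataset.map` for the tokenization step; `dataset` and `model_name_or_path` are assumed to come from the reproducer earlier in the thread, and each worker builds its own tokenizer so no locked mutex is inherited from the parent:

```python
import multiprocessing as mp

from transformers import AutoTokenizer

_tok = None

def _init_worker(model_name):
    # Runs once per worker: build a fresh tokenizer inside the child
    # instead of fork()-copying the parent's (and its mutexes).
    global _tok
    _tok = AutoTokenizer.from_pretrained(model_name)

def _tokenize_chunk(texts):
    return _tok(texts, truncation=True)["input_ids"]

if __name__ == "__main__":
    ctx = mp.get_context("forkserver")  # forkserver instead of plain fork
    texts = dataset["text"]             # plain list of strings, assumed from the reproducer above
    chunks = [texts[i:i + 1000] for i in range(0, len(texts), 1000)]
    with ctx.Pool(processes=16, initializer=_init_worker,
                  initargs=(model_name_or_path,)) as pool:
        input_ids = [ids for chunk_ids in pool.map(_tokenize_chunk, chunks)
                     for ids in chunk_ids]
```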
Thank you for your advice!
I added `multiprocess.set_start_method("forkserver")`, but the result seemed to be the same. In my case, it may be due to the very simple fact that about 10% of the processing, which includes long texts, never ends. I'll try sharding by data size.
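In case it is useful to others, a rough sketch of that sharding approach with `Dataset.shard` (splitting by row count rather than literal data size), assuming the `dataset`, `tokenizer`, `tokenize_fn`, and `cache_dir` from the reproducer earlier in the thread:

```python
from functools import partial

num_shards = 32
for i in range(num_shards):
    # Each shard is mapped and saved independently, so a handful of very
    # long documents can only stall their own shard, not the whole run.
    shard = dataset.shard(num_shards=num_shards, index=i, contiguous=True)
    shard = shard.map(
        partial(tokenize_fn, tokenizer),
        batched=False,
        num_proc=1,
        remove_columns=["text", "meta"],
    )
    shard.save_to_disk(f"{cache_dir}/trainingdata_shard_{i}")
```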
Would recommend using `LlamaTokenizerFast`, not `LlamaTokenizer`!
Describe the bug
Regardless of how many cores I use (I have 16 or 32 threads), `map` slows down to a crawl at around 80% done, lingers extremely slowly until maybe 97%, and NEVER finishes the job. It just hangs.
After watching this for 27 hours I Ctrl-C out of it. Until the end, one process appears to be doing something, but it never finishes.
I saw some comments about fast tokenizers using Rust and all, and tried different variations. NOTHING works.
Steps to reproduce the bug
Running it without breaking the dataset into parts results in the same behavior. The loop was an attempt to see if this was a RAM issue.
```python
for idx in range(100):
    dataset = load_dataset(
        "togethercomputer/RedPajama-Data-1T-Sample",
        cache_dir=cache_dir,
        split=f"train[{idx}%:{idx+1}%]",
    )
    dataset = dataset.map(
        partial(tokenize_fn, tokenizer),
        batched=False,
        num_proc=1,
        remove_columns=["text", "meta"],
    )
    dataset.save_to_disk(training_args.cache_dir + f"/trainingdata{idx}")
```
Expected behavior
I expect map to run at more or less the same speed it starts with and FINISH its processing.
Environment info
Python 3.8 (same with 3.10, makes no difference). Ubuntu 20.04.