huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

using map on loaded Tokenizer 10x - 100x slower than default Tokenizer? #1830

Open wumpusman opened 3 years ago

wumpusman commented 3 years ago

This could totally be me misunderstanding particular call functions, but I added words to a GPT2Tokenizer and saved it to disk (note I'm only showing snippets, but I can share more), and the map function ran much slower:

def save_tokenizer(original_tokenizer, text, path="simpledata/tokenizer"):
    words_unique = set(text.split(" "))
    for word in words_unique:
        original_tokenizer.add_tokens(word)
    original_tokenizer.save_pretrained(path)

tokenizer2 = GPT2Tokenizer.from_pretrained(os.path.join(experiment_path, experiment_name, "tokenizer_squad"))

train_set_baby = Dataset.from_dict({"text": [train_set["text"][0][0:50]]})

I then applied the dataset map function on a fairly small set of text:

%%time
train_set_baby = train_set_baby.map(lambda d: tokenizer2(d["text"]), batched=True)

The run time for train_set_baby.map was 6 seconds, and the batch itself was 2.6 seconds

100% 1/1 [00:02<00:00, 2.60s/ba]
CPU times: user 5.96 s, sys: 36 ms, total: 5.99 s
Wall time: 5.99 s

In comparison, using the default fast tokenizer (even after adding additional tokens):

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

%%time
train_set_baby = train_set_baby.map(lambda d: tokenizer(d["text"]), batched=True)

The time is:

100% 1/1 [00:00<00:00, 34.09ba/s]
CPU times: user 68.1 ms, sys: 16 µs, total: 68.1 ms
Wall time: 62.9 ms

It seems this might relate to the tokenizer save or load function; however, the issue appears to come up when I apply the loaded tokenizer in the map function.

I should also add that the number of words I add to the tokenizer before saving it to disk and loading it back into memory appears to affect how long the map function takes to run.

lhoestq commented 3 years ago

Hi @wumpusman, datasets has a caching mechanism that caches the results of .map so that when you re-run it later it doesn't recompute everything. So when you call .map, what actually happens is:

  1. compute the hash used to identify your map for the cache
  2. apply your function on every batch

This can explain the time difference between your different experiments.

The hash computation time depends on how complex your function is. For a tokenizer, the hash computation scans the tokenizer's word lists to identify it; this usually takes 2-3 seconds.
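If you want to see the hashing cost by itself, a rough sketch is something like this (assuming your datasets version exposes Hasher under datasets.fingerprint):

import time

from datasets.fingerprint import Hasher
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def tokenize_fn(d):
    return tokenizer(d["text"])  # the same callable you would pass to .map

start = time.perf_counter()
Hasher.hash(tokenize_fn)  # roughly the fixed cost paid on each .map call
print(f"hashing took {time.perf_counter() - start:.2f}s")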

Also note that you can disable caching using

import datasets

datasets.set_caching_enabled(False)

wumpusman commented 3 years ago

Hi @lhoestq ,

Thanks for the reply. It's entirely possible that's the issue. Since it's a side project I won't be looking at it until later this week, but I'll verify it by disabling caching and hopefully I'll see the same runtime.

Appreciate the reference,

Michael

johncookds commented 3 years ago

I believe this is an actual issue: tokenizing a ~4GB txt file went from an hour and a half to ~10 minutes when I switched from my pre-trained tokenizer (trained on the same dataset) to the default gpt2 tokenizer. Both were loaded using:

AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

I trained the tokenizer using ByteLevelBPETokenizer from the Tokenizers library and saved it to a tokenizer.json file.
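For context, the training and saving flow was roughly the following (the path and parameters here are placeholders, not my exact setup):

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["data/train.txt"],          # placeholder path
    vocab_size=50257,
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)

# new-style single-file save, later loaded via AutoTokenizer
tokenizer.save("tokenizer.json")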

I have tested the caching ideas above, changed the number of processes, set the TOKENIZERS_PARALLELISM env variable, tried keep_in_memory=True, and batched with different sizes.
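Roughly, the variations looked like this (illustrative values only, and the gpt2 checkpoint stands in for my trained tokenizer):

import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"   # also tried "true"

from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=True)
dataset = Dataset.from_dict({"text": ["some example line"] * 10000})

tokenized = dataset.map(
    lambda d: tokenizer(d["text"]),
    batched=True,
    batch_size=1000,       # tried several batch sizes
    num_proc=4,            # tried several process counts
    keep_in_memory=True,   # avoids writing a cache file to disk
)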

Apologies, I can't really upload much code, but I wanted to back up the finding and hopefully help the problem get found/fixed. I will comment back if I find a fix as well.

lhoestq commented 3 years ago

Hi @johncookds, do you think this could come from one tokenizer being faster than the other? Can you try comparing their speed without using datasets, just to make sure?
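Something along these lines should isolate the tokenizers from datasets (the first path below is a placeholder for your trained tokenizer):

import time

from transformers import AutoTokenizer

texts = ["some example sentence to tokenize"] * 10000

for name in ["path/to/trained_tokenizer", "gpt2"]:   # first entry is a placeholder
    tok = AutoTokenizer.from_pretrained(name, use_fast=True)
    start = time.perf_counter()
    tok(texts)
    print(f"{name}: {time.perf_counter() - start:.2f}s")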

johncookds commented 3 years ago

Hi, yes, I'm closing the loop here with some timings below. The issue seems to be at least partly, and maybe mainly, with the tokenizers themselves. Moreover, legacy saves of the trained tokenizer perform faster, but differently, than the new tokenizer.json saves (note that nothing about the training process or the adding of special tokens changed between the top two trained-tokenizer tests, only the way the tokenizer was saved). This is only a 3x slowdown rather than something like 10x, but I think the overall slowdown is most likely due to this.

trained tokenizer - tokenizer.json save (same results for AutoTokenizer legacy_format=False):
Tokenizer time(seconds): 0.32767510414123535
Tokenized avg. length: 323.01

trained tokenizer - AutoTokenizer legacy_format=True:
Tokenizer time(seconds): 0.09258866310119629
Tokenized avg. length: 301.01

GPT2 Tokenizer from huggingface:
Tokenizer time(seconds): 0.1010282039642334
Tokenized avg. length: 461.21

wumpusman commented 3 years ago

@lhoestq ,

Hi, which version of datasets has datasets.set_caching_enabled(False)? I get module 'datasets' has no attribute 'set_caching_enabled'. To hopefully get around this, I reran my code on a new set of data, and did so only once.

@johncookds, thanks for chiming in; it looks like this might be an issue with the tokenizer.

Tokenizer: the runtime of GPT2TokenizerFast.from_pretrained("gpt2") on 1000 chars is 143 ms
SlowTokenizer: the runtime of a locally saved and loaded tokenizer using the same vocab on 1000 chars is 4.43 s

That being said, I compared performance on the map function:

Running Tokenizer versus using it in the map function for 1000 chars goes from 141 ms to 356 ms.
Running SlowTokenizer versus using it in the map function for 1000 chars with a single element goes from 4.43 s to 9.76 s.

I'm trying to figure out why the overhead of map would double the time (I figured it would be a fixed increase in time)? Though maybe this is expected behavior.

@lhoestq, do you by chance know how I can redirect this issue to Tokenizer?

Regards,

Michael

lhoestq commented 3 years ago

Thanks for the experiments @johncookds and @wumpusman !

Hi, which version of datasets has datasets.set_caching_enabled(False)?

Currently you have to install datasets from source to have this feature, but this will be available in the next release in a few days.

I'm trying to figure out why the overhead of map would double the time (I figured it would be a fixed increase in time)? Though maybe this is expected behavior.

Could you also try with double the number of characters? This should give us an idea of the fixed cost (hashing) and the dynamic cost (actual tokenization, which grows with the size of the input).
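For example, something like this (text and sizes are arbitrary) would show whether the extra time from .map grows with the input or stays roughly constant:

import time

from datasets import Dataset
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")   # swap in the slow tokenizer to compare
text = "lorem ipsum dolor sit amet " * 1000

for n_chars in (1000, 2000, 4000):
    sample = text[:n_chars]

    start = time.perf_counter()
    tokenizer(sample)                                    # tokenizer alone
    direct = time.perf_counter() - start

    ds = Dataset.from_dict({"text": [sample]})
    start = time.perf_counter()
    ds.map(lambda d: tokenizer(d["text"]), batched=True)
    via_map = time.perf_counter() - start

    print(f"{n_chars} chars: tokenizer {direct:.3f}s, .map {via_map:.3f}s")

If the gap between the two timings stays roughly constant as the input doubles, the overhead is mostly the fixed hashing cost.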

@lhoestq, do you by chance know how I can redirect this issue to Tokenizer?

Feel free to post an issue on the transformers repo. Also, I'm sure there are related issues already, so you can also look for someone with the same concerns on the transformers repo.

wumpusman commented 3 years ago

@lhoestq,

I just checked: that previous run time was actually on 3000 chars. I increased it to 6k chars, again roughly double.

SlowTokenizer: 7.4 s to 15.7 s
Tokenizer: 276 ms to 616 ms

I'll post this issue on the Tokenizers side; it seems it hasn't quite been raised yet (albeit I noticed a similar issue that might relate).

Regards,

Michael

johncookds commented 3 years ago

Hi, I'm following up here as I found my exact issue. It was with saving and re-loading the tokenizer. When I trained and then processed the data without saving and re-loading the tokenizer, it was 10x-100x faster than when I saved and re-loaded it. Both resulted in the exact same tokenized datasets. There is additionally a bug where the older legacy tokenizer save does not preserve a learned tokenizing behavior if trained from scratch. I understand it's not exactly Datasets related, but I hope it can help someone with the same issue. Thanks!
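For anyone who wants to avoid the save/re-load round trip, a minimal sketch of the in-memory path, here using the lower-level Tokenizer API rather than ByteLevelBPETokenizer and assuming your transformers version accepts tokenizer_object (the path is a placeholder):

from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Train a byte-level BPE tokenizer entirely in memory
raw_tokenizer = Tokenizer(models.BPE())
raw_tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
trainer = trainers.BpeTrainer(vocab_size=50257, special_tokens=["<|endoftext|>"])
raw_tokenizer.train(["data/train.txt"], trainer)   # placeholder path

# Wrap the in-memory tokenizer directly instead of saving and re-loading it
hf_tokenizer = PreTrainedTokenizerFast(tokenizer_object=raw_tokenizer)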