huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

How to tokenize big dataset #13844

Open puraminy opened 3 years ago

puraminy commented 3 years ago

Based on the examples, I am trying to train a tokenizer and a model for T5 on Google Colab Pro. I tried to run the following code:

import datasets

from t5_tokenizer_model import SentencePieceUnigramTokenizer

vocab_size = 32_000
input_sentence_size = None  # setting this to e.g. 100_000 (a subset) works

# Initialize a dataset
dataset = datasets.load_dataset("oscar", name="unshuffled_deduplicated_fa", split="train")

tokenizer = SentencePieceUnigramTokenizer(unk_token="<unk>", eos_token="</s>", pad_token="<pad>")

print("len dataset:", len(dataset))

# Build an iterator over this dataset
def batch_iterator(input_sentence_size=None):
    if input_sentence_size is None:
        input_sentence_size = len(dataset)
    batch_length = 100
    for i in range(0, input_sentence_size, batch_length):
        yield dataset[i: i + batch_length]["text"]

# Train tokenizer
tokenizer.train_from_iterator(
    iterator=batch_iterator(input_sentence_size=input_sentence_size),
    vocab_size=vocab_size,
    show_progress=True,
)

# Save files to disk
tokenizer.save("/content/drive/MyDrive/Pouramini/tokenizer.json")

It gets stuck in train_from_iterator because the dataset is large (input_sentence_size, i.e. the dataset length, is around 8M sentences). How can I divide the dataset, run the training on each block, and then merge the results into a single tokenizer?

LysandreJik commented 3 years ago

There is no implementation of anything like that yet, but thanks for the request; that's an interesting feature request. cc @SaulLu @sgugger

NielsRogge commented 3 years ago

@LysandreJik I believe there is. You can check out the guide here: https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling#train-tokenizer

This guide does exactly what you're requesting: it splits the dataset into batches so that the tokenizer can be trained properly.
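
For reference, the same batch-iterator pattern can also be expressed with transformers' train_new_from_iterator on an existing fast tokenizer; a minimal sketch, where the "t5-small" checkpoint, the output path, and the vocab size are illustrative choices, not taken from the guide:

from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("oscar", name="unshuffled_deduplicated_fa", split="train")

def batch_iterator(batch_length=1000):
    # Yield batches of raw text; only one batch is materialized at a time
    for i in range(0, len(dataset), batch_length):
        yield dataset[i : i + batch_length]["text"]

# Reuse the configuration of an existing fast tokenizer and retrain its vocabulary
old_tokenizer = AutoTokenizer.from_pretrained("t5-small")
new_tokenizer = old_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=32_000)
new_tokenizer.save_pretrained("./fa-t5-tokenizer")  # illustrative output path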

sgugger commented 3 years ago

This is exactly what is done in the code sample above as well. I'm not too sure I understand what the feature request here is: the training does not get stuck, it just takes a long time to finish. There are no progress bars in notebooks, which is a feature you can request on Tokenizers.

At some point, training a tokenizer on such a large dataset in Colab is counter-productive; this environment is not appropriate for CPU-intensive work like this. You should spin up a CPU instance (those are very cheap) to train your tokenizer, then upload the result to the Hub to re-use it once you are ready to train.
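
A rough sketch of that workflow, assuming you are logged in via huggingface-cli login; the repo id "your-username/t5-fa-tokenizer" is a placeholder:

from transformers import PreTrainedTokenizerFast

# On the CPU instance, after training: wrap the saved tokenizer.json
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    unk_token="<unk>",
    eos_token="</s>",
    pad_token="<pad>",
)
tokenizer.push_to_hub("your-username/t5-fa-tokenizer")  # placeholder repo id

# Later, on the training machine, pull it back down from the Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained("your-username/t5-fa-tokenizer")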

puraminy commented 3 years ago

Thanks. Anyway, I trained the tokenizer, and the batch iterator could be a solution; however, as @sgugger pointed out, there is no progress bar, which made me doubtful about what was happening!

But what about the dataset? I mean this line:

dataset = load_dataset("oscar", "unshuffled_deduplicated_no", split="train")

Does it load the whole dataset into memory and then call the batch iterator on it? If yes, then a batch iterator that reads the dataset from a file could remove the need for large memory.

puraminy commented 3 years ago

A related issue with the same example code is this new one:

https://github.com/huggingface/transformers/issues/13878#issue-1016415036

I would be very grateful if someone could answer it.

sgugger commented 3 years ago

No, the Datasets library never loads the samples unless you request them; it uses Apache Arrow behind the scenes (you can read more in the documentation). Using the batch iterator as you did will never load the full dataset in memory.
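
A minimal sketch of what that looks like in practice (dataset name as in the example above; the cache paths and sizes will differ per machine):

from datasets import load_dataset

dataset = load_dataset("oscar", name="unshuffled_deduplicated_fa", split="train")

# The rows live in memory-mapped Arrow files on disk, not in RAM
print(dataset.cache_files)  # paths of the Arrow files backing the dataset

# Indexing materializes only the requested slice as Python strings
batch = dataset[0:100]["text"]
print(len(batch), type(batch[0]))  # 100 <class 'str'>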