Open puraminy opened 3 years ago
There is no implementation of anything like that yet, but thanks for raising it, it's an interesting feature request. cc @SaulLu @sgugger
@LysandreJik I believe there is. You can check out the guide here: https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling#train-tokenizer
This guide does exactly what you're requesting: it splits the dataset into batches so that the tokenizer can be trained properly.
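For context, the batch-iterator approach from that guide looks roughly like this (a minimal sketch, not the guide verbatim; the vocabulary size and special tokens are placeholder values):

```python
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

# Samples are memory-mapped by the Datasets library, not loaded into RAM.
dataset = load_dataset("oscar", "unshuffled_deduplicated_no", split="train")

def batch_iterator(batch_size=1000):
    # Yield the text column in small batches so the trainer never needs
    # the full corpus in memory at once.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    batch_iterator(),
    vocab_size=50_265,  # placeholder value, pick what your model needs
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save("tokenizer.json")
```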
This is exactly what is done in the code sample above as well. I'm not too sure I understand what the feature request here is: the training does not get stuck, it just takes a long time to finish. There are no progress bars in notebooks; that is a feature you can request on the Tokenizers repository.
At some point, training a tokenizer on such a large dataset in Colab is counter-productive: this environment is not appropriate for CPU-intensive work like this. You should spin up a CPU instance (those are very cheap) to train your tokenizer, then upload the result to the Hub so you can re-use it once you are ready to train.
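As an illustration of that workflow, a sketch of uploading the trained tokenizer from the CPU machine and re-using it later (the local path and repository names below are placeholders):

```python
from transformers import AutoTokenizer

# On the CPU instance: load the trained tokenizer files and push them once.
tokenizer = AutoTokenizer.from_pretrained("./my-trained-tokenizer")
tokenizer.push_to_hub("my-new-tokenizer")

# Later, from Colab or anywhere else: re-use it without retraining.
tokenizer = AutoTokenizer.from_pretrained("your-username/my-new-tokenizer")
```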
Thanks, anyway I trained the tokenizer, and the batch iterator could be a solution. However, as @sgugger pointed out, with no progress bar I was doubtful about what was happening!
But what about the dataset? I meant this line:
dataset = load_dataset("oscar", "unshuffled_deduplicated_no", split="train")
Does it load the whole dataset into memory and then call the batch iterator on it? If yes, then a batch iterator that reads the dataset from a file could remove the need for large memory.
A related issue with the same example code is this new one:
https://github.com/huggingface/transformers/issues/13878#issue-1016415036
I would be very grateful if someone could answer me.
No, the Datasets library never loads the samples unless you request them, using Apache Arrow behind the scenes (you can read more in the documentation). Using the batch iterator as you did will never load the full dataset in memory.
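As a rough illustration of that behaviour (just a sketch of what gets materialized, not the internal mechanism):

```python
from datasets import load_dataset

# Backed by an Arrow file on disk (memory-mapped); this does not pull
# all samples into RAM.
dataset = load_dataset("oscar", "unshuffled_deduplicated_no", split="train")
print(len(dataset))  # metadata only, no rows decoded

# Only this slice of 1000 rows is actually read and decoded into Python objects.
batch = dataset[0:1000]["text"]
```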
Based on the examples, I am trying to train a tokenizer and a model for T5. I use Google Colab Pro, and when I try to run the following code:
It gets stuck in train_from_iterator because the size of the dataset is large (input_sentence_size is around 8M sentences). How can I divide the dataset, run the code on each block, and then merge the results into a single tokenizer output?