google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

resume/restart training of tokenizer #1018

Closed. ganeshkrishnan1 closed this issue 3 months ago.

ganeshkrishnan1 commented 3 months ago

Is it possible to resume/restart training on a new dataset? I saw #9, but that only offers a workaround of reducing input_size. I would like to train the tokenizer by iterating through huge text files that won't fit in memory.

I can't find appropriate documentation for this, and ChatGPT is hallucinating answers about it.
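
For reference, a minimal sketch of feeding a corpus that doesn't fit in memory through the Python bindings; the file layout, model prefix, and sizes below are assumptions, and this only streams and samples the input rather than resuming an earlier training run:

```python
import glob
import sentencepiece as spm

def stream_sentences(pattern):
    # Lazily yield one sentence per line from many large files,
    # so the whole corpus never has to be concatenated in memory.
    for path in sorted(glob.glob(pattern)):
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield line

spm.SentencePieceTrainer.train(
    sentence_iterator=stream_sentences("corpus/*.txt"),  # hypothetical layout
    model_prefix="spm_unigram",                          # hypothetical name
    vocab_size=32000,
    input_sentence_size=10_000_000,  # cap on sentences the trainer keeps in memory
    shuffle_input_sentence=True,     # sample across the stream instead of taking a prefix
)
```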

taku910 commented 3 months ago

We are not sure what resume/restart means in the context of handling large amounts of data. How would you expect the existing trained model to change? It is theoretically possible, though not implemented, to intentionally extend the vocab or update the unigram probabilities with new data, but the resulting model file would differ from a model trained in a single pass on the original large data.
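
For illustration only, a rough sketch of what manually extending the vocab of an already-trained model could look like by editing the serialized model proto. This is not a supported workflow; the paths, pieces, and placeholder scores are assumptions, and nothing here re-estimates unigram probabilities from new data, which is exactly the unimplemented part described above.

```python
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Load an already-trained model into its protobuf form.
m = sp_pb2.ModelProto()
with open("old.model", "rb") as f:                 # hypothetical path
    m.ParseFromString(f.read())

# Append pieces seen in the new data. The score is a placeholder,
# not a probability estimated from the new corpus.
existing = {p.piece for p in m.pieces}
for piece in ["▁newdomainword", "▁anotherterm"]:   # hypothetical new pieces
    if piece not in existing:
        new_piece = m.pieces.add()
        new_piece.piece = piece
        new_piece.score = -20.0                    # rough placeholder log-prob

with open("extended.model", "wb") as f:
    f.write(m.SerializeToString())

# The extended model loads like any other, but it is not equivalent to a
# model retrained from scratch on the combined corpus.
sp = spm.SentencePieceProcessor(model_file="extended.model")
```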

ganeshkrishnan1 commented 3 months ago

I mean that both the model and the vocab should change when we iteratively train on new data. Right now there are two challenges with this model:

1) We cannot update the model/vocab when new data arrives.
2) It does not support a large corpus, because the internal array overflows memory.

We are trying to solve 2) for now by using LevelDB instead of the vector. If it's successful I will let you know. It will be painfully slow, but it can support a virtually unlimited corpus size.

1) would be good to have. Otherwise the alternative is to keep collecting data and then retrain from scratch on the combined corpus.

taku910 commented 3 months ago

Because subword tokenization is a method in which the vocabulary size is fixed in advance, there is no theoretical definition of incremental training at this moment.

Since the vocab size in subword tokenization is small, sampling the training data works in most cases; e.g., the top 32k subwords will not change drastically as long as the data is sampled correctly.
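
As a rough sketch of that sampling suggestion using the standard training options (the shard paths, sample size, and vocab size here are assumptions):

```python
import sentencepiece as spm

# Instead of loading the full corpus, let the trainer sample a subset;
# for a 32k vocab the top subwords are usually stable under such sampling.
spm.SentencePieceTrainer.train(
    input="corpus/part-00.txt,corpus/part-01.txt",  # hypothetical shard paths
    model_prefix="spm_sampled",                     # hypothetical name
    vocab_size=32000,
    input_sentence_size=5_000_000,  # number of sentences actually loaded
    shuffle_input_sentence=True,    # draw them randomly across the input
)
```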