Closed by ganeshkrishnan1 5 months ago
We are not sure what resume/restart refers to in the context of handling large amounts of data. How would you expect the existing trained model to change? It is theoretically possible, though not implemented, to intentionally extend the vocab or update the unigram probabilities with new data, but the resulting model file would differ from a model trained directly on the original large data.
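As a rough illustration only: the exported .vocab files are plain TSV (piece and score), so "extending the vocab" could be sketched as a hand-merge of two such files. This is not a supported workflow, the file names are placeholders, and the result cannot be loaded back as a .model without also rebuilding the model proto.

```python
# Hand-merging two exported .vocab files (TSV: piece <TAB> score) as a toy
# illustration of "extending the vocab with new data". This does NOT produce
# a loadable .model; the model proto would also have to be rebuilt.

def load_vocab(path):
    """Read a SentencePiece .vocab file into a {piece: score} dict."""
    vocab = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            piece, score = line.rstrip("\n").split("\t")
            vocab[piece] = float(score)
    return vocab

def merge_vocabs(old_path, new_path, out_path):
    """Union of the two vocabs; the original scores win for shared pieces."""
    merged = load_vocab(new_path)
    merged.update(load_vocab(old_path))   # keep original scores where they exist
    with open(out_path, "w", encoding="utf-8") as f:
        for piece, score in sorted(merged.items(), key=lambda kv: -kv[1]):
            f.write(f"{piece}\t{score}\n")

# merge_vocabs("old.vocab", "new_data.vocab", "merged.vocab")  # placeholder names
```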
I mean that both the model and the vocab should change when we iteratively train on new data. Right now there are two challenges with this model:

1) It is not possible to update the model/vocab when we get new data.
2) It does not support a large corpus, because the internal arrays do not fit in memory.

We are trying to solve 2) for now by using LevelDB instead of the in-memory vector. If it's successful, I will let you know. It will be painfully slow, but it can support a virtually unlimited corpus size.
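Roughly, the idea is something like the sketch below. It assumes the plyvel LevelDB binding and naive whitespace tokens; it is not the actual SentencePiece internals, just the general RAM-for-disk trade-off, and the file names are placeholders.

```python
# Rough sketch of the LevelDB idea: keep token counts on disk instead of in an
# in-memory vector/dict. Uses the plyvel binding and naive whitespace tokens;
# this is not how SentencePiece builds its candidate list, it only shows the
# RAM-for-disk trade-off (one read + one write per token, hence very slow).
import plyvel

def count_tokens(corpus_paths, db_path="token_counts.ldb"):
    db = plyvel.DB(db_path, create_if_missing=True)
    for path in corpus_paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                for tok in line.split():
                    key = tok.encode("utf-8")
                    old = db.get(key)
                    db.put(key, str(int(old) + 1 if old else 1).encode("utf-8"))
    return db

# db = count_tokens(["shard-000.txt", "shard-001.txt"])   # placeholder shards
# for piece, count in db.iterator():                      # iterate counts from disk
#     ...
```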
1) would be good to have. Otherwise, the alternative is to keep collecting data and then retrain from scratch all over again on the combined data.
Because subword tokenization is a method where the vocabulary size is determined in advance, there is no theoretical definition of incremental training at this moment.

Since the vocab size in subword tokenization is small, sampling the input works in most cases. For example, the top 32k subwords will not change drastically as long as the data is sampled correctly.
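For example, something along these lines lets the trainer do the sampling itself; `input_sentence_size` and `shuffle_input_sentence` are existing trainer options, while the file and model names are placeholders:

```python
import sentencepiece as spm

# Train on a 10M-sentence random sample of the corpus instead of the full data.
spm.SentencePieceTrainer.train(
    input="huge_corpus.txt",           # can also be a comma-separated list of files
    model_prefix="tok32k",
    vocab_size=32000,
    model_type="unigram",
    input_sentence_size=10_000_000,    # keep only this many sentences for training
    shuffle_input_sentence=True,       # draw them randomly from the whole input
)
```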
Is it possible to resume/restart training on a new dataset? I saw #9, but that only offers a workaround of reducing input_size. I would like to train the tokenizer by iterating through huge text files that won't fit in memory.

I can't find appropriate documentation for this, and ChatGPT hallucinates answers about it.
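The closest pattern I have found is streaming sentences through the Python wrapper's sentence_iterator, as in the official Python example; as far as I can tell the trainer still keeps the sampled sentences in memory, so input_sentence_size is what actually bounds RAM. The shard names below are placeholders.

```python
import io
import sentencepiece as spm

def iter_lines(paths):
    """Yield sentences lazily so the shards are never concatenated or held in a list."""
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield line.rstrip("\n")

model = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=iter_lines(["shard-000.txt", "shard-001.txt"]),  # placeholders
    model_writer=model,                # trained model bytes are written here
    vocab_size=32000,
    input_sentence_size=5_000_000,     # cap how many sentences the trainer keeps
    shuffle_input_sentence=True,
)
with open("tok32k.model", "wb") as f:
    f.write(model.getvalue())
```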