sarves opened this issue 3 years ago
There's no way to finetune a tokenizer model. You can rebuild it by adding the --force flag to the training script or by deleting the old model.
If you have more training data to add to the dataset, I suggest starting over with a combined dataset anyway. The existing Tamil dataset is quite small!
We did not use any alternate parameters in training the current Stanza models for Tamil.
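For example, a rebuild from scratch could look something like the following; the module path is an assumption about the current training scripts, and ta_ttb is just the Tamil treebank used as an illustration:

```bash
# Hypothetical invocation: retrain the Tamil tokenizer, overwriting the old
# model via the --force flag mentioned above. Adjust the module path and
# treebank code to your setup.
python -m stanza.utils.training.run_tokenizer ta_ttb --force
```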
Is there any outlook for adding the option to resume training an existing model for all pipeline modules? It would be great to be able to pick up where we left off.
One issue is that there's normally state for the optimizer as well, and optimization will in theory be worse if you save the model, come back later, and resume training without that optimizer state. Having said that, we did accept a pull request recently for restarting training for the NER models. Haven't really considered adding the same thing for other models ourselves, though.
Could this be accomplished with state_dict? It seems like it's able to store the optimizer state along with any other desired attributes. (Saving & Loading a General Checkpoint for Inference and/or Resuming Training)
Yes, it certainly can. For most models, the saving and loading of state_dict is in {model}/trainer.py. The one thing we'd want to consider is whether or not it makes sense to keep the optimizer state in the saved files all the time. My guess is we'd want to make it an option so that in general the model files don't get bloated. If you want to make a pull request with such a feature, including an option to turn it off (or evidence that it doesn't really increase the size of the model files), we will be happy to merge it.
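For reference, the general checkpoint pattern from that PyTorch recipe looks roughly like this; it's a minimal sketch of the idea, not Stanza's actual trainer code:

```python
import torch

# Minimal sketch of the "general checkpoint" pattern: store the optimizer
# state_dict next to the model weights so training can resume later.
def save_checkpoint(path, model, optimizer, epoch):
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    }, path)

def load_checkpoint(path, model, optimizer=None):
    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint["model_state_dict"])
    # Only restore the optimizer when resuming training; skip it for inference
    if optimizer is not None:
        optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["epoch"]
```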
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hello,
I was wondering if there is any update on that issue... Is there a way that I can fine-tune an existing model? As I see, it is already implemented for the NER task only.
Is this specific to Tamil, or a general question about retraining the tokenizers?
It takes less than 20 minutes to retrain a tokenizer on a 2080ti, so we are unlikely to put in the effort for adding checkpointing specifically for that model. If there's a specific one you want to retrain, we can help make sure you have the right data for the task.
At some point, checkpoints might be available for all of the models if we integrate Lightning into our models, but there's no timeline for doing that.
So, Stanza doesn't currently support fine-tuning for the Tokenize, POS, Lemma and Depparse processors... As I am not an expert, should I try to modify the code in a similar way to what you did for the NER fine-tuning?
There are considerations with each of those:
Tokenize: still not worth the effort
MWT (perhaps the language you care about doesn't have MWT): need to consider adding values to the deterministic dictionary
POS: there is a cutoff in terms of how many times it needs to see a word before adding it to the fine tuned word embedding (a different use of "fine tuned"). So if the problem here is that it's not recognizing a rare word, you would need to adjust the fine tuned word embedding to include that new word (a hypothetical sketch of this cutoff follows at the end of this comment)
Lemmatizer: same thing as MWT, it has a dictionary
Depparse: same as POS, there's a cutoff for number of times to see a word
Sentiment & conparse: similar dictionary & word embedding issue as for POS & depparse, but no cutoff
Also, for each of those, ideally the original pytorch optimizer would have been saved with the models. That would greatly increase the size of the models, and shipping two versions of the models so people can fine tune them seems like a lot of work for very little benefit.
Actually, when you get down to it, the NER change to add fine-tuning ignored all of those issues and just restarted training from the current weights with a new optimizer. I kinda suspect it's not that useful because of that. For example, if you fine-tune the NER model on a bunch of person names it isn't recognizing, it will forget how to identify ORG and LOC in the process.
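To make the POS/depparse cutoff concrete, here is a purely hypothetical sketch; the function and the default cutoff value are illustrative, not Stanza's actual code:

```python
from collections import Counter

def expanded_embedding_vocab(train_words, base_vocab, cutoff=5):
    # Only words seen at least `cutoff` times in the new data get added to the
    # trainable ("fine tuned") embedding vocabulary; rarer words fall back to
    # the frozen pretrained vectors or UNK.
    counts = Counter(train_words)
    extra = [w for w, c in counts.items() if c >= cutoff and w not in base_vocab]
    return list(base_vocab) + extra
```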
FWIW, the models which take the longest to train - charlm, sentiment, and conparse - now have checkpoints automatically saved & reloaded as of v1.4.1
Hello,
I was wondering if there is any update on that issue... Is there a way that I can fine-tune an existing model? As I see, it is already implemented for the NER task only.
Where do you see fine-tuning implemented for the NER task?
Are you looking for the ability to continue training after training has been interrupted, or are you looking to teach an existing model a few new things? The second ability currently exists for NER. The checkpoint functionality I described for some of the other models doesn't exist, and probably won't be added in the next few weeks, but I can make it more of a priority if people need it (although I still believe the model can simply be retrained from scratch in a relatively short amount of time)
Hi,
My requirement is to detect street names using an NER model (for German). So I want to fine-tune the existing NER model with my custom street training data. Please let me know if I can teach the existing NER model this additional street detection task.
That's kind of what I'm talking about: you're not going to successfully fine-tune our existing NER model for that task no matter what you do (*). If you have street name training data, just use that to train a new German model instead.
(*) The caveat is that I could certainly see adding a new class and freezing the layers other than the output layer for that one class producing a model that does ... something, but my intuition tells me it would not be as good as starting from scratch anyway. Obviously someone can come along and prove me wrong, but it'd have to be someone other than myself, because I genuinely don't have the capacity for this experiment right now.
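For anyone who does want to run that experiment, the freezing part would look roughly like this generic PyTorch sketch; the output layer name here is hypothetical, not Stanza's actual attribute:

```python
import torch

def freeze_all_but_output(model: torch.nn.Module, output_prefix: str = "output_layer"):
    # Freeze every parameter except those under the (hypothetical) output
    # layer, so only the new class's weights get updated during fine-tuning.
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(output_prefix)
    # Hand the optimizer only the parameters that are still trainable
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=1e-4)
```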
Thank you very much for your prompt response. Let me check if I can help here by doing some experimentation.
One more point I wanted to check: I have seen the transformer-related code at https://github.com/stanfordnlp/stanza/tree/main/stanza/models/ner
Can this be used for NER model training? If yes, the documentation seems to be missing; it would be really helpful to have documentation covering things like the expected input data format.
There are a couple of documents on training NER models:
https://stanfordnlp.github.io/stanza/new_language_ner.html
https://stanfordnlp.github.io/stanza/retrain_ner.html
Training a new model on an existing language will be closer to the "new_language_ner" page than the retraining page, but you won't need to find new word vectors or build the charlm. Feel free to ask if you have any questions, but I do suggest starting a new issue so we can keep this issue focused on checkpoint improvements.
Hi, just to be clear: I want to fine-tune and train the i2b2 model with similar datasets to improve its capabilities, especially for extraction of the problem tag, and then do the same for other languages. Could you show me the way to do this? Best
There's certainly no way to extend the English i2b2 model to another language.
Frankly the best way to do this would be to start from the i2b2 dataset (you can start your search here https://www.i2b2.org/NLP/DataSets/) and either 1) retrain with a more effective base embedding, probably a transformer, possibly a transformer specifically built for the field and/or 2) augment the dataset with more annotations that cover the cases you are concerned about.
There is a flag in ner_tagger.py which lets you start training from an existing model (--finetune) but unless your finetuning data includes enough data for each of the existing classes, you'll just wipe out the model's knowledge of those classes by doing this, so you're going to need to track down the i2b2 dataset to make any actual progress on this problem.
Thanks for the quick reply and info. Let me explain myself in detail. My first aim is to extract the symptoms/problems from text. We prepared datasets to fit the training requirements of Stanza models (train, dev, test, BIO-tagged in JSON format). Now we want to retrain the Stanza biomedical NER models (or transformers) with those datasets to increase the extraction accuracy. In the second step of our project, we will need different languages, so we need the datasets to be translated into those languages and then to retrain the existing models (Stanza biomedical NER models or transformers) again to extract the symptoms/problems from the different languages.
I prepared those flags for ner_tagger:
python3 -m stanza.models.ner_tagger --wordvec_dir /....../word2vec --train_file /........train.json --eval_file /.....dev.json --eval_output_file /......./evaluation_results --mode train --finetune --finetune_load_name i2b2 --train_classifier_only --shorthand en_medicalner --scheme bio --train_scheme bio --gradient_checkpointing
But it could not find the i2b2 model. Are those flags correct, or how can I correct them? Is this possible at all, or should I use a transformer model to retrain instead of the Stanza biomedical NER model?
Best
Oh, if what you needed was the i2b2 model itself, you can do
stanza.download("en", processors="ner", package="i2b2")
It should download both the NER model and its pretrain & charlms
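For running the model (rather than fine-tuning it), the biomedical documentation loads the clinical pipeline with the i2b2 NER roughly as follows; this pulls the mimic tokenizer as well, so it differs slightly from the download call above, and package names should be double-checked against your Stanza version:

```python
import stanza

# Download the clinical (mimic) pipeline with the i2b2 NER model, then load it.
# The dict form of `processors` selects the i2b2 package for the ner processor.
stanza.download("en", package="mimic", processors={"ner": "i2b2"})
nlp = stanza.Pipeline("en", package="mimic", processors={"ner": "i2b2"})
doc = nlp("The patient reported chest pain and shortness of breath.")
print(doc.ents)
```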
Thanks, I downloaded it with that code, but ner_tagger still gives an error: FileNotFoundError: [Errno 2] No such file or directory: 'i2b2'. What should I do to use this model to fine-tune it with my dataset via ner_tagger?
> I downloaded it with that code, but ner_tagger still gives an error: FileNotFoundError: [Errno 2] No such file or directory: 'i2b2'.
What did you do to run ner_tagger?
You want to give it the path where the NER model was downloaded, possibly ~/stanza_resources/en/ner/i2b2.pt, possibly somewhere else if you're running on Windows or if you've changed the resources download directory.
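So the earlier command would become something like this, with the bare name replaced by the full path (the other paths remain placeholders):

```bash
# Point --finetune_load_name at the downloaded model file instead of the bare
# name "i2b2"; adjust the path if your stanza_resources directory is elsewhere.
python3 -m stanza.models.ner_tagger \
    --mode train --finetune \
    --finetune_load_name ~/stanza_resources/en/ner/i2b2.pt \
    --wordvec_dir /path/to/word2vec \
    --train_file /path/to/train.json \
    --eval_file /path/to/dev.json \
    --train_classifier_only \
    --shorthand en_medicalner
```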
Thanks, I found it.
Hi
1. Is there a way that I can fine-tune an existing model?
For instance, there is already a model for Tamil (ttb.pt) in Stanza. Can I train or fine-tune that with more Tamil data, instead of training from scratch? Currently (with Stanza 1.2), if a model.pt exists in the saved_model directory, the training process is skipped.
2. Where can I find the values of parameters like dropout, --batch_size, etc. used to train the current Stanza models for Tamil? (Are they the same as the default values provided in the respective .py files, for instance https://github.com/stanfordnlp/stanza/blob/main/stanza/models/tokenizer.py?)
Thank you