Integrating PEFT into several different annotators
We integrate PEFT into our training pipeline for several different models. This greatly reduces the size of models with finetuned transformers, letting us make the finetuned versions of those models the default_accurate model.
The biggest gains observed are with the constituency parser and the sentiment classifier.
Previously, the default_accurate package used transformers where the head was trained but the transformer itself was not finetuned.
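To give a rough sense of why PEFT shrinks the saved models, the sketch below counts the weights a LoRA-style adapter adds to one linear layer and compares that with the layer itself. LoRA is assumed here as the PEFT method, and the hidden size of 768 and rank of 16 are illustrative values rather than settings taken from this release; `full_params` and `lora_params` are hypothetical helper names.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Weights in a dense d_in x d_out projection (bias ignored)."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights in a LoRA adapter: A is (d_in x rank), B is (rank x d_out)."""
    return rank * (d_in + d_out)


# Illustrative values only (assumption, not taken from the release notes).
hidden, rank = 768, 16
full = full_params(hidden, hidden)
adapter = lora_params(hidden, hidden, rank)
print(f"full layer: {full} weights")
print(f"LoRA adapter: {adapter} weights ({adapter / full:.1%} of the layer)")
```

Under these assumptions the adapter is a small fraction of the layer it modifies, so a model that stores only adapter weights and reuses a shared base transformer stays much smaller than one that stores a fully finetuned copy.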
Model improvements
POS tagger trained with a split optimizer for the transformer and non-transformer parameters. Unfortunately, we did not find settings which consistently improved results. stanfordnlp/stanza#1320
Sentiment trained with PEFT on the transformer: noticeably improves results for each model. SST scores go from 68 F1 with charlm, to 70 F1 with a transformer, to 74-75 F1 with a finetuned or PEFT-finetuned transformer. stanfordnlp/stanza#1335
NER also trained with PEFT: unfortunately, no consistent improvements to scores. stanfordnlp/stanza#1336
Dynamic oracle for the top-down constituency parser scheme: noticeable improvement in the scores for the top-down parser. stanfordnlp/stanza#1341
Constituency parser uses PEFT: this produces significant improvements, close to the full benefit of finetuning the entire transformer when training constituencies. Example improvement: 87.01 to 88.11 on the ID_ICON dataset. stanfordnlp/stanza#1347
Scripts to build a silver dataset for the constituency parser with filtering of sentences based on model agreement among the sub-models for the ensembles used. Preliminary work indicates an improvement in the benefits of the silver trees, with more work needed to find the optimal parameters used to build the silver dataset. stanfordnlp/stanza#1348
Lemmatizer ignores goeswith words when training: eliminates words which are a single word, labeled with a single lemma, but split into two words in the UD training data. Typical example would be split email addresses in the EWT training set. stanfordnlp/stanza#1346 stanfordnlp/stanza#1345
Top-down constituency models were broken for datasets which did not use ROOT as the top-level bracket. In practice, this was only DA_Arboretum. stanfordnlp/stanza#1354
V1 of chopping up some longer texts into shorter texts for the transformers to get around length limits. No idea if this actually produces reasonable results for words after the token limit. stanfordnlp/stanza#1350 stanfordnlp/stanza#1294
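The agreement filter for the silver constituency trees mentioned above can be pictured as keeping a sentence only when every sub-model of the ensemble predicts the same tree. This is a sketch under that assumption, since the notes do not state the exact agreement criterion; `filter_by_agreement` is a hypothetical helper and the bracketed trees are toy strings.

```python
from typing import Dict, List


def filter_by_agreement(predictions: Dict[str, List[str]]) -> Dict[str, str]:
    """Keep sentences for which all sub-model parses are identical."""
    silver = {}
    for sentence, trees in predictions.items():
        if trees and all(tree == trees[0] for tree in trees):
            silver[sentence] = trees[0]
    return silver


# Toy predictions from three hypothetical sub-models per sentence.
predictions = {
    "Dogs bark": ["(ROOT (S (NP Dogs) (VP bark)))"] * 3,
    "Time flies": [
        "(ROOT (S (NP Time) (VP flies)))",
        "(ROOT (S (NP Time) (VP flies)))",
        "(ROOT (NP (NN Time) (NNS flies)))",
    ],
}
silver = filter_by_agreement(predictions)
print(silver)
```

A looser criterion, such as requiring only a majority of sub-models to agree, would trade silver data quality for quantity; which setting works best is the open question the notes describe.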
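The goeswith filtering for the lemmatizer described above can be sketched on CoNLL-U style rows. This is a minimal illustration, assuming that both the word carrying the `goeswith` relation and the word it attaches to are dropped from the lemmatizer's training pairs; the `Word` tuple and `lemma_training_pairs` are hypothetical names rather than Stanza's API, and the sample sentence is made up.

```python
from typing import List, NamedTuple, Tuple


class Word(NamedTuple):
    idx: int      # 1-based word index within the sentence
    text: str
    lemma: str
    head: int     # index of the syntactic head, 0 for root
    deprel: str


def lemma_training_pairs(sentence: List[Word]) -> List[Tuple[str, str]]:
    """Return (text, lemma) pairs, skipping words involved in a goeswith relation."""
    skipped = set()
    for word in sentence:
        if word.deprel == "goeswith":
            skipped.add(word.idx)   # the trailing fragment
            skipped.add(word.head)  # the fragment it attaches to
    return [(w.text, w.lemma) for w in sentence if w.idx not in skipped]


# Made-up example: an address split into two words in the annotation.
sentence = [
    Word(1, "Mail", "mail", 0, "root"),
    Word(2, "jane@", "jane@example.com", 1, "obj"),
    Word(3, "example.com", "_", 2, "goeswith"),
]
pairs = lemma_training_pairs(sentence)
print(pairs)
```

Dropping both fragments keeps the lemmatizer from learning a mapping such as a partial address to the full address, which would not generalize to unsplit text.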
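One way to picture the chopping of long inputs described above is a sliding window over the token sequence. The sketch below illustrates the general technique and is not Stanza's implementation; `chunk_tokens`, `max_len`, and `overlap` are hypothetical names, and the window sizes are arbitrary.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], max_len: int, overlap: int = 0) -> List[List[str]]:
    """Split tokens into windows of at most max_len, sharing `overlap` tokens."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be in [0, max_len)")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


words = [f"w{i}" for i in range(10)]
pieces = chunk_tokens(words, max_len=4, overlap=1)
for piece in pieces:
    print(piece)
```

The overlap gives words near a window boundary some left context in the following window. How the per-window outputs are merged back together, and whether words past the original limit get useful representations, is the part the notes flag as untested.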
Commits
c2d72bd Will update Stanza version to quickly fix a few bugs
13ee3d5 Use a get() to avoid crashing if an older model with no bert_funetune set is ...
6e2520f Add some debug logging when building a retag_pipeline - goal is to make sure ...
44058a0 Allow for an individual pipeline to override which device it is placed on. F...
17eb6fc Quieter logging when building a peft wrapper
e89a7d4 Fix TOP_DOWN parser for da_arboretum, which needs to look at the actual root ...
5f18a61 Add the ability to process a few languages with prepare_resources.py
363bbec Update sentmient to also have charlm & transformer versions. The transformer...
4a7052b Minor logging / typo improvements to conparser
3e21404 Initial attempt to chop up long inputs to a transformer into pieces that the ...
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.
Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
Bumps stanza from 1.7.0 to 1.8.1.
Release notes
Sourced from stanza's releases.
... (truncated)