Bump stanza from 1.7.0 to 1.8.1

Bumps stanza from 1.7.0 to 1.8.1.

Release notes

PEFT Integration (with bugfixes)

Integrating PEFT into several different annotators

We integrate PEFT into our training pipeline for several different models. This greatly reduces the size of models with finetuned transformers, letting us make the finetuned versions of those models the default_accurate model.

The biggest gains observed are with the constituency parser and the sentiment classifier.

Previously, the default_accurate package used transformers where the head was trained but the transformer itself was not finetuned.
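As a rough illustration of how a downstream project might opt into these models after the bump, the sketch below builds a pipeline with the `default_accurate` package. The helper name `build_accurate_pipeline` is hypothetical, and the import is done lazily only so the snippet loads without Stanza installed.

```python
def build_accurate_pipeline(lang: str = "en"):
    """Return a Stanza pipeline that uses the default_accurate package.

    Hypothetical helper for illustration; it is not part of Stanza.
    """
    # Lazy import: the module can be loaded even where Stanza is absent.
    import stanza

    # The first call may download the models, which needs network access.
    return stanza.Pipeline(lang, package="default_accurate")
```

Calling `build_accurate_pipeline("en")("Some text.")` would then return an annotated document, assuming the models for that language are available.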

Model improvements

POS trained with a split optimizer for the transformer and non-transformer parts. Unfortunately, we did not find settings which consistently improved results. stanfordnlp/stanza#1320

Sentiment trained with PEFT on the transformer: noticeably improves results for each model. SST scores go from 68 F1 with the charlm, to 70 F1 with a transformer, to 74-75 F1 with a finetuned or PEFT-finetuned transformer. stanfordnlp/stanza#1335

NER also trained with peft: unfortunately, no consistent improvements to scores stanfordnlp/stanza#1336

depparse includes peft: no consistent improvements yet stanfordnlp/stanza#1337 stanfordnlp/stanza#1344

Dynamic oracle for top-down constituent parser scheme. Noticeable improvement in the scores for the topdown parser stanfordnlp/stanza#1341

Constituency parser uses peft: this produces significant improvements, close to the full benefit of finetuning the entire transformer when training constituencies. Example improvement, 87.01 to 88.11 on ID_ICON dataset. stanfordnlp/stanza#1347

Scripts to build a silver dataset for the constituency parser with filtering of sentences based on model agreement among the sub-models for the ensembles used. Preliminary work indicates an improvement in the benefits of the silver trees, with more work needed to find the optimal parameters used to build the silver dataset. stanfordnlp/stanza#1348
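The agreement filter can be pictured roughly as follows. This is only a sketch under a strong assumption: a sentence is kept when every sub-model returns an identical tree. The release notes do not state the actual criterion or parameters, and `filter_by_agreement` and the toy parsers are hypothetical names.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def filter_by_agreement(
    sentences: Iterable[str],
    parsers: Sequence[Callable[[str], str]],
) -> List[Tuple[str, str]]:
    """Keep (sentence, tree) pairs on which all sub-models agree.

    Assumption: full agreement is required; the real scripts may use a
    different threshold.
    """
    if not parsers:
        raise ValueError("at least one parser is required")
    kept = []
    for sentence in sentences:
        trees = [parse(sentence) for parse in parsers]
        # Agreement here means every tree equals the first one.
        if all(tree == trees[0] for tree in trees):
            kept.append((sentence, trees[0]))
    return kept


# Toy stand-ins for ensemble sub-models (not real parsers).
def parser_a(sentence: str) -> str:
    return f"(ROOT {sentence})"


def parser_b(sentence: str) -> str:
    # Disagrees with parser_a on sentences longer than ten characters.
    return f"(ROOT {sentence})" if len(sentence) <= 10 else f"(S {sentence})"
```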

Lemmatizer ignores goeswith words when training: this eliminates tokens which are really a single word with a single lemma, but which are split into two words in the UD training data. A typical example is a split email address in the EWT training set. stanfordnlp/stanza#1346 stanfordnlp/stanza#1345
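To make the goeswith case concrete, here is a minimal sketch of dropping such tokens when collecting (form, lemma) training pairs from CoNLL-U text. It assumes that both the `goeswith` continuation and the piece it attaches to are skipped; the linked PRs define the real behavior, and `lemma_training_pairs` is a hypothetical name.

```python
from typing import List, Tuple

# CoNLL-U column positions used below.
ID, FORM, LEMMA, HEAD, DEPREL = 0, 1, 2, 6, 7


def lemma_training_pairs(conllu_sentence: str) -> List[Tuple[str, str]]:
    """Collect (form, lemma) pairs, skipping words joined by goeswith."""
    rows = [
        line.split("\t")
        for line in conllu_sentence.splitlines()
        if line and not line.startswith("#")
    ]
    # Ignore multiword token ranges ("1-2") and empty nodes ("1.1").
    rows = [row for row in rows if row[ID].isdigit()]

    skipped = set()
    for row in rows:
        if row[DEPREL] == "goeswith":
            skipped.add(row[ID])    # the continuation piece
            skipped.add(row[HEAD])  # the piece it attaches to
    return [(row[FORM], row[LEMMA]) for row in rows if row[ID] not in skipped]
```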

Features

Include SpacesAfter annotations on words in the CoNLL output of documents: stanfordnlp/stanza#1315 stanfordnlp/stanza#1322
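For readers consuming the CoNLL output, the sketch below shows one way the whitespace could be recovered from the MISC column. It assumes UDPipe-style escaping of the `SpacesAfter` value (`\s`, `\t`, `\n`, `\r`, `\p`, `\\`) and a single space as the default; whether Stanza escapes values exactly this way is an assumption, and all function names are hypothetical.

```python
from typing import Iterable, Tuple

# Assumed escape table (UDPipe-style); not confirmed by the release notes.
_ESCAPES = {"s": " ", "t": "\t", "n": "\n", "r": "\r", "p": "|", "\\": "\\"}


def unescape_spaces(value: str) -> str:
    """Turn an escaped SpacesAfter value into literal whitespace."""
    out = []
    i = 0
    while i < len(value):
        if value[i] == "\\" and i + 1 < len(value) and value[i + 1] in _ESCAPES:
            out.append(_ESCAPES[value[i + 1]])
            i += 2
        else:
            out.append(value[i])
            i += 1
    return "".join(out)


def spaces_after(misc: str) -> str:
    """Return the whitespace that follows a word, given its MISC field."""
    if misc in ("", "_"):
        return " "
    fields = dict(item.split("=", 1) for item in misc.split("|") if "=" in item)
    if "SpacesAfter" in fields:
        return unescape_spaces(fields["SpacesAfter"])
    if fields.get("SpaceAfter") == "No":
        return ""
    return " "


def detokenize(words: Iterable[Tuple[str, str]]) -> str:
    """Rebuild text from (form, misc) pairs."""
    return "".join(form + spaces_after(misc) for form, misc in words)
```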

Lemmatizer operates in caseless mode if all of its training data was caseless. Most relevant to the UD Latin treebanks. stanfordnlp/stanza#1331 stanfordnlp/stanza#1330
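The caseless check can be sketched as below. This assumes "caseless" means every training word equals its lowercased form, which may differ from the test the lemmatizer actually uses; both function names are hypothetical.

```python
from typing import Iterable


def is_caseless(training_words: Iterable[str]) -> bool:
    """True if there is training data and none of it contains uppercase."""
    words = list(training_words)
    return bool(words) and all(word == word.lower() for word in words)


def normalize_for_lemmatizer(word: str, caseless: bool) -> str:
    """Lowercase the input only when the model was trained caseless."""
    return word.lower() if caseless else word
```

With a treebank such as `["gallia", "est", "omnis"]`, `is_caseless` returns `True`, so an input like `"Gallia"` would be looked up as `"gallia"`.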

wandb support for coref stanfordnlp/stanza#1338

Coref annotator breaks length ties using POS if available stanfordnlp/stanza#1326 https://github.com/stanfordnlp/stanza/commit/c4c3de5803f27843a5050e10ccae71b3fd9c45e9

Bugfixes

Using a proxy with download_resources_json was broken: stanfordnlp/stanza#1318 stanfordnlp/stanza#1317 Thank you @ider-zh

Fix deprecation warnings for escape sequences: stanfordnlp/stanza#1321 stanfordnlp/stanza#1293 Thank you @sterliakov
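For context, this class of warning typically comes from backslashes in ordinary string literals, which Python reports as invalid escape sequences. The generic sketch below is not taken from the Stanza patch; it shows the usual remedy of a raw string, with `find_numbers` as a hypothetical helper.

```python
import re

# A plain literal such as "\d+" contains the invalid escape "\d" and
# triggers a warning at compile time. A raw string avoids it:
NUMBER_PATTERN = re.compile(r"\d+")


def find_numbers(text: str) -> list:
    """Return every run of digits in the text."""
    return NUMBER_PATTERN.findall(text)
```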

Coref training rounding error stanfordnlp/stanza#1342

Top-down constituency models were broken for datasets which did not use ROOT as the top-level bracket. In practice, this was only DA_Arboretum. stanfordnlp/stanza#1354

First version of chopping longer texts into shorter pieces so that the transformers can get around their length limits. It is not yet known whether this produces reasonable results for words after the token limit. stanfordnlp/stanza#1350 stanfordnlp/stanza#1294
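The general idea of chopping can be sketched as splitting a token sequence into consecutive windows no longer than the model's limit. Non-overlapping fixed-size windows are an assumption made here for illustration; the linked PR may chop differently, and `chop_long_input` is a hypothetical name.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chop_long_input(tokens: Sequence[T], max_len: int) -> List[List[T]]:
    """Split tokens into consecutive pieces of at most max_len items."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [
        list(tokens[start:start + max_len])
        for start in range(0, len(tokens), max_len)
    ]
```

Each piece can then be passed to the transformer separately and the per-token outputs concatenated in order, although without overlap each piece loses the context on the other side of its boundary.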

Coref prediction off-by-one error for short sentences, was falsely throwing an exception at sentence breaks: stanfordnlp/stanza#1333 stanfordnlp/stanza#1339 https://github.com/stanfordnlp/stanza/commit/f1fbaaad983e58dc3fcf318200d685663fb90737

Clarify error when a language is only partially handled: https://github.com/stanfordnlp/stanza/commit/da01644b4ba5ba477c36e5d2736012b81bcd00d4 stanfordnlp/stanza#1310

Additional 1.8.1 Bugfixes

Older POS models not loaded correctly... need to use .get() https://github.com/stanfordnlp/stanza/commit/13ee3d5cbc2c9174c3e0c67ca75b580e4fe683b1 stanfordnlp/stanza#1357
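The pattern behind this fix is the usual one for reading a setting that older saved models never stored. In the sketch below, the key name `bert_finetune`, the default of `False`, and the function name are assumptions inferred from the commit summary, not confirmed details of the patch.

```python
def wants_bert_finetune(saved_args: dict) -> bool:
    """Read a setting that older saved models may not contain."""
    # saved_args["bert_finetune"] would raise KeyError for an older model;
    # .get() falls back to a default instead.
    return bool(saved_args.get("bert_finetune", False))
```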

Debug logging for the Constituency retag pipeline to better support someone working on Icelandic https://github.com/stanfordnlp/stanza/commit/6e2520f24d63fa8af4136f10137e57b195fda20a stanfordnlp/stanza#1356

device arg in MultilingualPipeline would crash if device was passed for an individual Pipeline: https://github.com/stanfordnlp/stanza/commit/44058a0ec296c6da5997bfaf8911a26d425d2cec

Commits

c2d72bd Will update Stanza version to quickly fix a few bugs
13ee3d5 Use a get() to avoid crashing if an older model with no bert_finetune set is ...
6e2520f Add some debug logging when building a retag_pipeline - goal is to make sure ...
44058a0 Allow for an individual pipeline to override which device it is placed on. F...
17eb6fc Quieter logging when building a peft wrapper
e89a7d4 Fix TOP_DOWN parser for da_arboretum, which needs to look at the actual root ...
5f18a61 Add the ability to process a few languages with prepare_resources.py
363bbec Update sentiment to also have charlm & transformer versions. The transformer...
4a7052b Minor logging / typo improvements to conparser
3e21404 Initial attempt to chop up long inputs to a transformer into pieces that the ...
Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

nlpie / biomedicus

Bump stanza from 1.7.0 to 1.8.1 #305
