stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

1.4.0 is buggy on some dependency parsing tasks that 1.3.0 handles correctly #1160

Open apsyio opened 1 year ago

apsyio commented 1 year ago

I am using the dependency parser and noticed 1.4.0 has bugs that do not exist in 1.3.0. Here is an example:

If B is true and if C is false, perform D; else, perform E and perform F

In 1.3.0, 'else' is correctly attached as a child of the 'perform' that follows it; in 1.4.0, however, it is attached to the 'perform' that precedes it.

How can I force Stanza to load 1.3.0 instead of the latest version, so I can move forward with what I am doing now?

AngledLuffa commented 1 year ago

Technically you can just install an earlier version of Stanza; I'm not sure there's another great way to fix this. There are a couple of instances of "or else" in the EWT training data, in which the "or else" has a head later in the sentence, but every other occurrence of "else" is like "everything else", where "else" depends on the previous word. You could suggest a couple of sentences with "else" used in a different way, and we can add those to the training data.
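For example, pinning the release with pip (the matching 1.3.0 models will be fetched on first use):

```shell
# Pin Stanza to the 1.3.0 release
pip install stanza==1.3.0

# Confirm the installed version
python -c "import stanza; print(stanza.__version__)"
```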

AngledLuffa commented 1 year ago

I tried updating the dependency parser to use the pretrained character model rather than training its own, as previous versions do. While that improved LAS from 87.8 to 88.5, it didn't help with this particular sentence. If you'd be interested in brainstorming a couple of other examples of "else" used in this context, instead of the more common "anyone else", "somewhere else", etc., we can add those to the supplemental training data and build a new model next week.

AngledLuffa commented 11 months ago

I trained a model using electra-large as the input embedding, and it gets 91.95 on the EWT test set, a significant improvement! It also gets this particular example correct. It's not the default model, because transformer-based models are a lot more expensive to run in general, but you can easily load it with the package parameter when creating a Pipeline.
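A sketch of selecting a non-default depparse model via the package argument. Note that "ewt_electra-large" below is a placeholder, not the actual package identifier; check the Stanza model listing for the real name of the electra-large package:

```python
import stanza

# "ewt_electra-large" is a hypothetical package name used for illustration --
# look up the actual identifier in the Stanza model listing before running.
nlp = stanza.Pipeline(
    "en",
    processors="tokenize,pos,lemma,depparse",
    package={"depparse": "ewt_electra-large"},  # per-processor package selection
)

doc = nlp("If B is true and if C is false, perform D; else, perform E and perform F")
for word in doc.sentences[0].words:
    print(word.text, word.head, word.deprel)
```

The package argument also accepts a single string to switch every processor to the same package, rather than a per-processor dict as above.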