stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

[QUESTION] Is constituency available in Stanza German Model? #1368

Closed · GeorgeS2019 closed this issue 5 months ago

GeorgeS2019 commented 6 months ago

Not sure where to look for this information: which processors are available for the German model?

https://stanfordnlp.github.io/stanza/pipeline.html
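(A minimal sketch of one way to check programmatically, assuming the default German package has been downloaded; relying on the pipeline keeping its loaded processors in a dict:)

```python
import stanza

# Download the default German package and build a pipeline with all
# default processors for the language (assumption: the default package
# reflects what is distributed for German).
stanza.download("de")
nlp = stanza.Pipeline("de")

# The pipeline keeps its loaded processors in a dict, so printing the
# keys shows which processors the German package actually provides.
print(list(nlp.processors.keys()))
```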

AngledLuffa commented 6 months ago

The issue here is one of licensing. There is a large dataset available at Tübingen, but they are hesitant to license it in a situation that may result in commercial use. I'll ask again nicely, the thought being that it won't be possible to reconstruct the original dataset from our model. Another option is the Negra treebank, although that is significantly smaller. I will consult with my PI, and hopefully we can put something together by the end of next week.


AngledLuffa commented 6 months ago

After looking through the options available to us, I think the SPMRL incarnation of the TIGER treebank might be the best option. It has 38600 training trees (after removing duplicates) and has been used elsewhere.

Do you want a model with or without transformer? With transformer will be much more accurate, but will take a lot more memory as well.
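For later readers: once both variants are published, picking one should look roughly like the sketch below. The `default_accurate` package name and the processor list are assumptions based on how other languages with transformer models are packaged in recent Stanza releases.

```python
import stanza

# Default (non-transformer) German pipeline: smaller and faster, but less
# accurate. Assumption: the German constituency processor depends on
# tokenize, mwt, and pos.
nlp_default = stanza.Pipeline("de", processors="tokenize,mwt,pos,constituency")

# Transformer-backed pipeline: noticeably more accurate, but uses much more
# memory. Assumption: the transformer variant is exposed through the
# "default_accurate" package, as it is for other languages.
nlp_accurate = stanza.Pipeline(
    "de",
    processors="tokenize,mwt,pos,constituency",
    package="default_accurate",
)
```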

GeorgeS2019 commented 6 months ago

@AngledLuffa I think a data-driven transformer with better accuracy is more relevant.

A more accurate tree gives a better chance to evaluate and benchmark how well the parser handles complex German sentences, which are very typical in day-to-day usage.

A more accurate tree will also help downstream inference of other features of the sentence structure.

AngledLuffa commented 6 months ago

I updated the models to include a German model trained on the SPMRL TIGER treebank using Electra. It gets 94.08 on the test set, but there's a little grade inflation, so to speak, if you compare it to certain other popular constituency parsers, since there is an extra node in the top layer of most of the test trees.

https://huggingface.co/german-nlp-group/electra-base-german-uncased

It's compatible with the existing release, but I suggest installing the dev branch to at least save time when loading the entire pipeline: the current release would load three copies of the transformer, whereas the dev branch only loads it from disk once and then copies it on the GPU. The plan is to make a new release soon, once we also make it keep only one copy of the transformer on the GPU at a time.

GeorgeS2019 commented 6 months ago

@AngledLuffa Still new to Stanza, so I'm not sure: do I need to work on the dev branch and fetch the model from Hugging Face myself, is that right?

Or would the dev branch automatically download the right compatible models?

AngledLuffa commented 6 months ago

The current release will download the models as well; however, the dev branch will load them faster. You can look up installing Python modules from git repos, or I'll put something together after making a little more progress on the next release.
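For reference, a typical way to install a Python package straight from a git branch (assuming the branch in question is named dev, as referenced above):

```
pip install git+https://github.com/stanfordnlp/stanza.git@dev
```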

AngledLuffa commented 5 months ago

This is now part of the 1.8.2 release.
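For anyone finding this later, a minimal end-to-end sketch with the released version; the example sentence is just illustrative, and the processor list assumes the German constituency model depends on tokenize, mwt, and pos:

```python
import stanza

# Fetch the German models; as of 1.8.2 this should include the new
# constituency model described above.
stanza.download("de")

nlp = stanza.Pipeline("de", processors="tokenize,mwt,pos,constituency")
doc = nlp("Der schnelle braune Fuchs springt über den faulen Hund.")

# Each parsed sentence exposes its constituency tree on the .constituency field.
for sentence in doc.sentences:
    print(sentence.constituency)
```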

GeorgeS2019 commented 5 months ago

@AngledLuffa

Excellent. Great job!