stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/
Other
7.25k stars 888 forks source link

Add a constituency model for Icelandic #1389

Open ingunnjk opened 5 months ago

ingunnjk commented 5 months ago

BEFORE YOU START: please make sure your pull request is against the dev branch. We cannot accept pull requests against the main branch. See our contributing guide for details.

Description

I want to add an Icelandic constituency model for Icelandic. I have made changes to prepare_con_dataset.py and added the script convert_icepahc.py in stanza/utils/datasets/constituency. I have also added an Icelandic BERT model to stanza/resources/default_packages.py.

Fixes Issues

A list of issues/bugs with # references. (e.g., #123)

Unit test coverage

Are there unit tests in place to make sure your code is functioning correctly? (see here for a simple example)

Known breaking changes/behaviors

None.

AngledLuffa commented 5 months ago

Hey, thanks for sending this. Very helpful.

Are you familiar with squashing? If so, would you be up for squashing commit #1 with #2 and #3 with #4? If not, I can do it on my end.

In your change comments, you talk about how there is a version which is simplified already. Is it necessary to use that for the parser? If yes, did you use a script to do those simplifications?

ingunnjk commented 5 months ago

Hi!

I have now squashed the commits!

I tried using the original IcePaHC file for the parser but that didn't seem to work. I also removed some things that shouldn't be a part of the parsing like empty nodes (traces, comments etc.) and some of it was by code and some by hand so there is unfortunately not a script for that, I'm sorry.

Also one other thing, you added to the list of functional tags to cut off since is used as a separator in the treebank file that is used for the training. I would actually prefer if it is not cut off! I removed the ending of most constituents, like ADJPOC, but I would like to keep some of them that I think would be useful, for example IPMAT. I trained a model where was cut off and then also when it is not cut off and the difference in scoring is only like 0,25% (from 90,63% to 90,38%). If it would then be possible to change the to - in the output from the parser, that would be greatly appreciated (or change * to - beforehand and not cutting - off for the Icelandic parser?).

AngledLuffa commented 5 months ago

I also removed some things that shouldn't be a part of the parsing like empty nodes (traces, comments etc.) and some of it was by code and some by hand so there is unfortunately not a script for that, I'm sorry.

The tree reading code automatically removes empty nodes. Perhaps it would be possible to recreate the conversion by hand. I'm not too opposed to using editing files, but it does look like they update this corpus fairly often:

https://github.com/antonkarl/icecorpus

which suggests that being able to automatically convert their work to the internal Stanza format would be useful.

Also one other thing, you added * to the list of functional tags to cut off since * is used as a separator in the treebank file that is used for the training

I see how that could be useful, in that English also has some tags such as NP-TMP which could be kept instead of discarded. I'll have to think about how to do that. Probably pass in the dataset as a parameter to read_treebank

ingunnjk commented 5 months ago

Okay! So would I have to do (or rather update) the conversion script in order to make the Icelandic constituency model a part of Stanza or could that come as an update later on?

AngledLuffa commented 5 months ago

In my ideal world, it would show up all at once :) If that doesn't work with you, we can also start from your converted version of the treebank. If you can send us the partially finished script, I can also help as needed, since you're already donating a bunch of work to us.

AngledLuffa commented 4 months ago

Any room for me to help with the conversion script?

ingunnjk commented 4 months ago

Sorry for the late answer, I haven't been able to work on this for the past weeks. In the conversion script (since the tree reading code automatically removes empty nodes) it is mostly a matter of removing lemmas (so (NS-D löndum-land) becomes (NS-D löndum)) and combining word parts that start/end with $ (in most cases a noun and its determiner), for example should (NS-N tungur$-tunga) (D-N $nar-hinn) become (NS-N tungurnar) and (RP upp$-upp) (VBPI $heldur-halda) should become (VBPI uppheldur). I don't know exactly when I'll be able to continue working on this (hopefully next week) but if you have the time you could for sure help move it along.

AngledLuffa commented 4 months ago

It's a little surprising to me that we would want to combine nodes that have determiners with their nouns. Then again, I do see in the UD version that there's a token uppheldur and no situation where the two tokens are split in pieces.

Either way, these two operations should be pretty straightforward to do with the Tsurgeon interface

https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/tregex/tsurgeon/Tsurgeon.html

... yuck, it seems I haven't written up how to use tsurgeon anywhere. I'll do that today and write up a brief example of the lemma removal. The node combination might be a little trickier, but if you tell me which tag to choose when combining two tags, that should be possible as well

On Fri, May 17, 2024 at 1:33 AM Ingunn Jóhanna Kristjánsdóttir < @.***> wrote:

Sorry for the late answer, I haven't been able to work on this for the past weeks. In the conversion script (since the tree reading code automatically removes empty nodes) it is mostly a matter of removing lemmas (so (NS-D löndum-land) becomes (NS-D löndum)) and combining word parts that start/end with $ (in most cases a noun and its determiner), for example should (NS-N tungur$-tunga) (D-N $nar-hinn) become (NS-N tungurnar) and (RP upp$-upp) (VBPI $heldur-halda) should become (VBPI uppheldur). I don't know exactly when I'll be able to continue working on this (hopefully next week) but if you have the time you could for sure help move it along.

— Reply to this email directly, view it on GitHub https://github.com/stanfordnlp/stanza/pull/1389#issuecomment-2117030884, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2AYWJZBIAOOG6ZG3HNKITZCW6HLAVCNFSM6AAAAABGXGSW26VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMJXGAZTAOBYGQ . You are receiving this because you commented.Message ID: @.***>

AngledLuffa commented 4 months ago

I checked in a script which uses Tsurgeon to convert a single tree:

https://github.com/stanfordnlp/stanza/commit/6b468b96e321431fbe894c24b907a2f8af62dca5

If you have a list of cases where $ combines two words into one word, I can expand that to cover the rest of the treebank. Obviously it's easy enough to combine neighboring $ words, but I'm not sure which ones take precedence in cases other than the two you've mentioned so far.

I put a brief introduction using this example on the github.io

https://stanfordnlp.github.io/stanza/tsurgeon.html

It doesn't really say anything other than the example I just checked in, though.

PS sorry for reusing your script's name and therefore making it slightly more annoying to do this PR

AngledLuffa commented 3 months ago

Any thoughts on the tregex / tsurgeon solution for this issue?

AngledLuffa commented 2 months ago

@ingunnjk any thoughts on how best to proceed with the conversion of the Icelandic dataset? If you have a set of heuristics on how to combine the $ nodes, I can turn that into a sequence of Tsurgeon expressions which hopefully handles the conversion from the raw dataset