stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Feature request: Support for Swedish NER #912

Closed EmilStenstrom closed 2 years ago

EmilStenstrom commented 2 years ago

I was hoping that stanza would be willing to add support for NER in Swedish.

For this, you need a Swedish dataset in IOB2 format. I have a way to produce such a file from the SUC 3.0 corpus, using a small converter script that I wrote: https://github.com/EmilStenstrom/suc_to_iob
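To be concrete, IOB2 just means one token per line followed by its entity tag, with B- marking the first token of an entity and I- any continuation tokens. A made-up example sentence, with tag names matching the converter's output:

```
Emil        B-PRS
Stenström   I-PRS
bor         O
i           O
Stockholm   B-LOC
.           O
```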

The license for the corpus is CC BY-SA, and it does not affect the license of your model file (see the repo above for licensing clarifications).

How can I help to get this done? :)

EmilStenstrom commented 2 years ago

Ping @AngledLuffa, who has found my project before.

AngledLuffa commented 2 years ago

Hi! In fact this is already available in the dev branch, both the shuffled free version based on your code and the official version which requires a license. I can retrain with the improvements to your code, though! Have you thought about turning that into a pip module? That would be more convenient for us long term, although a bit more effort on your end.

How useful would a constituency parser be? We have a smallish dataset for Swedish, but training a model based on that hasn't been a priority until now.

EmilStenstrom commented 2 years ago

Awesome! From reading the docs, I think the official version is missing the ne tags (?). Anyhow, I think the improvements I made today raise the quality enough that it's worth retraining, thanks! :)

I'll make it a pip module if that's more convenient for you! Let me get back to you on this.

About a constituency parser: I haven't personally had a use-case for this, but I'm sure there are other Swedes that would appreciate this a lot.

Any idea of when your next release will be? So looking forward to this!

AngledLuffa commented 2 years ago

It does appear that there are some more tags in the public dataset which can be merged as you've done, but the official order of docs and the official train/dev/test split are available in the licensed version. Not sure which will be better.

I don't think it's necessary to make a pip module. For now, I simply copied the script into our repo. You can see how that looks here:

https://github.com/stanfordnlp/stanza/pull/913

As for the next release, I'll have to discuss with my PI. What's been happening lately is that generally when people bring us an issue, I explain how to install the dev branch and they go away, so there hasn't been a lot of pressure to get it done. Probably when there's a clear line to be drawn for "here is a finished set of features we want".

pip install git+git://github.com/stanfordnlp/stanza.git@fa17aa55f7343e8743069a78dea09fa366547452

This has two versions of a Swedish NER model: suc3, trained using the tags and splits provided in the licensed version of suc3, and suc3shuffle, trained using the "name" tags from a random split of the public suc3 dataset with the previous version of your script. I'm currently training a new model using the new version of your script; I can keep that under a separate name for now if you want to compare all three, or I can just get rid of last week's model if you don't care too much about it.
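Once that install finishes, trying it out looks roughly like this (a minimal sketch; the sentence is made up, and explicitly choosing between suc3 and suc3shuffle may need an extra package argument):

```python
import stanza

# Fetch the Swedish tokenizer + NER models; on the dev branch this grabs
# whichever Swedish NER package is currently the default.
stanza.download("sv", processors="tokenize,ner")

nlp = stanza.Pipeline("sv", processors="tokenize,ner")
doc = nlp("Emil Stenström bor i Stockholm.")  # made-up example sentence
for ent in doc.ents:
    print(ent.text, ent.type)
```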

EmilStenstrom commented 2 years ago

I think a pip module makes more sense no matter what. That way I can make improvements to the script over time, and you wouldn't have to copy-paste things over; you could just update a requirements file. It's not a lot of work for me, since I already maintain several other projects.

Installing from git is fine for personal projects, but for using this at work, being able to reference a real pip package is definitely better. No rush, I was just curious.

I'm sure the official suc3 version is best for evaluation and for comparing performance against academic papers, but I'm fairly certain the shuffled version with ne tags is much better in practice. I would name the shuffled version suc3 and the official version suc3-official, just to steer people in the right direction.

In my view, the official version has several issues that just make it worse for NER than the shuffled version.

You can definitely get rid of last week's model: since it only used the automated ne tags, it was strictly emulating another automated system. I manually went through the diff between the generated file before and after my changes, and all the changes are strict improvements over the last version.

AngledLuffa commented 2 years ago

FWIW last week's model used the name tags, not the ne tags. You can see in the stanza-internal PR that I had added a flag to use the name field instead of ne.

Thanks for the feedback! No one else has ever voted on Swedish NER, so I'll just take this advice until someone provides a compelling reason otherwise.

I'll update with results as soon as I have them.

AngledLuffa commented 2 years ago

Alright, here are some results...

I wouldn't put too much stock in the difference in scores between the shuffled and official dataset. The shuffled dataset mixes all the documents together, so it is almost definitely an easier dataset because of that.

I will note that the label names themselves are a lot easier to read in the old version of the shuffled dataset and in the licensed dataset. I can handle that on my end, but it might also be a useful mapping to add to the preparation script.

New version of the shuffled dataset:

2022-01-03 16:02:27 INFO: Score by entity:
Prec.   Rec.    F1
85.79   83.48   84.62
2022-01-03 16:02:27 INFO: Score by token:
Prec.   Rec.    F1
85.67   84.92   85.29
2022-01-03 16:02:27 INFO: NER tagger score:
2022-01-03 16:02:27 INFO: sv_suc3shuffle 84.62
2022-01-03 16:02:27 INFO: NER token confusion matrix:
     t\p       O    ANI    EVN    LOC    MSR    MYT    OBJ    ORG    PRS    TME    WRK
         O 161974      0      7     36     79      3     22     96     66    422    116
       ANI      0     24      0      0      0      0      0      3     12      0      0
       EVN     27      2     47      5      0      1      3     17      3      0      1
       LOC     77      0      1   1707      0      0      0     48     36      0      7
       MSR     48      0      0      0    893      0      0      0      0     16      0
       MYT      3      0      0      1      0     43      0      0     15      0      0
       OBJ     45      2      2      5      0      0     74      5     20      0      2
       ORG    124      0      0     63      0      0     10    915     50      0     15
       PRS    148      1      2     18      0      2      0     15   3419      0     20
       TME    258      0      0      1     11      0      0      0      0   3506      0
       WRK    227      0      2     14      0      0      0     43     65      4    438

Old version of the shuffled dataset:

2022-01-03 13:52:23 INFO: Score by entity:
Prec.   Rec.    F1
84.92   83.50   84.21
2022-01-03 13:52:24 INFO: Score by token:
Prec.   Rec.    F1
84.87   84.33   84.60
2022-01-03 13:52:24 INFO: NER tagger score:
2022-01-03 13:52:24 INFO: sv_suc3shuffle 84.21
2022-01-03 13:52:24 INFO: NER token confusion matrix:
       t\p            O    animal     event      inst      myth     other    person     place   product      work
            O    167760         0         6        71         4         1        67        34        16       120
       animal         1        28         0         0         0         0         9         1         0         0
        event        11         2        26        22         1         0         0         0         1         1
         inst        81         2         0       977         0         0        41       114         5        16
         myth         4         0         0         0        45         0        12         0         1         0
        other        20         0         2        26         0         7        14        41         7         9
       person        71         1         0        11         1         0      3384        11         3        10
        place        31         0         1        71         0         1        22      1304         0         2
      product        20         3         0         5         0         0        18         4        75         0
         work       126         0         0        25         0         0        50         9         0       522

Licensed dataset:

2022-01-03 16:21:58 INFO: Score by entity:
Prec.   Rec.    F1
82.13   82.95   82.54
2022-01-03 16:21:58 INFO: Score by token:
Prec.   Rec.    F1
82.42   82.98   82.70
2022-01-03 16:21:58 INFO: NER tagger score:
2022-01-03 16:21:58 INFO: sv_suc3 82.54
2022-01-03 16:21:58 INFO: NER token confusion matrix:
       t\p            O     event      inst      myth    person     place      work     other   product
            O     22530         0        11         0        13        12         7         0         0
        event         0         0         0         0         0         5         0         0         0
         inst        14         0        99         0         6        20         1         0         0
         myth         1         0         0         9         3         0         0         0         0
       person        10         0         0         0       343         8         0         0         0
        place         9         0        16         0         7       172         0         0         0
         work         4         0         0         0         4         0        13         0         0
        other         0         0         0         0         0         1         0         0         0
      product         0         0         0         0         0         0         1         0         0

AngledLuffa commented 2 years ago

Is this a mapping I can use? Question marks on the ones I'm not sure about:

ANI   animal
EVN   event
LOC   place
MSR   product ?
MYT   myth
OBJ   other ?
ORG   inst ?
PRS   person
TME   time
WRK   work

EmilStenstrom commented 2 years ago

The mapping is actually something I made up. I thought the standard was to use three-letter names for tags, and I've seen LOC, PRS, and ORG used in other datasets. Here are the tags from the CoNLL-2003 NER shared task, for instance: https://www.clips.uantwerpen.be/conll2003/ner/lists/eng.list. Do you agree that these kinds of tags make sense?

The tags are just abbreviations of their English names. I've mapped the name tags to these, to match the ne tags. I put an explanation of the tags here: https://github.com/EmilStenstrom/suc_to_iob/#available-tags

EmilStenstrom commented 2 years ago

Happy to see that the new version performs slightly better than the old shuffled version. I wonder if it would be a good idea to also map animal -> person, as the CoNLL-2003 annotation guidelines suggest, and maybe also merge MSR and TME?

AngledLuffa commented 2 years ago

It's kind of an English-centric view, but knowing English, I can look at PER, LOC, and ORG and guess pretty quickly what they mean. I don't have any clue what MSR is, and I can only guess MYT because I know the original dataset has myth. I'd go with slightly longer names, even if they need to be abbreviated sometimes (inst instead of institution). The 18-class models tend to spell out the full names of most of the NEs for the same reason.

I don't have a strong opinion on animal, partly because I'm not familiar with the data. If it's tagging "Shamu", that seems pretty hard to distinguish from person, but if it's tagging "whale", that might be useful as a separate entity type

EmilStenstrom commented 2 years ago

So I did my homework (looking at the actual data) and now have a new version out; the gist of the remapping is sketched below the list:

  1. PRS is now PER. It's what everyone else uses.
  2. myth is now PER. It was 99.9% references to gods' names, which could be PER.
  3. animal is now PER. It was 100% references to animal names.
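In the converter, that remapping is just a small lookup applied before the IOB2 tag is written out; conceptually something like this (a simplified sketch of the idea, not the exact code in suc_to_iob):

```python
# Simplified sketch: collapse a few SUC 3.0 categories into PER before
# emitting the IOB2 tag. The real logic lives in EmilStenstrom/suc_to_iob.
TAG_REMAP = {
    "PRS": "PER",     # the standard person tag used by most other NER datasets
    "myth": "PER",    # almost exclusively names of gods
    "animal": "PER",  # exclusively animal names
}

def remap(tag: str) -> str:
    """Map a SUC entity category to the tag written to the IOB2 output."""
    return TAG_REMAP.get(tag, tag)

assert remap("PRS") == "PER"
assert remap("LOC") == "LOC"  # everything else passes through unchanged
```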

AngledLuffa commented 2 years ago

Excellent, thanks! Do you mean "not" PER or "now" PER? It looks like you are mapping animals to PER in the script

On Tue, Jan 4, 2022 at 3:30 PM Emil Stenström wrote:

So I did my homework (looking at the actual data) and now have a new version out:

  1. PRS is now PER. It's what everyone else uses.
  2. myth is now PER. It was 99.9% references to gods names, which could be PER.
  3. animal is not PER. It was 100% references to animal names.

AngledLuffa commented 2 years ago

It does okay, I guess... not very good at identifying works of art. Combining animals and people certainly made the people category easier, though.

2022-01-04 18:54:19 INFO: Score by entity:
Prec.   Rec.    F1
86.66   84.69   85.66
2022-01-04 18:54:19 INFO: Score by token:
Prec.   Rec.    F1
86.18   86.13   86.16
2022-01-04 18:54:19 INFO: NER tagger score:
2022-01-04 18:54:19 INFO: sv_suc3shuffle 85.66
2022-01-04 18:54:19 INFO: NER token confusion matrix:
     t\p       O    EVN    LOC    MSR    OBJ    ORG    PER    TME    WRK
         O 161949      7     39     86     20     77     73    427    145
       EVN     22     51      3      0      3     19      2      0      6
       LOC     67      1   1732      0      0     40     30      0      6
       MSR     44      0      0    899      0      0      0     14      0
       OBJ     43      2      4      0     76      7     21      0      2
       ORG    123      0     60      0      3    924     54      0     13
       PER    146      2     22      0      0     16   3515      0     23
       TME    243      0      1     13      0      0      0   3519      0
       WRK    192      1     10      0      0     36     57      4    493

EmilStenstrom commented 2 years ago

Hmm... so maybe we should just ignore works of art? Events seem bad too, even though there are very few such tokens.

AngledLuffa commented 2 years ago

I think they're both fine, even if they're not the best scores. Maybe a better model will come along that fixes it! (For example, we hope to add a transformer to our NER in the near future; it's already been done for the conparser, so this isn't just vaporware.)

It takes a while to rebuild the models, but you can expect the dev branch to have "shuffle" and "licensed" by tomorrow morning US time.
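Once those are up, picking one of the two explicitly should look something like this (a sketch based on how stanza normally selects per-processor packages; I'm assuming the talbanken tokenizer package and the model names used above, and the exact spelling may shift on the dev branch):

```python
import stanza

# Sketch: explicitly request the shuffled (CC BY-SA) Swedish NER package.
# "suc3shuffle" / "suc3" follow the model names shown in the scores above;
# swap in "suc3" to get the licensed split instead.
stanza.download("sv", package=None,
                processors={"tokenize": "talbanken", "ner": "suc3shuffle"})
nlp = stanza.Pipeline("sv", package=None,
                      processors={"tokenize": "talbanken", "ner": "suc3shuffle"})
print([(e.text, e.type) for e in nlp("Kalle bor i Göteborg.").ents])
```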

EmilStenstrom commented 2 years ago

Thank you so much for this work!

AngledLuffa commented 2 years ago

Alright, it should be the default model downloaded with the dev branch of stanza now

AngledLuffa commented 2 years ago

Now released in 1.4.0