I was hoping that stanza would be willing to add support for NER in Swedish.
For this, you need a Swedish dataset in IOB2 format. I have a way to get a file like that using the SUC 3.0 corpus and a small converter script that I wrote: https://github.com/EmilStenstrom/suc_to_iob
The license for the corpus is CC BY-SA, and it does not affect the license of your model file (see the repo above for licensing clarifications).
How can I help to get this done? :)
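For readers unfamiliar with it, IOB2 data looks roughly like this (an invented sample for illustration, not actual converter output; the exact tag names and separator may differ):

```
Emil       B-person
Stenström  I-person
bor        O
i          O
Stockholm  B-place
.          O
```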
ping @AngledLuffa, who found my project before I did.
Hi! In fact, this is already available in the dev branch: both the shuffled free version based on your code and the official version, which requires a license. I can retrain with the improvements to your code, though! Have you thought about turning that into a pip module? That would be more convenient for us long term, although a bit more effort on your end.
How useful would a constituency parser be? We have a smallish dataset for Swedish, but training a model based on that hasn't been a priority until now.
Awesome! From reading the docs, I think the official version is missing the ne tags (?). Anyhow, I think the improvements I made today will improve quality enough that it's worth retraining, thanks! :)
I'll make it a pip module if that's more convenient for you! Let me get back to you on this.
About a constituency parser: I haven't personally had a use case for this, but I'm sure there are other Swedes who would appreciate it a lot.
Any idea when your next release will be? Really looking forward to this!
It does appear that there are some more tags in the public dataset which can be merged as you've done, but the official document order and the official train/dev/test split are available in the licensed version. Not sure which will be better.
I don't think it's necessary to make a pip module. For now, I simply copied the script into our repo. You can see how that looks here:
https://github.com/stanfordnlp/stanza/pull/913
As for the next release, I'll have to discuss with my PI. What's been happening lately is that when people bring us an issue, I explain how to install the dev branch and they go away, so there hasn't been a lot of pressure to get a release done. It will probably happen when there's a clear line to be drawn: "here is a finished set of features we want".
pip install git+git://github.com/stanfordnlp/stanza.git@fa17aa55f7343e8743069a78dea09fa366547452
This has two versions of a Swedish NER model: suc3, trained using the tags and splits provided in the licensed version of suc3, and suc3shuffle, trained using the "name" tags from a random split of the public suc3 dataset with the previous version of your script. I'm currently training a new model using the new version of your script; I can keep that under a separate name for now if you want to compare all three, or I can just get rid of last week's model if you don't care too much about it.
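For anyone following along, once that dev commit is installed, trying Swedish NER looks roughly like this (a minimal sketch; the sentence and the printed entities are illustrative, and the pipeline here just picks up whichever Swedish NER model is the default):

```python
import stanza

# Download the Swedish models and build a tokenize+NER pipeline.
stanza.download("sv")
nlp = stanza.Pipeline("sv", processors="tokenize,ner")

# "Emil lives in Stockholm."
doc = nlp("Emil bor i Stockholm.")
for ent in doc.ents:
    print(ent.text, ent.type)
```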
I think a pip module makes more sense no matter what. That way I can make improvements to the script over time, and you wouldn't have to copy-paste things over; you could just update a requirements file. It's not a lot of work for me, I have several other projects already.
Installing from git is fine for personal projects, but for using this at work, being able to reference a real pip package is definitely better. No rush, I was just curious.
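As a rough illustration of how small that packaging step is, a minimal setup.py for the converter could look like this (the package name, module name, and `main` entry point are assumptions for the sketch, not an actual published package):

```python
from setuptools import setup

setup(
    name="suc-to-iob",               # hypothetical PyPI name
    version="0.1.0",
    py_modules=["suc_to_iob"],       # assumes a single-module script
    python_requires=">=3.6",
    entry_points={
        # assumes the script exposes a main() function
        "console_scripts": ["suc-to-iob=suc_to_iob:main"],
    },
)
```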
I'm sure the official suc3 version is best for evaluation and for comparing performance against academic papers. I'm fairly certain that the shuffled version with ne tags is much better in practice. I would set the shuffled version to suc3 and the official version to suc3-official, just to steer people in the right direction.
In my view, the official version has several issues that just make it worse for NER than the shuffled version.
You can definitely get rid of last week's model; since it only used the automated ne tags, it was strictly emulating another automated system. I manually went through the diff between the generated file before and after, and all changes are strict improvements over the last version.
FWIW, last week's model used the name tags, not the ne tags. You can see in the stanza-internal PR that I had added a flag to use the name field instead of ne.
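That kind of switch might look something like this in a data preparation script (a hypothetical sketch, not the actual code from the PR):

```python
import argparse

# Choose which SUC 3.0 annotation layer to read entity tags from.
parser = argparse.ArgumentParser()
parser.add_argument("--tag-field", choices=["name", "ne"], default="ne",
                    help="which annotation field to use for NER tags")
args = parser.parse_args()
print(f"Reading tags from the {args.tag_field} field")
```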
Thanks for the feedback! No one else has ever weighed in on Swedish NER, so I'll just take this advice until someone provides a compelling reason otherwise.
I'll update with results as soon as I have them.
Alright, here are some results...
I wouldn't put too much stock in the difference in scores between the shuffled and official datasets. The shuffled dataset mixes all the documents together, so it is almost certainly an easier dataset because of that.
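To make that distinction concrete, here is a small sketch of the two split strategies (illustrative code, not from either repository): shuffling at the sentence level leaks vocabulary from every document into both train and test, while splitting at the document level keeps them apart.

```python
import random

def split_by_sentence(docs, train_frac=0.8, seed=1234):
    # Easier setup: sentences from the same document land in both splits.
    sentences = [sent for doc in docs for sent in doc]
    random.Random(seed).shuffle(sentences)
    cut = int(len(sentences) * train_frac)
    return sentences[:cut], sentences[cut:]

def split_by_document(docs, train_frac=0.8, seed=1234):
    # Harder, more realistic setup: whole documents stay on one side.
    docs = list(docs)
    random.Random(seed).shuffle(docs)
    cut = int(len(docs) * train_frac)
    return docs[:cut], docs[cut:]
```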
I will note that the names of the labels themselves are a lot easier to read in the old version of the shuffled dataset and in the licensed dataset. I can handle that on my end, but it might also be a useful mapping to add to the preparation script.
New version of the shuffled dataset:
2022-01-03 16:02:27 INFO: Score by entity:
Prec. Rec. F1
85.79 83.48 84.62
2022-01-03 16:02:27 INFO: Score by token:
Prec. Rec. F1
85.67 84.92 85.29
2022-01-03 16:02:27 INFO: NER tagger score:
2022-01-03 16:02:27 INFO: sv_suc3shuffle 84.62
2022-01-03 16:02:27 INFO: NER token confusion matrix:
t\p O ANI EVN LOC MSR MYT OBJ ORG PRS TME WRK
O 161974 0 7 36 79 3 22 96 66 422 116
ANI 0 24 0 0 0 0 0 3 12 0 0
EVN 27 2 47 5 0 1 3 17 3 0 1
LOC 77 0 1 1707 0 0 0 48 36 0 7
MSR 48 0 0 0 893 0 0 0 0 16 0
MYT 3 0 0 1 0 43 0 0 15 0 0
OBJ 45 2 2 5 0 0 74 5 20 0 2
ORG 124 0 0 63 0 0 10 915 50 0 15
PRS 148 1 2 18 0 2 0 15 3419 0 20
TME 258 0 0 1 11 0 0 0 0 3506 0
WRK 227 0 2 14 0 0 0 43 65 4 438
Old version of the shuffled dataset:
2022-01-03 13:52:23 INFO: Score by entity:
Prec. Rec. F1
84.92 83.50 84.21
2022-01-03 13:52:24 INFO: Score by token:
Prec. Rec. F1
84.87 84.33 84.60
2022-01-03 13:52:24 INFO: NER tagger score:
2022-01-03 13:52:24 INFO: sv_suc3shuffle 84.21
2022-01-03 13:52:24 INFO: NER token confusion matrix:
t\p O animal event inst myth other person place product work
O 167760 0 6 71 4 1 67 34 16 120
animal 1 28 0 0 0 0 9 1 0 0
event 11 2 26 22 1 0 0 0 1 1
inst 81 2 0 977 0 0 41 114 5 16
myth 4 0 0 0 45 0 12 0 1 0
other 20 0 2 26 0 7 14 41 7 9
person 71 1 0 11 1 0 3384 11 3 10
place 31 0 1 71 0 1 22 1304 0 2
product 20 3 0 5 0 0 18 4 75 0
work 126 0 0 25 0 0 50 9 0 522
Licensed dataset:
2022-01-03 16:21:58 INFO: Score by entity:
Prec. Rec. F1
82.13 82.95 82.54
2022-01-03 16:21:58 INFO: Score by token:
Prec. Rec. F1
82.42 82.98 82.70
2022-01-03 16:21:58 INFO: NER tagger score:
2022-01-03 16:21:58 INFO: sv_suc3 82.54
2022-01-03 16:21:58 INFO: NER token confusion matrix:
t\p O event inst myth person place work other product
O 22530 0 11 0 13 12 7 0 0
event 0 0 0 0 0 5 0 0 0
inst 14 0 99 0 6 20 1 0 0
myth 1 0 0 9 3 0 0 0 0
person 10 0 0 0 343 8 0 0 0
place 9 0 16 0 7 172 0 0 0
work 4 0 0 0 4 0 13 0 0
other 0 0 0 0 0 1 0 0 0
product 0 0 0 0 0 0 1 0 0
Is this a mapping I can use? I've put question marks on the ones I'm not sure about (see also the sketch after the list):
ANI animal
EVN event
LOC place
MSR product ?
MYT myth
OBJ other ?
ORG inst ?
PRS person
TME time
WRK work
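Spelled out as code, that proposal would be (a sketch mirroring the list above; the question-marked entries are still uncertain):

```python
# Proposed mapping from the three-letter SUC tags to the longer names.
# Entries marked "uncertain" correspond to the question marks above.
SUC_TAG_MAP = {
    "ANI": "animal",
    "EVN": "event",
    "LOC": "place",
    "MSR": "product",  # uncertain
    "MYT": "myth",
    "OBJ": "other",    # uncertain
    "ORG": "inst",     # uncertain
    "PRS": "person",
    "TME": "time",
    "WRK": "work",
}
```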
The mapping is actually something I made up. I thought the standard was to use three-letter names for tags, and I've seen LOC, PRS, and ORG used in other datasets. Here are the tags from the CoNLL2003 NER tagging workshop, for instance: https://www.clips.uantwerpen.be/conll2003/ner/lists/eng.list - Do you agree that these kinds of tags make sense?
The tags are just abbreviations of their English names. I've mapped the name tags to these, to match the ne tags. I put an explanation of the tags here: https://github.com/EmilStenstrom/suc_to_iob/#available-tags
Happy to see that the new version performs slightly better than the old shuffled version. I wonder if it would be a good idea to also map animal -> person, as the CoNLL2003 annotation guidelines suggest, and maybe also merge MSR and TME?
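If that merge turns out to be a good idea, it is a small post-processing step on the IOB2 file (a hypothetical sketch; the tab-separated token/tag layout is an assumption about the converter's output):

```python
# Remap entity labels in IOB2 lines, e.g. folding "animal" into "person".
REMAP = {"animal": "person"}

def remap_line(line: str) -> str:
    line = line.rstrip("\n")
    if not line:
        return line  # blank line = sentence boundary
    token, tag = line.rsplit("\t", 1)
    if "-" in tag:  # e.g. "B-animal" -> "B-person"
        prefix, label = tag.split("-", 1)
        tag = f"{prefix}-{REMAP.get(label, label)}"
    return f"{token}\t{tag}"
```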
It's kind of an English-centric view, but knowing English, I can look at PER, LOC, ORG and guess pretty quickly what those mean. I don't have any clue what MSR is, and I can only guess MYT because I know the original dataset has myth. I'd go with slightly longer names, even if they need to be abbreviated sometimes (inst instead of institution). The 18-class models tend to spell out the full names of most of the NEs for the same reason.
I don't have a strong opinion on animal, partly because I'm not familiar with the data. If it's tagging "Shamu", that seems pretty hard to distinguish from person, but if it's tagging "whale", that might be useful as a separate entity type.
So I did my homework (looking at the actual data) and now have a new version out:
- PRS is now PER. It's what everyone else uses.
- myth is now PER. It was 99.9% references to gods' names, which could be PER.
- animal is not PER. It was 100% references to animal names.
Excellent, thanks! Do you mean "not" PER or "now" PER? It looks like you are mapping animals to PER in the script.
It does okay, I guess... not very good at identifying works of art. Combining animals and people certainly made the people category easier, though.
2022-01-04 18:54:19 INFO: Score by entity:
Prec. Rec. F1
86.66 84.69 85.66
2022-01-04 18:54:19 INFO: Score by token:
Prec. Rec. F1
86.18 86.13 86.16
2022-01-04 18:54:19 INFO: NER tagger score:
2022-01-04 18:54:19 INFO: sv_suc3shuffle 85.66
2022-01-04 18:54:19 INFO: NER token confusion matrix:
t\p O EVN LOC MSR OBJ ORG PER TME WRK
O 161949 7 39 86 20 77 73 427 145
EVN 22 51 3 0 3 19 2 0 6
LOC 67 1 1732 0 0 40 30 0 6
MSR 44 0 0 899 0 0 0 14 0
OBJ 43 2 4 0 76 7 21 0 2
ORG 123 0 60 0 3 924 54 0 13
PER 146 2 22 0 0 16 3515 0 23
TME 243 0 1 13 0 0 0 3519 0
WRK 192 1 10 0 0 36 57 4 493
Hmm... so maybe we should just ignore works of art? Events seem bad too, even though there are very few such tokens.
I think they're both fine, even if they're not the best scores. Maybe a better model will come along that fixes it! (For example, we hope to add a transformer to our NER in the near future - it's already been done for the conparser, so this isn't just vaporware)
It takes a while to rebuild the models, but you can expect the dev branch to have "shuffle" and "licensed" by tomorrow morning US time.
Thank you so much for this work!
Alright, it should be the default model downloaded with the dev branch of stanza now.
Now released in 1.4.0.