amir-zeldes opened 2 years ago
For English, we did the best we could to unify the xpos tags and features across treebanks. Same with Italian; that was a bit more difficult, since no one on the project speaks Italian, but the treebank maintainers were quite helpful with that.
If the Hebrew standards are the same across different treebanks, we can mix them together and see what happens. I assume you're talking about IAHLT and HTB?
There are also Hebrew NER and constituency datasets which could be added, if you want to see more Hebrew coverage in Stanza in general. There is a sentiment dataset here:
https://github.com/OnlpLab/Hebrew-Sentiment-Data
Overall, there's a bunch of stuff we could add if we put some effort into improving our Hebrew pipeline.
I took a look at the two treebanks.
The first thing to note is that they seem to follow similar MWT splitting guidelines, although since I don't read Hebrew I can't judge how similar they really are. Would you double-check that?
I can see the xpos tags are the same as the upos tags in both treebanks, so no questions there. Roughly the same fraction of words is featurized in both sets. However, there are some differences between the two feature inventories:

In IAHLT, not HTB:
Aspect=Prog
Foreign=Yes
HebBinyan=NITPAEL
NumType=Card
NumType=Ord
Poss=Yes
Tense=Pres
In HTB, not IAHLT:
Case=Tem
HebExistential=Yes
Number=Dual,Plur
Person=1,2,3
Typo=Yes
If this is something you can help unify, we'll be happy to make combined models.
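For anyone wanting to reproduce a diff like the one above, the comparison can be done mechanically. Here is a minimal sketch (function names are my own, not part of any toolkit) that collects the `Feature=Value` inventory from two CoNLL-U documents and reports what each uses that the other doesn't:

```python
def feats_inventory(conllu_text):
    """Collect every Feature=Value pair used in a CoNLL-U string."""
    feats = set()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # skip MWT ranges / malformed lines and empty FEATS
        if len(cols) != 10 or cols[5] == "_":
            continue
        feats.update(cols[5].split("|"))
    return feats

def diff_feats(text_a, text_b):
    """Return (features only in A, features only in B), sorted."""
    a, b = feats_inventory(text_a), feats_inventory(text_b)
    return sorted(a - b), sorted(b - a)
```

Running `diff_feats` over the full train/dev/test files of each treebank would regenerate the two lists above.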
Actually the segmentation guidelines of the old HTB don't match IAHLT, and as you noted there are feature differences. In fact, the HTB in the UD repo hasn't been valid since 2018, so it's in legacy status and doesn't match what's in IAHLT, which is newer. However we do have a revised version of HTB which is valid (at least as of UD2.9) and matches the standards in IAHLTwiki pretty closely. You can find it here:
https://github.com/IAHLT/UD_Hebrew
That should allow for good joint results, combined with https://github.com/universalDependencies/UD_Hebrew-IAHLTWiki
That's interesting. Is there a possibility of upgrading the UD version of HTB to the revised version? I haven't been following developments for those treebanks at all.
I think the earliest that could happen would be in November, based on the guidelines here:
http://quest.ms.mff.cuni.cz/udvalidator/cgi-bin/unidep/validation-report.pl
The preamble states that after 4 years in legacy status, an invalid legacy TB will be excluded from the release, so perhaps this would be a chance to propose switching to that fork.
I should be able to make this happen tomorrow.
Fantastic, just let me know once it's available - it's too late for something I needed for an author response, but could still go into a camera ready version... :)
I would have made it a bit more of a priority had I known there was a clock. It takes a few hours to go from 0 to models, and I was using my time (and my GPU time) to make some progress on the sentiment classifier and parser.
Anyway, I just need to finish up the depparse and it will be ready. (That has to happen after POS, because the depparse is trained on predicted tags from the tagger.) Are there any other models which would be useful, such as NER, constituency, or sentiment?
Ok, if you use stanza dev branch, it should be available now.
I'm gonna make a new release soon with it as default - just need to retrain a few models first
Oh wow, thank you! And no worries about the clock, this was super fast! We can include numbers from this model in our upcoming paper now.
@ivrit @yifatbm @nlhowell @AvnerAlgom - note this means that there will be a joint wiki+HTB Hebrew Stanza model using the new tokenization out of the box
This is part of stanza as of 1.4.1
Do you plan on taking over maintenance of the original HTB (or at least requesting it) when UD 2.11 comes out? Currently this is kind of ad hoc, requiring a couple of different items to be downloaded before building the models.
This would depend on how the community and previous maintainers feel, but if they don't have the resources to maintain the older fork then yes, I would be willing to maintain the newer one. Having it consistent with the new Wiki-based corpus would be a big plus for Hebrew NLP, and there are more corpora coming for Hebrew in the same scheme.
Gotcha. Perhaps a meaty PR with the updates would at least get the other repo to use the same annotation scheme. Either way, it would make it easier to do things like the combined models
Agreed - I'll bring it up in the countdown to the Nov. release.
@amir-zeldes did you wind up taking over the other Hebrew dataset, or otherwise having it updated to the better tagging scheme?
Also, do you have anything else that would be useful for your usage of Stanza? An NER model, for example, or perhaps a constituency parser built out of SPMRL?
@AngledLuffa There is an updated HTB but it's not perfect. I think Amir has it in his repository. Also, we at IAHLT want to release a fully open version of the new Hebrew dataset with NE annotations. Currently only about 80% of it has undergone two rounds of QA. I am happy to send it over for training, if that helps.
I can wait for it to be released. This thread is one of the first times people have mentioned using the Hebrew models. Thanks!
Hi @AngledLuffa , no, so far the HTB maintainers have opted to leave it as is, but if that changes I can let you know. My own fork, with tokenization etc. matching the Wiki corpus is still available here and is valid based on the UD validator (or was until recently, it's a moving target..) I can also keep updating it, but maybe I should wait and see what happens with HTB in the next UD release. In any case you should be able to get a good combined model going using that repo + the official UD IAHLTwiki - in fact, we have published scores for that using Stanza in this paper.
Adding NER support would be fantastic, but as Noam mentioned, the data is not yet publicly available. A portion of it overlaps the entire IAHLTwiki corpus, so my plan is to merge the NER annotations into that using the same format as English GUM (i.e. the CorefUD/Universal Anaphora conllu format).
I don't suppose constituency trees are in the plans for the new resource? I may be one of the last people to care about that 🤷
> Adding NER support would be fantastic
There's other Hebrew NER data out there. For example: https://github.com/OnlpLab/NEMO-Corpus
There is a Hebrew constituent TB over the same material as HTB, but it is very old and I wouldn't count on the tokenization matching either of the UD versions.
The NEMO data is the same text as the old-tokenization HTB (all come from the same ~1990 Ha'aretz newspaper data), though I think it also includes a version which stretches the annotations to the nearest MWT and could probably be projected somehow to either corpus. But the IAHLT standard is different and IMO much better, so I would wait for that (NEMO has some very strange practices such as excluding 'of' PP modifiers, so in ORG or PER like "the State Department/Foreign Minister of Canada" it will always leave out "of Canada", making it much less useful for applications).
Gotcha, thanks for the insight
Did you ever make any progress unifying the different HE treebanks in the UD umbrella? It would make rebuilding the models simpler in the long run.
Also, LMK if there's a need for NER and a dataset I should use
@AngledLuffa we still did not sync the two UDs, but we have a new (soon to be publicly available) Hebrew UD+NER dataset. I'll send it to you via email until Amir pushes it in the next UD release.
Great, thanks! What should I do regarding this dataset and the existing UD datasets? Currently the default Hebrew models for Stanza are built from "UD_Hebrew-IAHLTwiki" and the github.com:IAHLT/UD_Hebrew.git repo. If I add the new data you just sent me, does that overlap either of those sources, or should I use all three?
OK, so in the compressed file I sent you there are an additional 4.7k annotated sentences, basically taken from here, only that we (read: Amir) did some extra cleaning and split it into train/dev/test. So it would be nice to see whether a combined model yields considerably better results.
Sounds good. So basically all three should be disjoint, and I should train with all three and report the results on the various dev & test sets?
And eventually the new data will be integrated with UD, but the git repo I just linked to is not expected to be part of UD any time soon?
wait... went back to take a look, and your message said "superset". so one of the other two datasets is also part of what you sent me?
I think it would be simpler if I just add the disjoint new dataset here. Please find attached the new IAHLTKnesset data with the splits. If I were you, I would train on both UD_Hebrew-IAHLTwiki + the attached, as the schema is exactly the same. It would be great if we could combine it with other datasets and schemas, but that would require more work. UD_Hebrew-IAHLTknesset.zip
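Assuming the two treebanks really do share the same schema, merging them for training is mostly a matter of concatenating the .conllu train files while keeping sent_ids unique across corpora. A rough sketch (helper name and the corpus-prefixing convention are illustrative, not any Stanza API):

```python
def combine_conllu(named_docs):
    """named_docs: list of (corpus_name, conllu_text) pairs. Prefixes each
    sent_id with the corpus name so ids stay unique in the merged file,
    and keeps sentence blocks separated by one blank line."""
    out = []
    for name, text in named_docs:
        for block in text.strip().split("\n\n"):
            lines = []
            for line in block.splitlines():
                if line.startswith("# sent_id = "):
                    line = "# sent_id = %s-%s" % (name, line[len("# sent_id = "):])
                lines.append(line)
            out.append("\n".join(lines))
    return "\n\n".join(out) + "\n"
```

The merged output can then be fed to a single training run as if it were one treebank.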
Overall I think the results are promising for using the new Hebrew dataset alongside the other two datasets. Tokenization, MWT, Lemmas are all in the same general score range. As an example of how the new data allows for broader coverage, here are some POS and depparse results
For POS, the scores are about the same on the original dev & test sets (the IAHLTwiki dataset), but the coverage is clearly better on Knesset:
| pos | UPOS | XPOS | UFeats | AllTags |
|---|---|---|---|---|
| orig model, dev | 97.36 | 97.36 | 93.35 | 92.39 |
| orig model, test | 97.39 | 97.39 | 92.03 | 91.31 |
| new model, dev | 97.39 | 97.32 | 93.32 | 92.28 |
| new model, test | 97.41 | 97.42 | 92.04 | 91.20 |
| orig model, new dataset test | 96.63 | 95.85 | 82.93 | 80.53 |
| new model, new dataset test | 97.47 | 96.90 | 92.78 | 90.33 |
For depparse, I would again say the scores are similar (maybe a bit of a dip from adding the new data), but the coverage on the new dataset's test set is clearly better.
| depparse | UAS | LAS | CLAS | MLAS | BLEX |
|---|---|---|---|---|---|
| orig model, dev | 94.25 | 92.20 | 88.99 | 88.39 | 88.99 |
| orig model, test | 94.01 | 91.65 | 88.31 | 87.37 | 88.31 |
| new model, dev | 94.18 | 92.22 | 89.14 | 88.50 | 89.14 |
| new model, test | 94.02 | 91.56 | 87.88 | 87.06 | 87.88 |
| orig model, new dataset test | 89.68 | 86.46 | 82.00 | 81.03 | 82.00 |
| new model, new dataset test | 91.99 | 89.57 | 85.79 | 85.16 | 85.79 |
I haven't used the dev set from the new dataset in any way. I'm not sure whether it would make more sense to put all three dev sets together, keep using just one dev set as I currently do, or perhaps even use the dev sets from the datasets which aren't the primary scoring metric as additional training data.
At any rate, what do you think? Make the models with the third dataset the default HE models?
In terms of availability of this dataset, is it going to be part of UD 2.15? That would make it easier to maintain the models going forward. Even better would be if the IAHLT standard for the older HE dataset becomes part of UD somehow.
Thanks @ivrit for providing the files, and thanks @AngledLuffa for getting those scores! Very nice to see and I expect even if you don't get gains on within-dataset scores, the resulting model will be much more robust on actually unseen data.
Yes, the Knesset data will be released as part of UD proper - it already has a repo here and I just need to clean up some stuff to push the data to the dev branch. It should be released into master in November.
I agree that using all the dev sets for dev is a bit wasteful - it's really more data than dev needs. Maybe shuffle all the devs document-wise, then take a third from each as dev and append the rest to train for the joint model? I would definitely make the joint model the default, since using just HTB limits the parser to newswire only (and from over 30 years ago at that).
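The shuffle-and-split idea above could be sketched roughly like this (the function name and the document representation are my own assumptions, not Stanza internals):

```python
import random

def split_devs(dev_docs, seed=0, dev_fraction=1/3):
    """dev_docs: mapping from treebank name to a list of documents
    (each document = a list of CoNLL-U sentence blocks). Shuffles
    document-wise per treebank; roughly a third of each treebank's
    documents stay dev, the rest move to train."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    new_dev, extra_train = [], []
    for name, docs in sorted(dev_docs.items()):
        docs = list(docs)
        rng.shuffle(docs)
        cut = max(1, round(len(docs) * dev_fraction))
        new_dev.extend(docs[:cut])
        extra_train.extend(docs[cut:])
    return new_dev, extra_train
```

Splitting at document boundaries (rather than sentence boundaries) avoids leaking near-duplicate sentences from the same document into both train and dev.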
Finally about the IAHLT standard, it seems the original HTB maintainers want to keep up the dataset in their standard, so we can't replace it. I'm happy to keep the IAHLT fork up to date with UD developments though, and I'm willing to keep working on maintaining consistency between the fork and the IAHLT datasets.
Eh.... I was kinda lazy and didn't remove the features you mentioned in the offline email. I guess there's a possible problem of inconsistency when similar words come from two different datasets. Will those features be added going forward to the wiki dataset and the rewrite of the original dataset?
Would an NER model from this data be useful? It would appear that annotation layer is only in the knesset dataset. Are there plans to add it to the wiki dataset or the older dataset?
At any rate, models built from all three datasets (but with the features not properly handled) are now the default models for Stanza HE
> Will those features be added going forward to the wiki dataset and the rewriting of the original dataset?
I'm not sure - if it's deterministically possible I might take a stab at it, but relatives are probably tricky. Let me take a look, at least for the public UD data.
> Would an NER model from this data be useful?
Yes, I think so! There is NER data from NEMO for original HTB, but it's not the same scheme as IAHLT. I'm not sure what the status is for the wiki data, maybe @ivrit can say more about that. A model using just Knesset is definitely better than nothing, but it is the smallest dataset, and rather specific (spoken+domestic politics)
@ivrit any further thoughts on the NER in the various treebanks, and if there will be more annotations available in the future? I can just use the Knesset data for now if there's nothing planned in the near future.
@AngledLuffa I think the Knesset data in itself is worth training a NER model on. However, we do have a version of IAHLT-wiki which is annotated for NE. I am not sure it's perfectly aligned, but I can send you what I have.
Sure, that works, although if the expectation is that it will be part of UD github at some point, perhaps you could just point me to a branch that has the wiki data
I do think the broader coverage would be better for our publicly available NER model
Am I right in these two hypotheses?
Also, are these the tags? What do they mean? (I'll post an explanation with our documentation.)
ANG DUC EVE FAC GPE LOC MISC ORG PER TIMEX TTL WOA
Is there a citation I should use?
ah, found the answers to the last two questions here:
https://github.com/UniversalDependencies/UD_Hebrew-IAHLTknesset/tree/dev
> Will those features be added going forward to the wiki dataset and the rewriting of the original dataset?
I just did a pass to unify FEATS in the three TBs with IAHLT style tokenization, so incl. the HTB fork. You can pull the dev branch from each one to test, hopefully they are a bit more consistent now (but for sure not perfect!)
> Am I right in these two hypotheses?
Yes, I believe that's right. I think there were plans to do nested entities but they didn't happen in the end, is that right @ivrit ?
Great. The models should now be part of the default and default_accurate model packages, and I put a description of the models here https://stanfordnlp.github.io/stanza/ner_models.html along with the entity F1 for the transformer-based NER model.
Time to put this aside for a while, or are the treebanks in a stable spot for the next few months, making now a good time to retrain the POS models with the updated FEATS?
As for other packages:
My understanding is constituency wouldn't be super useful for you, and the tokenization standards may be different anyway, so no urgency on converting the SPMRL Hebrew treebank
If there's a sentiment dataset you like, we can add that as well
If there's coreference, we're wrapping up a change to our coreference model to make it multilingual, so we can add that as well probably in a couple weeks @Jemoka
Further thought - in the paper, you compare Stanza to Trankit and your own pipeline. Was that using Stanza with a transformer? I suspect not, since I remember having a discussion about Stanza with Transformers with @amir-zeldes at GURT in 2023.
> we do have a version of IAHLT-wiki which is annotated for NE
@ivrit shared the files with me and I just integrated the data into the official Wiki corpus dev branch, so you should be able to see it. The labels are the same as Knesset, but I noticed they actually ARE nested. I'm not sure if they're not in Knesset because it's rare in speech, or if they just skipped nested cases there. In any case, that is a difference between the two datasets - they both only cover named cases though, so nested entities are perhaps not very common.
If joint training is problematic, Wiki might be the better choice for a default, since it's substantially bigger and more general domain.
> Time to put this aside for a while, or are the treebanks in a stable spot for the next few months and now's a good time to retrain the POS models with the updated FEATS?
I don't plan on making any other changes ATM, unless someone remembers another forgotten annotation layer or volunteers to do a big manual overhaul of something 😅
> constituency wouldn't be super useful for you
No, not right now for Hebrew, though one of my students just used the English constituency model in Stanza for a project, thanks for making it available!
> If there's coreference, we're wrapping up a change to our coreference model to make it multilingual, so we can add that as well probably in a couple weeks
That sounds exciting! I think IAHLT planned to make some coref data but I don't know how far they got (@ivrit ?) There was also a student of Reut's who was working on Hebrew coref (Shaked I think?) so you could ask her as well.
> in the paper, you compare Stanza to Trankit and your own pipeline. Was that using Stanza with a transformer?
No, that paper is from 2022, so the Stanza was definitely non-transformer, though Trankit was using the default XLM model. The biggest problem with both systems was tokenization performance - it wasn't huge for tagging if you supplied gold segmentation, though the native Hebrew transformer did do a bit better than XLM IIRC.
Sorry for the late replies... I hope I can address everything:
I actually think we are better off with a flattened representation of entities; more often than not, nested entities don't carry important notional concepts of the kind you'd find in medical texts.
It should be pretty easy to drop the nested entities, then.
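One way to sketch the flattening, assuming the `Entity=` encoding used in these corpora (per-token values built from pieces like `(PER`, `PER)`, or `(GPE)`): track bracket depth across tokens and keep only the depth-0 opens and closes. The function name is illustrative, not an existing API:

```python
import re

# Events in an Entity= value, in order: singleton "(PER)", open "(PER",
# or close "PER)". Labels are assumed alphanumeric, as in the examples.
EVENT = re.compile(r"\((\w+)\)|\((\w+)|(\w+)\)")

def flatten_entities(entity_values):
    """entity_values: per-token Entity strings ('_' for none). Returns new
    per-token strings with nested spans removed (only outermost kept)."""
    depth = 0
    out = []
    for value in entity_values:
        kept = []
        if value != "_":
            for m in EVENT.finditer(value):
                single, opener, closer = m.groups()
                if single is not None:
                    if depth == 0:          # singleton at top level: keep
                        kept.append("(%s)" % single)
                elif opener is not None:
                    if depth == 0:          # outermost open: keep
                        kept.append("(%s" % opener)
                    depth += 1
                else:
                    depth -= 1
                    if depth == 0:          # outermost close: keep
                        kept.append("%s)" % closer)
        out.append("".join(kept) or "_")
    return out
```

For example, the well-nested pair `(WOA` / `(GPE)WOA)` would come out as `(WOA` / `WOA)`, with the inner GPE dropped.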
I agree with Amir regarding the crucial step of tokenization; such errors percolate and misinform all other annotation layers. Some years back, I tried to override Stanza's tokenizer with Amir's RFTokenizer, but failed. I think performance would improve markedly if you did that.
This sounds like an important step to get right. I found this: https://github.com/amir-zeldes/RFTokenizer - but how do we do sentence segmentation once the words are segmented? We don't have a separate generic sentence model (yet... there's been talk about needing one for several different languages).
> It should be pretty easy to drop the nested entities, then.
It seems sad to throw out this information... Personally I think I'd prefer a model with nested entities trained just on the Wiki data, if it came down to that. Or you could try to predict on Knesset with a Wiki trained model, throw out any sentence where nested NER is predicted, and keep the others and concatenate to Wiki train (so Knesset would provide more examples for sentences which happen to probably not contain nesting).
> RFTokenizer ... how to do sentence segmentation once the words are segmented?
If you need any help with running RFTokenizer let me know - I think its accuracy is good, though it's not as fast as Stanza, and as you say, it needs a second pass for sentence splitting. The way we solved it in the 2022 paper was to then train a binary classifier on the resulting tokens to decide which ones were sentence boundaries. FWIW we also used position within the MWT as a dense embedding (BIO-encoded), which helped prevent splits mid-word and worked better than just binary-classifying MWTs.
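The two-pass setup described above (word segmentation first, then a boundary classifier over the resulting words) can be sketched at the feature-extraction level; the BIO encoding of position-within-MWT is the ingredient that reportedly helped. This is an illustrative sketch under those assumptions, not the paper's actual code - any off-the-shelf binary classifier would consume these features:

```python
def mwt_bio_tags(words, mwt_spans):
    """words: list of word strings. mwt_spans: list of (start, end) word
    index ranges covered by a multi-word token. Words outside any MWT get
    'O'; inside one, the first word gets 'B' and the rest 'I'."""
    tags = ["O"] * len(words)
    for start, end in mwt_spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

def boundary_features(words, mwt_spans):
    """Per-word feature dicts for a sentence-boundary classifier."""
    bio = mwt_bio_tags(words, mwt_spans)
    feats = []
    for i, w in enumerate(words):
        feats.append({
            "word": w,
            "mwt_pos": bio[i],  # embedded densely in the described model
            "is_punct": all(not c.isalnum() for c in w),
            "next_word": words[i + 1] if i + 1 < len(words) else "</s>",
        })
    return feats
```

A classifier seeing `mwt_pos` in {B, I} can learn never to place a boundary inside a multi-word token, which is exactly the mid-word split failure mode mentioned above.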
Normally the nesting looks like this, which is easily processed:

```
(WOA
(GPE)WOA)
```

Sometimes this happens instead:

```
(TTL)(PER
PER)
```

That would require a bit more logic to get right. Can I request a reordering of these things?
> It seems sad to throw out [nesting] information...
Honestly we don't really do nesting right now. There are some mechanisms for multiple output layers, but ultimately only one layer gets used as the final labels from a model. We were using that for fine-grained annotations. Nesting should be a future extension.
There is a case of `(PER)(PER)` as a label. Is that meaningful?

```
# newpar
# sent_id = iahltwiki_judea-samaria-145
14 ו ו CCONJ CCONJ _ 15 cc _ _
15 הרש הרש PROPN PROPN _ 13 conj _ Entity=(PER)
16 , , PUNCT PUNCT _ 17 punct _ _
17 קחצי קחצי PROPN PROPN _ 13 conj _ Entity=(PER)(PER)
18-19 הקברו _ _ _ _ _ _ _ _
18 ו ו CCONJ CCONJ _ 19 cc _ _
```
There's also a `TIMEX)(TIMEX` in:

```
# sent_id = iahltwiki_hakrav-al-husheniya-1973-22
# text = חפנ בחרמ לע טלתשהלו ילארשיה חטשה קמועל רודחל ידכ ,רבוטקואב 7/6 לילב ןלוגה תמר לש תיזכרמה הרזגל ירוסה דוקיפה ידי לע התנפוה ,םיפסונ םיקנט 250-כ הללכש ,הנושארה תניירושמה היזיווידה.
```
and a `(PER)(PER)` in:

```
# sent_id = iahltwiki_arik-einstein-206
# text = (יראכוב/דנלרוא) "ןומירה ץע"ו (סוארק/ןניק סומעו רפח) "הברעה תיב" ,(יממע/רפח) "םיסיריא" ויה םובלאב םירישה ןיב.
```
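Since the problematic cases above all share the `)(` pattern within a single token's `Entity=` value (while well-nested cases like `(GPE)WOA)` don't), a quick scan can surface candidates for manual reordering. A minimal sketch (function name is my own):

```python
def suspicious_entity_values(conllu_text):
    """Find Entity= values containing ')(' - a span ending immediately
    before another begins in the same token, e.g. '(TTL)(PER',
    'TIMEX)(TIMEX', or '(PER)(PER)'. Returns (sent_id, value) pairs."""
    hits = []
    sent_id = None
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):].strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        for item in cols[9].split("|"):  # MISC column
            if item.startswith("Entity=") and ")(" in item:
                hits.append((sent_id, item[len("Entity="):]))
    return hits
```

Running this over the dev-branch .conllu files would list every sentence needing a second look, without flagging the ordinary nested cases.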
I started to correct some of these, but then I remembered that for en_gum you have a separate repo, and PRs against the UD repo are not useful. Is the same true of this repo and its Entity labels? If so, perhaps I should not spend any more time on the reordering project.
I saw the great idea for combined models here:
https://stanfordnlp.github.io/stanza/combined_models.html
Is there a process to request more of these? Specifically I was thinking of Hebrew right now.