stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Proiel parser exhibits odd behaviour with respect to punctuation #1311

Open pseudomonas opened 7 months ago

pseudomonas commented 7 months ago

Describe the bug

If there is a comma in the parsed sentence, the PROIEL model:

a) it does not tokenize the comma, but just bundles it with the preceding word (the lemma is affected similarly);
b) if the comma is space-delimited, it does unpredictable (to me!) things, up to and including tagging it as a verb with a lemma of ὁράω.

The fullstop/period is correctly tokenized, but is still never identified as punctuation. There does not seem to be any POS tag corresponding to punctuation emitted by the PROIEL model; the full list of tags on parsing a corpus is ADJ ADP ADV AUX CCONJ DET INTJ NOUN NUM PRON PROPN SCONJ VERB .

To Reproduce

import stanza
perseus = stanza.Pipeline('grc', processors='tokenize,pos,lemma', package="perseus")
proiel = stanza.Pipeline('grc', processors='tokenize,pos,lemma', package="proiel")

sent = "Ἐν ἀρχῇ ἦν ὁ Λόγος, καὶ ὁ Λόγος ἦν πρὸς τὸν Θεόν, καὶ Θεὸς ἦν ὁ Λόγος." # John 1:1, Nestlé 1904 edition of the New Testament

print(perseus(sent))
# Correct output (only relevant tokens shown here)
# {
#     "id": 6,
#     "text": ",",
#     "lemma": ",",
#     "upos": "PUNCT",
#     "xpos": "u--------",
#     "start_char": 18,
#     "end_char": 19
# }

# [...]

# {
#     "id": 20,
#     "text": ".",
#     "lemma": ".",
#     "upos": "PUNCT",
#     "xpos": "u--------",
#     "start_char": 69,
#     "end_char": 70
# }

print(proiel(sent))
# Comma not separated from the preceding word
# {
#     "id": 5,
#     "text": "Λόγος,",
#     "lemma": "Λόγος,",
#     "upos": "PROPN",
#     "xpos": "Ne",
#     "feats": "Case=Nom|Gender=Masc|Number=Sing",
#     "start_char": 13,
#     "end_char": 19
# }

# Fullstop parsed as an adverb
# {
#     "id": 6,
#     "text": ".",
#     "lemma": ".",
#     "upos": "ADV",
#     "xpos": "Df",
#     "start_char": 69,
#     "end_char": 70
# }

sent_with_space_before_comma = "Ἐν ἀρχῇ ἦν ὁ Λόγος , καὶ ὁ Λόγος ἦν πρὸς τὸν Θεόν , καὶ Θεὸς ἦν ὁ Λόγος."
print(proiel(sent_with_space_before_comma))

# Comma is now a token by itself, but is not identified as punctuation.
# {
#     "id": 6,
#     "text": ",",
#     "lemma": "ἤ",
#     "upos": "CCONJ",
#     "xpos": "C-",
#     "start_char": 19,
#     "end_char": 20
# },

# The second comma is also wrong, but different.
# {
#     "id": 1,
#     "text": ",",
#     "lemma": "ὁ",
#     "upos": "NOUN",
#     "xpos": "Nb",
#     "feats": "Case=Voc|Gender=Masc|Number=Sing",
#     "start_char": 50,
#     "end_char": 51
# },

Below are the results for those commas that were somehow parsed as individual tokens, from parsing the text of the Nestlé 1904 edition of the New Testament. They have been passed through sort -u to deduplicate them.

## Commas

Text    Lemma    POS     Features
,   ἤ   INTJ    None
,   Ἤ   PROPN   None
,   Ἤ   PROPN   Number=Sing
,   ὁ   NOUN    Case=Acc|Gender=Masc|Number=Sing
,   ὁ   NOUN    Case=Dat|Gender=Fem|Number=Plur
,   ὁ   NOUN    Case=Dat|Gender=Fem|Number=Sing
,   ὁ   NOUN    Case=Dat|Gender=Masc|Number=Sing
,   ὁ   NOUN    Case=Gen|Gender=Fem|Number=Sing
,   ὁ   NOUN    Case=Gen|Gender=Masc|Number=Sing
,   ὁ   NOUN    Case=Nom|Gender=Fem|Number=Plur
,   ὁ   NOUN    Case=Nom|Gender=Fem|Number=Sing
,   ὁ   NOUN    Case=Nom|Gender=Masc|Number=Sing
,   ὁ   NOUN    Case=Voc
,   ὁ   NOUN    Case=Voc|Gender=Fem|Number=Sing
,   ὁ   NOUN    Case=Voc|Gender=Masc|Number=Sing
,   ὁ   NOUN    Case=Voc|Number=Sing
,   ὁ   NOUN    Gender=Fem|Number=Sing
,   ὁ   NOUN    Gender=Masc|Number=Sing
,   ὁ   NOUN    None
,   ὁ   NOUN    Number=Sing
,   ὁράω    VERB    Aspect=Perf|Mood=Imp|Number=Sing|Person=2|Tense=Past|VerbForm=Fin|Voice=Act
,   ὁράω    VERB    Aspect=Perf|Mood=Ind|Number=Sing|Person=2|Tense=Past|VerbForm=Fin|Voice=Act
,   ὁράω    VERB    Mood=Imp|Number=Sing|Person=2|Tense=Past|VerbForm=Fin|Voice=Act
,   ὁράω    VERB    Mood=Imp|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin|Voice=Act
,   ὁράω    VERB    Mood=Ind|Number=Sing|Person=2|Tense=Past|VerbForm=Fin|Voice=Act
,   ὁράω    VERB    Number=Plur|Tense=Pres|VerbForm=Fin|Voice=Act
,   ὅς  ADJ Case=Dat|Degree=Pos|Gender=Masc|Number=Sing
,   ὅς  PRON    Case=Dat|Gender=Masc|Number=Sing|Person=1|PronType=Prs
,   ὅς  PRON    Case=Dat|Gender=Masc|Number=Sing|Person=2|PronType=Prs

## Fullstops
.   .   ADV None
.   ἤ   SCONJ   None
.   ὁ   PRON    Case=Dat|Gender=Masc|Number=Plur|PronType=Prs
.   ὁ   PRON    Case=Dat|Gender=Masc|Number=Sing|Person=2|PronType=Prs
.   ὁ   PRON    Case=Dat|Gender=Masc|Number=Sing|Person=3|PronType=Prs
.   ὁ   PRON    Case=Dat|Gender=Masc|Number=Sing|PronType=Prs
.   ὁ   PRON    Case=Dat|Gender=Masc|Number=Sing|PronType=Rel

Expected behavior

As with Perseus: commas should be tokenised separately from the preceding word, and both commas and fullstops should be annotated as punctuation.

Environment (please complete the following information):

AngledLuffa commented 7 months ago

Certainly this sucks, but the problem here is with the training data, and I'm not sure how we can fix it. The PROIEL dataset has zero (!) instances of either commas or periods.

One thing I just found is that the Perseus dataset has commas and a period analog which appears to be halfway up the line of text compared to a US period. For example, the first sentence looks like

# text = ἐρᾷ μὲν ἁγνὸς οὐρανὸς τρῶσαι χθόνα, ἔρως δὲ γαῖαν λαμβάνει γάμου τυχεῖν·

It would appear the XPOS tags are not remotely similar, but perhaps you could take a look to see whether the general annotation quality is similar. Are the tokenization, lemmatization, and dependency standards the same? We could probably mix the two if they are, or maybe you'd just get better results from switching to Perseus.

pseudomonas commented 7 months ago

The PROIEL dataset has zero (!) instances of either commas or periods.

I wondered if this had been the case, indeed; which is odd, given that PROIEL included biblical edition text.

I haven't looked into how feasible it would be to interconvert the treebanks and train on a mixture of both sources, or to use one of the sources as a pre-training task but not a fine-tuning task, assuming that the stanza models behave like other language models in this regard. So far I've used them as black-box algorithms.

pseudomonas commented 7 months ago

I am indeed now using Perseus — but especially since PROIEL is the default package in stanza for Ancient Greek, I thought this was worth noting.

pseudomonas commented 7 months ago

@AngledLuffa Reading the docs at https://stanfordnlp.github.io/stanza/new_language.html it looks like unlabelled text is only good for improving NER/Sentiment/Constituency parsing and not for any of the tasks I'm using (tokenize, lemma, POS, depparse). Is that actually the case?

AngledLuffa commented 7 months ago

I would say that if the other annotations follow similar formalisms, they would wind up benefiting the model by giving it more words it knows about and/or examples of unusual phenomena.

The small things I need to do in a short amount of time are kinda adding up, but long term I do think switching the default to Perseus and then exploring using data from both to make a "combined" model is probably the best approach here.

pseudomonas commented 7 months ago

I feel like in the long run it would be nice to be able to put a standard-architecture language model in there and have the stanza training script do the fine-tuning on that. I'm thinking especially of things like dbamman/latin-bert here (Latin is also a language that I need to support).

AngledLuffa commented 7 months ago

We actually do exactly that for some languages with the default_accurate models, although the transformers didn't fit into the tokenizer or lemmatizer architecture easily. I even found a transformer for Ancient Greek:

https://huggingface.co/pranaydeeps/Ancient-Greek-BERT

(feel free to suggest other options)

If you want, I can give that a try with Ancient Greek, but again, I'm up to my ears in small things that need doing and can't really commit to doing it for a few weeks.

AngledLuffa commented 7 months ago

other options:

https://huggingface.co/lgessler/microbert-ancient-greek-m
https://huggingface.co/lgessler/microbert-ancient-greek-mx
https://huggingface.co/lgessler/microbert-ancient-greek-mxp

https://huggingface.co/altsoph/bert-base-ancientgreek-uncased

These two have no description in the model card, which is kinda sus:

https://huggingface.co/niksss/Ancient-Greek-BERT-finetuned-wikitext2
https://huggingface.co/Sonnenblume/bert-base-uncased-ancient-greek-v4

pseudomonas commented 7 months ago

Well, I could give it a whirl if you can point me at docs on how to do the fine-tuning and plumbing it into the system; this is stuff I need for work so I feel like I should at least try to contribute!

I'm aware of the microbert models; they're nice and fast to train (and they're what I'm working on using for Coptic), so if they work, this would be generally applicable to most of the stanza languages.

AngledLuffa commented 7 months ago

Basically you just need to go through the retraining instructions with the flags --use_bert --bert_model ...

I actually found in some limited experiments that finetuning the transformer itself for POS didn't help given the complexity of the inference head we use. We've had some recent success anyway finetuning for constituency parsing or coref with LoRA or with careful experimentation for the finetuning method. However, the calendar for expanding that to other models is "after I get out from under this crushing TODO list" or "after I can scam an undergrad @Jemoka into doing it"

Jemoka commented 7 months ago

Hello, I am that undergrad and I'd love to look into it this weekend. @pseudomonas, @AngledLuffa do you think I can be more helpful starting with—

  1. trying to create a "combined" model with the old architecture combining both datasets—which should give very good performance, but we won't get a Bert out of the system?
  2. exploring the transformer Bert embedding situation, applying our work on LoRA + Coref to Greek, and trying to fine-tune a transformer for downstream Greek tasks?

As @AngledLuffa said, Bert support is pretty good, but I don't think it has been done for this area yet. Though, if one of the two packages works, perhaps it would be more interesting to look into training/LoRAing a Transformer on the task instead of getting a better model simply by combining the two sets.

AngledLuffa commented 7 months ago

@Jemoka I was thinking refactoring the usage of Peft and giving it a try on the POS or depparse would both be interesting and useful, especially once we wrap up the Coref usage of Peft

Certainly as a baseline, switching to Perseus and experimenting with a few of the above models to see which works best would give a better model for short term usage

pseudomonas commented 7 months ago

do you think I can be more helpful starting with

Combining the treebanks seems like it will provide benefits, if it can be done; and a BERT can presumably be added on top of that at a later point. But I don't know how compatible the annotation guidelines of the two projects are.

pseudomonas commented 7 months ago

@Jemoka I think in terms of improving performance longer-term across Stanza, being able to leverage BERT-integration would be good. I'm probably going to try @AngledLuffa's suggestion https://github.com/stanfordnlp/stanza/issues/1311#issuecomment-1828961531 in any case. I'm not sure how this corresponds (either in terms of performance or in terms of mechanism) to fine-tuning a BERT to perform the task directly.

Jemoka commented 7 months ago

Sounds good. @pseudomonas Feel free to start with the Bert work there, and I can start on the PEFT-a-large-model end that @AngledLuffa mentioned, doing Greek POS first as a test case. Hopefully you can end up with a good model in the short term and we can release an adapter that performs even better in the long term.

LMK if you run into anything with Bert tuning.

pseudomonas commented 7 months ago

@AngledLuffa if I'm training a model and the training is interrupted, what are the command-line flags for "resume training starting with this saved checkpoint"?

AngledLuffa commented 7 months ago

If it's giving you the message that the model already exists, you can overwrite the existing model with --force. I haven't added resuming from a checkpoint for POS because it only takes a couple hours to retrain the whole model anyway. It's somewhere on the TODO list, though...

AngledLuffa commented 7 months ago

I'll have results later this morning for the Perseus POS trained on a few different Ancient Greek transformers. I can also do the same thing for depparse, and there's even time to include those models in the upcoming 1.7.0 release. I don't have time over this weekend to build a pretrained charlm (probably from something like https://figshare.com/articles/dataset/The_Diorisis_Ancient_Greek_Corpus/6187256), but that can be an action item for later.

Jemoka commented 7 months ago

@AngledLuffa I will start over the weekend on PEFT for POS and depparse, taking a hopefully good pretrained Bert as a starting point. Once you explore some Ancient Greek transformers, don't hesitate to let me know what you would recommend; I will also dig into this a little later on my own.

pseudomonas commented 7 months ago

it took my little computer over a day to reproduce the benchmark, so I might try running the BERT one on my work's cluster with GPUs…

Jemoka commented 7 months ago

Yes, running on GPUs would make this process a lot faster; also, the upcoming PEFT work (in theory, results/benchmarks TBD) should make inference a smidge faster because it's multiplying fewer parameters.

AngledLuffa commented 7 months ago

So far, I would say the pranaydeeps model improves scores the most, but I will give a full report after a few more model trainings

AngledLuffa commented 7 months ago

As listed above, there are a few Ancient Greek transformers available on HF. Here are the dev scores on the POS & depparse tasks:

    #    Model           POS        Depparse LAS
    # None              0.8812       0.7684
    # Microbert M       0.8883       0.7706
    # Microbert MX      0.8910       0.7755
    # Microbert MXP     0.8916       0.7742
    # Pranaydeeps Bert  0.9139       0.7987

https://huggingface.co/altsoph/bert-base-ancientgreek-uncased

I could not use this one because of this error:

https://huggingface.co/altsoph/bert-base-ancientgreek-uncased/discussions/2

So based on those scores, I made the pranaydeeps model the default_accurate package. That will be available as part of the 1.7.0 release... I suppose we can even make a sneak peek of that available now

https://test.pypi.org/project/stanza/1.7.0/

AngledLuffa commented 7 months ago

You will probably want to use a GPU for the default_accurate package, btw.
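Loading it should look something like the following (a rough sketch, assuming the package is selected with the same package keyword used for the existing packages, and that the sneak-peek 1.7.0 wheel above or a later release is installed):

import stanza

# 'default_accurate' selects the transformer-backed models discussed above;
# they are much faster on a GPU.
stanza.download('grc', package='default_accurate')
nlp = stanza.Pipeline('grc', processors='tokenize,pos,lemma,depparse',
                      package='default_accurate')
print(nlp("Ἐν ἀρχῇ ἦν ὁ Λόγος, καὶ ὁ Λόγος ἦν πρὸς τὸν Θεόν."))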

My takeaway from the rest of this thread is that there are a few separate directions for improvement still:

At any rate, I don't think any of these are immediate TODOs, so hopefully we've improved the situation enough for now and we can leave the issue open in anticipation of future improvements.

pseudomonas commented 7 months ago

Your baseline scores (Model == None) are rather higher than those on https://stanfordnlp.github.io/stanza/performance.html, assuming that POS is XPOS rather than UPOS; that page has UPOS = 92.41, XPOS = 85.13, LAS = 73.97.

AngledLuffa commented 7 months ago

Those are test scores; these are dev scores. It didn't seem fair to pick a model based on how well they do on the test set.

The POS score is a weighted combination of UPOS, XPOS, and feats.

Jemoka commented 7 months ago

@AngledLuffa I wonder whether, if we trained the model with our newfangled EOS punct augmentation, it would also do better even on the Perseus dataset

AngledLuffa commented 7 months ago

I believe that is the default now


pseudomonas commented 6 months ago

I've found a different but related issue with both the Perseus and PROIEL parsers, which is that they perform incredibly badly with accents stripped out (they do things like processing definite articles and the most common adverbs as nouns).

Is there a way of using the data augmentation that is used to make them tolerant of line-final punctuation to make them tolerant of absence of accents? My use-case for the parsers is processing manuscripts that lack accents.

The code I'm using is just

import unicodedata

def strip_accents(s):
    # drop all combining marks (category Mn): accents, breathings, iota subscripts
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) not in ('Mn',))

though this might want some refinement so that subscript iotas are randomly either removed or replaced by a normal iota.
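Something like the following is what I have in mind (a hypothetical sketch, just for illustration; U+0345 is the combining iota subscript that NFD leaves behind):

import random
import unicodedata

YPOGEGRAMMENI = '\u0345'  # combining Greek ypogegrammeni (iota subscript) after NFD

def strip_accents_variable_iota(s, p_expand=0.5):
    out = []
    for c in unicodedata.normalize('NFD', s):
        if c == YPOGEGRAMMENI:
            # subscript iota: randomly expand to a plain iota (τῳ -> τωι) or drop it (τῳ -> τω)
            if random.random() < p_expand:
                out.append('ι')
        elif unicodedata.category(c) != 'Mn':
            out.append(c)
    return ''.join(out)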

I'm also wondering about whether the data being unicode-decomposed before training would help it generalise.

AngledLuffa commented 5 months ago

I can see how that would be a problem. However, how correct will we be able to make it if we use a pretrained embedding or even a transformer? The tokens / tokenizer will have the accents as well, I would think. What about cases where multiple different texts with accents map to the same text without accents?

Nevertheless, if you think it will help, I don't see any reason we can't provide a model like that using the augmentation mechanism.

pseudomonas commented 5 months ago

Good points! I should first try it out with one of the transformer models and see if that provides enough experience of unaccented texts to cause it to generalise.

AngledLuffa commented 5 months ago

I can also train the entire thing with your conversion, then see how its scores are doing. If there isn't a big dropoff, then I guess no reason not to do a model with that conversion for GRC

pseudomonas commented 5 months ago

If you've got the time and computational resources to do that, it would certainly be appreciated!

AngledLuffa commented 5 months ago

Minor point to be aware of: the conversion sometimes makes words completely empty in the Perseus training set.

If I use just the word vectors, the model trained on no accents gets the following dev score:

   UPOS    XPOS  UFeats AllTags
  92.48   85.29   90.44   84.86

Its performance on the accented version is suitably horrible:

grc_perseus
   UPOS    XPOS  UFeats AllTags
  45.11   33.19   57.99   32.84

The original does better than this:

   UPOS    XPOS  UFeats AllTags
  94.29   88.38   92.83   88.12

and the original has a similar huge dropoff in quality when used on the non-accented data:

   UPOS    XPOS  UFeats AllTags
  47.21   32.99   46.18   32.53

I can try training a model on a straight mix of both accented and unaccented, then see where that gets us

Jemoka commented 5 months ago

Is this due to the fact that the accents have no morpheme-level representation learned by BPE (yet)? As in, the model basically treats accented versions as individual characters, and so we see catastrophic forgetting of the original embedding?

pseudomonas commented 5 months ago

I would hazard a guess that running a Unicode decomposition before training would help it learn the relationship between accented and unaccented letters.
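Just to illustrate what I mean by decomposition (a quick sketch; the character names are as given in the Unicode tables):

import unicodedata

# NFD splits each accented letter into a base letter plus combining marks,
# so accented and unaccented spellings share the same base characters.
print([unicodedata.name(c) for c in unicodedata.normalize('NFD', 'ἀρχῇ')])
# ['GREEK SMALL LETTER ALPHA', 'COMBINING COMMA ABOVE', 'GREEK SMALL LETTER RHO',
#  'GREEK SMALL LETTER CHI', 'GREEK SMALL LETTER ETA', 'COMBINING GREEK PERISPOMENI',
#  'COMBINING GREEK YPOGEGRAMMENI']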

AngledLuffa commented 5 months ago

is this due to the fact that the accents have no morpheme-level learned by BPE (yet?)

That was just using word vectors, not the various transformers

AngledLuffa commented 5 months ago

If I train on both, there's actually a noticeable dropoff in the dev score vs. the original dataset:

   UPOS    XPOS  UFeats AllTags
  93.89   87.26   91.86   86.87

I guess one thing that might help would be to use the word vector for the accented form of a word in place of the word vector for the unaccented form, when the unaccented form doesn't have a vector of its own.
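Roughly something like this (a hypothetical sketch, not the actual stanza internals; it reuses the strip_accents function from earlier in the thread and assumes vectors is a plain word-to-vector dict):

def add_unaccented_fallbacks(vectors):
    # For each accented word, register its vector under the stripped spelling as well,
    # unless the stripped spelling already has a vector of its own.
    fallback = {}
    for word, vec in vectors.items():
        stripped = strip_accents(word)
        if stripped != word and stripped not in vectors:
            fallback.setdefault(stripped, vec)
    return {**vectors, **fallback}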

Would you explain a bit more why this is necessary? Under what circumstances is this relevant to the processing? Always, or is it just that some domains have this problem?

Also, should we be experimenting with this for all of the annotators, not just POS?

One possibility would be to provide versions of the Perseus parser w/ and w/o the unaccented words

pseudomonas commented 5 months ago

Would you explain a bit more why this is necessary?

I'm processing transcriptions of manuscripts that lack diacritic information (in most cases the manuscript lacks the diacritics; in some cases the manuscript has them but they have not been transcribed).

There's a wrinkle in that subscript iotas are in some manuscripts omitted (so τῳ becomes τω) and sometimes expanded (τῳ becomes τωι). This is due to the manuscripts being from several centuries across a wide geographical area.

Also, should we be experimenting with this for all of the annotators, not just POS?

My use case is certainly to use features beyond POS, including dependency relationships.

I'm looking into whether the default_accurate package generalises across presence/absence of accents in a useful way. It certainly seems like the sort of thing I'd expect an LM to bring to the table.
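The check I have in mind is roughly the following (a sketch, reusing the strip_accents function from above; it assumes the default_accurate package is installed, and the tag comparison is only approximate since the tokenization may differ):

import stanza

nlp = stanza.Pipeline('grc', processors='tokenize,pos,lemma', package='default_accurate')

accented = "Ἐν ἀρχῇ ἦν ὁ Λόγος, καὶ ὁ Λόγος ἦν πρὸς τὸν Θεόν, καὶ Θεὸς ἦν ὁ Λόγος."
stripped = strip_accents(accented)

# Compare the UPOS tags pairwise on the accented and unaccented versions.
tags_accented = [w.upos for s in nlp(accented).sentences for w in s.words]
tags_stripped = [w.upos for s in nlp(stripped).sentences for w in s.words]
matches = sum(a == b for a, b in zip(tags_accented, tags_stripped))
print(f"{matches} matching UPOS tags out of {len(tags_accented)}")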

AngledLuffa commented 5 months ago

The accuracy is enough of a hit that I wouldn't want to make this the default for general usage, but I can see making it available as an optional package.

I'll run some tests on the transformer models as well.

I wonder if diacritic restoration would be a worthwhile project.

AngledLuffa commented 5 months ago

If I train (the transformer POS model) on no-diacritics, I get this:

dev with diacritics:
   UPOS    XPOS  UFeats AllTags
  85.18   76.94   86.02   76.01

dev without diacritics:
   UPOS    XPOS  UFeats AllTags
  95.74   90.77   94.09   90.43

Trained on both, I get

dev with diacritics:
   UPOS    XPOS  UFeats AllTags
  95.67   90.76   94.47   90.50

dev without diacritics:
   UPOS    XPOS  UFeats AllTags
  95.48   90.21   93.94   89.94

Trained on just the dataset with accents:

dev with diacritics:
   UPOS    XPOS  UFeats AllTags
  96.26   91.57   94.74   91.39

dev without diacritics:
   UPOS    XPOS  UFeats AllTags
  92.91   85.92   90.51   84.99

So, no matter what, there's some kind of hit in quality. Maybe the existing default_accurate is already good enough, or we could add either the combined or the no-diacritics versions of these models as an additional package.