stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Incorrect Spanish verb decomposition #1395

Closed busdriverbuddha closed 3 weeks ago

busdriverbuddha commented 3 months ago

Describe the bug

The token "decírselo" is incorrectly decomposed into the words "decar", "se", "lo". "Decar" is not a word in Spanish. It should be "decír".

To Reproduce

import stanza
nlp = stanza.Pipeline("es", processors="mwt,tokenize")
doc = nlp('Decírselo.')
print(", ".join(w.text for w in doc.sentences[0].tokens[0].words))

This yields the output

Decar, se, lo

Expected behavior

The expected output is

Decír, se, lo

Environment (please complete the following information):

Additional context None at the moment.

AngledLuffa commented 3 months ago

Thanks, that's a useful observation. We can add it to the training data for MWT, and I'll have a new model ready probably Monday or so.

If you find others between now and then, please let us know and I'll add those as well.

busdriverbuddha commented 3 months ago

I had similar issues with "decírmelo" (decír+me+lo) and "dárselo" (dar+se+lo, without the accent).

Perhaps it would be interesting to add to the training data a variety of verbs with similar construction

EDIT: Please disregard this comment as it is incorrect.

AngledLuffa commented 3 months ago

I'm trying to figure out - why would the tokenized dar not have an accent, but decír does?

Generally speaking, the GSD treebank we base the Spanish models on doesn't have accents on any of the uses of decir. The unique factor here is that the original training data has no instances of decir plus two clitics, and generally words with the accent get tokenized so they still have the accent. For example,

# sent_id = es-train-003-s271
# text = Jacob, desempleado por una discusión que tuvo con Bretton James, y sabiendo que Winnie está esperando un hijo suyo, decide persuadir a Winnie de liberar el fideicomiso, para depositárselo a Gordon Gekko quien le ha prometido usarlos para consolidar una fortuna para Winnie y él.
33-35   depositárselo   _       _       _       _       _       _       _       _
33      depositár       depositar       VERB    _       VerbForm=Inf    28      advcl   _       _
34      se      él      PRON    _       Case=Acc,Dat|Person=3|PrepCase=Npr|PronType=Prs|Reflex=Yes      33      expl:pv _       _
35      lo      él      PRON    _       Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs     33      obj     _       _

so my thinking is that dárselo also splits with the accent on dar, even though it isn't the standard way of writing dar when it doesn't have clitics
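For readers less familiar with CoNLL-U, here is a minimal sketch of how the range line (`33-35`) relates to the word lines beneath it. The helper name `read_mwts` is hypothetical and is not part of Stanza or any UD tool; the data is the excerpt quoted above, with tabs written as `\t`.

```python
# Excerpt from the GSD sentence quoted above (columns are tab-separated).
CONLLU = """\
33-35\tdepositárselo\t_\t_\t_\t_\t_\t_\t_\t_
33\tdepositár\tdepositar\tVERB\t_\tVerbForm=Inf\t28\tadvcl\t_\t_
34\tse\tél\tPRON\t_\tCase=Acc,Dat|Person=3|PrepCase=Npr|PronType=Prs|Reflex=Yes\t33\texpl:pv\t_\t_
35\tlo\tél\tPRON\t_\tCase=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs\t33\tobj\t_\t_
"""


def read_mwts(conllu_text):
    """Hypothetical helper: return [(token_form, [(word_form, lemma), ...])]
    for every multi-word token (ID of the form start-end) in the text."""
    words_by_id = {}
    ranges = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        word_id, form, lemma = cols[0], cols[1], cols[2]
        if "-" in word_id:
            # Range line: the surface token covering several words.
            start, end = (int(part) for part in word_id.split("-"))
            ranges.append((form, start, end))
        elif word_id.isdigit():
            # Ordinary word line (empty nodes such as "8.1" are ignored).
            words_by_id[int(word_id)] = (form, lemma)
    return [
        (form, [words_by_id[i] for i in range(start, end + 1)])
        for form, start, end in ranges
    ]


for token_form, words in read_mwts(CONLLU):
    print(token_form, "->", words)
```

In this excerpt the word form `depositár` keeps the accent while its lemma `depositar` does not.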

busdriverbuddha commented 3 months ago

@AngledLuffa You're correct, I'm sorry. Decírselo would indeed be decomposed as decir+se+lo, without the accent, as you mentioned.

The only actual error made by Stanza, then, is the first one I pointed out: decírselo as decar-se-lo which should be decir-se-lo (without the accent, as you pointed out).

AngledLuffa commented 3 months ago

Actually it seems the standard in GSD is to keep the text intact but remove the accents in the lemma.

Some part of me wonders if that means we can deterministically split all the words aside from a few known exceptions. We found that for English, there are no exceptions at all, so we split all words into the raw text. Could do the same thing with exceptions for Spanish.
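To make the idea concrete, here is a rough sketch of what a deterministic split with an exception list might look like. This is not Stanza's implementation: the clitic inventory, the infinitive-only pattern, the `NO_SPLIT` set, and both function names are assumptions made for illustration. The split keeps the raw text of each piece, and the lemma guess drops only the acute accent.

```python
import re
import unicodedata

# Assumed clitic inventory; longer forms first so "los" wins over "lo".
CLITIC = "me|te|se|nos|os|los|las|les|lo|la|le"

# Assumed shape: an infinitive (ending in -ar/-er/-ir, possibly accented)
# followed by one or two clitics. Other verb forms are not handled.
MWT_PATTERN = re.compile(
    rf"^(?P<verb>.+?[aeiáéí]r)(?P<c1>{CLITIC})(?P<c2>{CLITIC})?$",
    re.IGNORECASE,
)

# Hypothetical exception list: words that match the pattern by accident.
NO_SPLIT = {"carlos"}


def split_clitics(token):
    """Split an infinitive+clitic token into raw-text pieces, or return
    [token] unchanged if it is an exception or does not match."""
    if token.lower() in NO_SPLIT:
        return [token]
    match = MWT_PATTERN.match(token)
    if match is None:
        return [token]
    return [piece for piece in match.group("verb", "c1", "c2") if piece]


def guess_lemma(verb_piece):
    """Lowercase the piece and remove acute accents only (ñ is preserved)."""
    decomposed = unicodedata.normalize("NFD", verb_piece.lower())
    return unicodedata.normalize("NFC", decomposed.replace("\u0301", ""))


print(split_clitics("Decírselo"))  # ['Decír', 'se', 'lo']
print(guess_lemma("Decír"))        # decir
print(split_clitics("Carlos"))     # ['Carlos'] (listed in NO_SPLIT)
```

Without the `NO_SPLIT` entry, the pattern would split "Carlos" into "Car" + "los", which is the kind of case an exception list would have to cover.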

busdriverbuddha commented 3 months ago

Well, that's certainly a valid line of investigation, and I wish I could contribute further, but unfortunately I don't actually speak Spanish. I use the Stanza constituency parser as part of a larger application that needs a constituency tree whose leaves are the actual tokens, not the words. The application therefore rebuilds the tree, merging each MWT into a single leaf, which is how I caught the discrepancy in the first place.

AngledLuffa commented 3 months ago

Hey, ran into an issue or two with the Spanish GSD dataset. Once we get that cleaned up I'll retry the models tomorrow. Guess it's time to put on my UD annotator hat...

AngledLuffa commented 3 months ago

Gah, I apologize for how long this is taking. So in the one treebank I was looking at, GSD, the tokens keep the accents after splitting. The same is true in PUD. In AnCora, though, which we were treating as the default, your observation that the accents disappear is correct.

Do you have a preference? I don't care too much either way.

https://github.com/UniversalDependencies/UD_Spanish-AnCora/issues/9
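One way to tell the two conventions apart mechanically is to compare the concatenated word forms with the surface token. The sketch below is only illustrative: `classify_split` and its three labels are made up for this example, and the inputs are the splits discussed in this thread rather than rows pulled from AnCora or PUD.

```python
import unicodedata


def strip_acute(text):
    """Remove acute accents only, leaving other diacritics (e.g. ñ) intact."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace("\u0301", ""))


def classify_split(token_text, word_texts):
    """Hypothetical helper: describe how the word forms relate to the token.

    "surface"         -> the words concatenate back to the token exactly
    "accents-dropped" -> they match only once acute accents are ignored
    "mismatch"        -> they do not match even then
    """
    joined = "".join(word_texts)
    if joined == token_text:
        return "surface"
    if strip_acute(joined) == strip_acute(token_text):
        return "accents-dropped"
    return "mismatch"


print(classify_split("depositárselo", ["depositár", "se", "lo"]))  # surface
print(classify_split("decírselo", ["decir", "se", "lo"]))          # accents-dropped
print(classify_split("Decírselo", ["Decar", "se", "lo"]))          # mismatch
```

The last call is the split from the original report, which fits neither convention.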

busdriverbuddha commented 3 months ago

Hi! I have no preference either, and there's also no urgency on my end - I've written a patch here to just ignore the spellings of the words and use the full token when merging, so the problem is solved as far as I'm concerned.
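The workaround described above could look roughly like the sketch below. It is not the actual patch, which is not shown in this thread. The function name `token_leaves` is hypothetical, and the stand-in objects only mimic the `.text` and `.words` attributes that Stanza tokens expose, so the example runs without downloading a model.

```python
from types import SimpleNamespace


def token_leaves(tokens):
    """Hypothetical helper: for each token, return its surface text and the
    number of word-level leaves it should replace in the rebuilt tree.
    The spellings of the individual words are deliberately ignored."""
    return [(token.text, len(token.words)) for token in tokens]


# Stand-ins for the tokens of 'Decírselo.' as reported at the top of the
# thread (including the incorrect word "Decar").
tokens = [
    SimpleNamespace(
        text="Decírselo",
        words=[SimpleNamespace(text=t) for t in ("Decar", "se", "lo")],
    ),
    SimpleNamespace(text=".", words=[SimpleNamespace(text=".")]),
]

print(token_leaves(tokens))  # [('Decírselo', 3), ('.', 1)]
```

Because the leaf label comes from the token text, a wrong word spelling such as "Decar" never reaches the merged tree.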

AngledLuffa commented 3 months ago

Got a bunch of data improvements from the UD team. I added the examples above and recreated all of the Spanish models as combined models with both AnCora and GSD. There's a bit of a performance hit when POS tagging the AnCora dev set, which we can investigate some. Otherwise, it seems to be working. I pushed those models as the new defaults for Spanish.

AngledLuffa commented 3 weeks ago

Now available on 1.9.0