stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

bracket chars not separated as punctuation in Spanish #1257

Open psnider opened 1 year ago

psnider commented 1 year ago

**Describe the bug**
When Stanza for Spanish processes text inside square brackets, the brackets are not recognized as punctuation. Instead, they remain attached to the adjacent text.

The brackets are used in a transcript to indicate who is speaking.

**To Reproduce**

from stanza.pipeline.core import Pipeline

nlp_es_parse = Pipeline(lang="es", processors="tokenize")
s="[Daniel Alarcón]: Esto es Radio Ambulante, desde NPR. Soy Daniel Alarcón."
res=nlp_es_parse(s)

The first two entries of `res` (simplified) are:

    {
      "id": 1,
      "text": "[Daniel",
    },
    {
      "id": 2,
      "text": "Alarcón]",
    },

**Expected behavior**

I expect behavior similar to English:

nlp_en_parse = Pipeline(lang="en", processors="tokenize")
res=nlp_en_parse(s)

gives:


    {
      "id": 1,
      "text": "[",
    },
    {
      "id": 2,
      "text": "Daniel",
    },
    {
      "id": 3,
      "text": "Alarcón",
    },
    {
      "id": 4,
      "text": "]",
    },

**Environment (please complete the following information):**
 - macOS Monterey 12.1 (21C52)
 - Python 3.9.7 from Anaconda3 
 - stanza-1.5.0
 - java version "1.8.0_371" 

**Additional context**
none
AngledLuffa commented 1 year ago

The probably unsatisfying answer is that all of the brackets in AnCora are round. We could probably detect that and teach the tokenizer to treat square brackets the same as round ones. I'll do that tomorrow.


psnider commented 1 year ago

I also found these characters at the beginnings or ends of words:

- – En Dash, U+2013 (https://symbl.cc/en/2013/)
- — End Of Guarded Area, U+0097 (https://symbl.cc/en/0097/)
- … Horizontal Ellipsis, U+2026 (https://symbl.cc/en/2026/)

Probably a similar issue. As I understand it, this is primarily determined by the training set used. Is this correct?

Also, as a workaround, I already added a way to find these characters and strip them out.
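A minimal sketch of such a workaround (my own illustration, not Stanza code): pre-insert spaces around the problem characters before calling the pipeline, so the tokenizer sees them as standalone tokens.

```python
import re

# Characters the Spanish tokenizer left attached to neighboring words:
# square brackets, en dash (U+2013), and horizontal ellipsis (U+2026).
PROBLEM_CHARS = re.compile(r"([\[\]\u2013\u2026])")

def separate_punct(text: str) -> str:
    # Surround each problem character with spaces so it tokenizes on its own.
    return PROBLEM_CHARS.sub(r" \1 ", text)

print(separate_punct("[Daniel Alarcón]: Esto es Radio Ambulante…"))
```

The extra whitespace is harmless to a whitespace-aware tokenizer, and character offsets can be recomputed afterwards if needed.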

psnider commented 1 year ago

I also just found an odd case that seems like it might be related.

From this text: "Ronald ya no pudo seguir ocultándolo… se lo mostró a su mamá." Notice that "ocultándolo" is followed by the horizontal ellipsis character (U+2026). The tokens for the segment "ocultándolo… se" follow. Oddly, the ellipsis token seems to be expanded into two words: "<UNK>" and "e".

        {
          "id": [
            21,
            22
          ],
          "text": "ocultándolo",
          "start_char": 211,
          "end_char": 222,
          "ner": "O",
          "multi_ner": [
            "O"
          ]
        },
        {
          "id": 21,
          "text": "ocultando",
          "lemma": "ocultar",
          "upos": "VERB",
          "xpos": "vmg0000",
          "feats": "VerbForm=Ger"
        },
        {
          "id": 22,
          "text": "lo",
          "lemma": "él",
          "upos": "PRON",
          "feats": "Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs"
        },
        {
          "id": [
            23,
            24
          ],
          "text": "…",
          "start_char": 222,
          "end_char": 223,
          "ner": "O",
          "multi_ner": [
            "O"
          ]
        },
        {
          "id": 23,
          "text": "<UNK>",
          "lemma": "<UNK>",
          "upos": "NOUN",
          "xpos": "ncms000",
          "feats": "Gender=Masc|Number=Sing"
        },
        {
          "id": 24,
          "text": "e",
          "lemma": "e",
          "upos": "CCONJ",
          "xpos": "cc"
        },
        {
          "id": 25,
          "text": "se",
          "lemma": "él",
          "upos": "PRON",
          "xpos": "pp3cn000",
          "feats": "Case=Dat|Person=3|PrepCase=Npr|PronType=Prs|Reflex=Yes",
          "start_char": 224,
          "end_char": 226,
          "ner": "O",
          "multi_ner": [
            "O"
          ]
        },
AngledLuffa commented 1 year ago

I made some changes to the training to hopefully capture [] if the dataset doesn't already have [] in it.

Also, we were in fact attempting to augment the ellipses, but clearly not often enough.

A colleague had suggested making the tokenizer augmentations more dynamic, so that they occur on all the sentences some fraction of the time, rather than our current method of changing the same sentences each time through the training loop. That will also have to wait until the end of the month, I think.
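The dynamic scheme being described could look roughly like this (a hypothetical sketch, not Stanza's actual training code): instead of augmenting a fixed subset of sentences once, each sentence is augmented with some probability on every pass through the data.

```python
import random

def augment_each_epoch(sentences, augment_fn, p=0.1, rng=None):
    """Apply augment_fn to each sentence with probability p.

    Called once per epoch, this augments a different random subset on
    every pass through the data, instead of the same fixed sentences.
    """
    rng = rng or random.Random()
    return [augment_fn(s) if rng.random() < p else s for s in sentences]

# Example augmentation: swap round brackets for square ones,
# so the model sees square brackets even if the treebank has none.
swap = lambda s: s.replace("(", "[").replace(")", "]")
```

With `p=0.1`, roughly 10% of sentences are augmented each epoch, and over many epochs most sentences are eventually seen in both forms.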

I'll have the new model available tomorrow morning.


AngledLuffa commented 1 year ago

If you are using the dev branch, there is now a Spanish tokenizer which processes [], at least on the example you gave above.

I'm not sure there will be much improvement for ellipses, but I will have more time to finish this up in a couple weeks.


psnider commented 1 year ago

I found an error that seems to be related to the use of ellipses. Adding an ellipsis to the end of a phrase changes the VERB features for at least this one word. The features appear to be correct without the ellipsis, but change with it.

from stanza.pipeline.core import Pipeline

processors = "tokenize,mwt,ner,sentiment,pos,lemma"
parser = Pipeline(lang="es", processors=processors)
without_ellipsis = "Y felices no estaban"
results_without = parser(without_ellipsis)
with_ellipsis = "Y felices no estaban…"
results_with = parser(with_ellipsis)

The last word in each is shown (simplified):