stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Language Tamil - Wrong POS tag for "ஊறு" (VERB instead of ADJ) #1319

Open SmartManoj opened 7 months ago

SmartManoj commented 7 months ago

Describe the bug ஊறு

To Reproduce Steps to reproduce the behavior:

import logging
import stanza

logging.getLogger('stanza').setLevel(logging.ERROR)

# Download and initialize the Tamil model
# stanza.download('ta')

nlp = stanza.Pipeline(lang='ta')

# Sample text in Tamil
text = "ஊறு + காய் "
# Process the text
doc = nlp(text)

# Iterate over the sentences and tokens to print POS tags
print(f'{"POS":<7} | {"WORD":<10}')
for sentence in doc.sentences:
    for word in sentence.words:
        print(f"{word.pos:7} | {word.text}")

Output:

POS     | WORD      
ADV     | ஊறு
NOUN    | +
PROPN   | காய்

AngledLuffa commented 7 months ago

There isn't a lot of labeled data for Tamil, but we can possibly improve the results for Tamil by including a transformer or at least a charlm. Let me investigate that.

AngledLuffa commented 7 months ago

The simplest improvement to make was to add a transformer. I chose Google's Muril Large, as it scored the highest on the dev sets of the UD POS and depparse tasks.

(Edit: you can use it now, with the existing 1.7.0 release, with package="default_accurate" when building a pipeline)

If that's not sufficient improvement, we could also look into getting more data and including it in the model's training data.

SmartManoj commented 7 months ago

Right:

செய் VERB
தவம் PROPN

Wrong, with default_accurate:

செய் PUNCT
தவம் PUNCT


Code:

import logging

import stanza
from transformers import logging as hf_logging

# Silence log output from transformers and stanza.
# The transformers logging module is aliased so it does not shadow the stdlib one.
hf_logging.set_verbosity_error()
logging.getLogger('stanza').setLevel(logging.ERROR)

# Download and initialize the Tamil model
# stanza.download('ta')
USE_ACCURATE = True  # set to False to load the default package instead

print('Stanza model loading...')
if USE_ACCURATE:
    nlp = stanza.Pipeline(lang='ta', package="default_accurate")
else:
    nlp = stanza.Pipeline(lang='ta')
print('Stanza model loaded.')


def do_nlp(text, verbose=False):
    """Return the space-joined POS tags of text; with verbose=True, print a table instead."""
    doc = nlp(text)
    if verbose:
        print(f'{"POS":<7} | {"WORD":<10}')
    res = []
    # Iterate over the sentences and words to collect (or print) POS tags
    for sentence in doc.sentences:
        for word in sentence.words:
            if verbose:
                print(f"{word.pos:7} | {word.text}")
            else:
                res.append(word.pos)
    return ' '.join(res)


# Tag each Tamil sample word on its own
if __name__ == '__main__':
    for i in ('செய்', 'தவம்'):
        print(i, do_nlp(i))
SmartManoj commented 7 months ago

கற்ற is ADJ not ADV

AngledLuffa commented 7 months ago

Ultimately we would need more data to fix this. Maybe one of the other Tamil POS datasets I mentioned will be suitable

AngledLuffa commented 7 months ago

Wait... it shouldn't be tagging things PUNCT. Weirdly I thought that would be improved with some recent changes we made. I can investigate later on

AngledLuffa commented 7 months ago

Alright, please try it again: I set the punct "dropout" for the end of sentences significantly higher. I should probably experiment to see if that can just be the default setting for all languages.

SmartManoj commented 7 months ago

Got an error:

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.7.0.json: 369kB [00:00, 24.6MB/s]
2023-12-10 19:08:17 INFO: Downloading default packages for language: ta (Tamil) ...
2023-12-10 19:08:18 INFO: File exists: C:\Users\smart\stanza_resources\ta\default.zip
2023-12-10 19:08:20 INFO: Finished downloading models and saved to C:\Users\smart\stanza_resources.
Stanza model loading...
2023-12-10 19:08:20 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.7.0.json: 369kB [00:00, 22.4MB/s]
2023-12-10 19:08:21 INFO: Loading these models for language: ta (Tamil):
=====================================
| Processor | Package               |
-------------------------------------
| tokenize  | ttb                   |
| mwt       | ttb                   |
| pos       | ttb_muril-large-cased |
| lemma     | ttb_nocharlm          |
| depparse  | ttb_muril-large-cased |
=====================================

2023-12-10 19:08:21 INFO: Using device: cpu
2023-12-10 19:08:21 INFO: Loading: tokenize
2023-12-10 19:08:22 INFO: Loading: mwt
2023-12-10 19:08:22 INFO: Loading: pos
2023-12-10 19:08:32 INFO: Loading: lemma
2023-12-10 19:08:32 INFO: Loading: depparse
Traceback (most recent call last):
  File "c:\Users\smart\Desktop\p2p\c5.py", line 11, in <module>
    nlp = stanza.Pipeline(lang='ta',package="default_accurate")
  File "C:\Python310\lib\site-packages\stanza\pipeline\core.py", line 304, in __init__
    self.processors[processor_name] = NAME_TO_PROCESSOR_CLASS[processor_name](config=curr_processor_config,
  File "C:\Python310\lib\site-packages\stanza\pipeline\depparse_processor.py", line 30, in __init__
    super().__init__(config, pipeline, device)
  File "C:\Python310\lib\site-packages\stanza\pipeline\processor.py", line 193, in __init__
    self._set_up_model(config, pipeline, device)
  File "C:\Python310\lib\site-packages\stanza\pipeline\depparse_processor.py", line 43, in _set_up_model
    self._trainer = Trainer(args=args, pretrain=self.pretrain, model_file=config['model_path'], device=device, foundation_cache=pipeline.foundation_cache)
  File "C:\Python310\lib\site-packages\stanza\models\depparse\trainer.py", line 34, in __init__
    self.load(model_file, pretrain, args, foundation_cache)
  File "C:\Python310\lib\site-packages\stanza\models\depparse\trainer.py", line 120, in load
    self.model = Parser(self.args, self.vocab, emb_matrix=emb_matrix, foundation_cache=foundation_cache)
  File "C:\Python310\lib\site-packages\stanza\models\depparse\model.py", line 38, in __init__
    self.lemma_emb = nn.Embedding(len(vocab['lemma']), self.args['word_emb_dim'], padding_idx=0)
  File "C:\Python310\lib\site-packages\stanza\models\common\vocab.py", line 228, in __getitem__
    return self._vocabs[key]
KeyError: 'lemma'
SmartManoj commented 7 months ago

"I set"

Where did you set it?

AngledLuffa commented 7 months ago

That was a training parameter.

Sorry for the inconvenience with the models. That should be fixed now.

SmartManoj commented 7 months ago

Where

Got it https://huggingface.co/stanfordnlp/stanza-ta/commit/1a6352282b2e28a8aa9a9da7f33f215e71405745

SmartManoj commented 7 months ago

கற்ற is ADJ not ADV

Now it is tagged as VERB. Is there any tool to visualize how the model arrives at a tag?

AngledLuffa commented 7 months ago

No visualization tool. However, I will point out that the models all expect context. A single word isn't a great query to give it if there's no surrounding text. I don't know about Tamil, but in English it wouldn't even be possible to correctly tag single words: "tag", "tool", "point", "query" being examples from this comment which would be ambiguous.
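
One way to check the effect of context is to tag the same word once on its own and once inside a sentence, then compare. The sketch below is only an illustration: `pos_tags` and `compare_in_context` are hypothetical helper names, and `nlp` is assumed to be any callable that returns a document exposing `.sentences`, each with `.words` carrying `.text` and `.pos`, which is the shape used by the scripts in this thread.

```python
def pos_tags(nlp, text):
    """Return (word, UPOS) pairs for every word the pipeline finds in text."""
    doc = nlp(text)
    return [(word.text, word.pos)
            for sentence in doc.sentences
            for word in sentence.words]


def compare_in_context(nlp, target, sentence):
    """Tag `target` alone and inside `sentence`; return the tags from both runs."""
    alone = [pos for form, pos in pos_tags(nlp, target) if form == target]
    in_context = [pos for form, pos in pos_tags(nlp, sentence) if form == target]
    return {"alone": alone, "in_context": in_context}
```

With a loaded pipeline this could be called as `compare_in_context(nlp, "learned", "A learned boy.")`. The two lists may or may not differ, depending on the model.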

SmartManoj commented 7 months ago

E.g.: "A learned boy." is tagged DET VERB NOUN PUNCT.

Here, shouldn't "learned" be ADJ?

import logging

import stanza
from transformers import logging as hf_logging

# Silence log output from transformers and stanza.
# The transformers logging module is aliased so it does not shadow the stdlib one.
hf_logging.set_verbosity_error()
logging.getLogger('stanza').setLevel(logging.ERROR)

# Download and initialize the model for the chosen language
# stanza.download('ta')
lang = 'en'  # use 'ta' for Tamil
USE_ACCURATE = True  # set to False to load the default package instead

print('Stanza model loading...')
if USE_ACCURATE:
    nlp = stanza.Pipeline(lang=lang, package="default_accurate")
else:
    nlp = stanza.Pipeline(lang=lang)
print('Stanza model loaded.')


def do_nlp(text, verbose=False):
    """Return the space-joined POS tags of text; with verbose=True, print a table instead."""
    doc = nlp(text)
    if verbose:
        print(f'{"POS":<7} | {"WORD":<10}')
    res = []
    # Iterate over the sentences and words to collect (or print) POS tags
    for sentence in doc.sentences:
        for word in sentence.words:
            if verbose:
                print(f"{word.pos:7} | {word.text}")
            else:
                res.append(word.pos)
    return ' '.join(res)


# Sample text
if __name__ == '__main__':
    words = ('A learned boy.',)  # or ('கற்ற சிறுவன்',) with lang='ta'
    for i in words:
        print(i, do_nlp(i))
AngledLuffa commented 7 months ago

This one is kinda borderline, and I'll point to some examples of trained used as a verb in the EWT and GUM datasets:

# sent_id = newsgroup-groups.google.com_misc.consumers_a534e32067078b08_ENG_20060116_030800-0026
# text = They include 120,000 Iranian Revolutionary Guards trained for land and naval asymmetrical warfare.
4       Iranian Iranian ADJ     NNP     Degree=Pos      6       amod    6:amod  _
5       Revolutionary   Revolutionary   ADJ     NNP     Degree=Pos      6       amod    6:amod  _
6       Guards  Guard   PROPN   NNPS    Number=Plur     2       obj     2:obj   _
7       trained train   VERB    VBN     Tense=Past|VerbForm=Part|Voice=Pass     6       acl     6:acl   _
8       for     for     ADP     IN      _       13      case    13:case _
9       land    land    NOUN    NN      Number=Sing     13      compound        13:compound     _
10      and     and     CCONJ   CC      _       11      cc      11:cc   _
11      naval   naval   ADJ     JJ      Degree=Pos      9       conj    9:conj:and|13:compound  _
12      asymmetrical    asymmetrical    ADJ     JJ      Degree=Pos      13      amod    13:amod _
13      warfare warfare NOUN    NN      Number=Sing     7       obl     7:obl:for       SpaceAfter=No

# sent_id = answers-20111108105225AAAJ9ek_ans-0014
# text = If your cat is not trained to use the litter pan, you may have a problem taking her.
1       If      if      SCONJ   IN      _       6       mark    6:mark  _
2       your    your    PRON    PRP$    Case=Gen|Person=2|Poss=Yes|PronType=Prs 3       nmod:poss       3:nmod:poss     _
3       cat     cat     NOUN    NN      Number=Sing     6       nsubj:pass      6:nsubj:pass|8:nsubj:xsubj      _
4       is      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   6       aux:pass        6:aux:pass      _
5       not     not     PART    RB      _       6       advmod  6:advmod        _
6       trained train   VERB    VBN     Tense=Past|VerbForm=Part|Voice=Pass     15      advcl   15:advcl:if     _

# sent_id = answers-20111108111031AARG57j_ans-0015
# text = She is crate trained, potty trained, ...
1       She     she     PRON    PRP     Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs   4       nsubj:pass      4:nsubj:pass|7:nsubj:pass|11:nsubj      _
2       is      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   4       aux:pass        4:aux:pass      _
3       crate   crate   NOUN    NN      Number=Sing     4       obl:npmod       4:obl:npmod     _
4       trained train   VERB    VBN     Tense=Past|VerbForm=Part|Voice=Pass     0       root    0:root  SpaceAfter=No
5       ,       ,       PUNCT   ,       _       7       punct   7:punct _
6       potty   potty   NOUN    NN      Number=Sing     7       obl:npmod       7:obl:npmod     _
7       trained train   VERB    VBN     Tense=Past|VerbForm=Part|Voice=Pass     4       conj    4:conj:and      SpaceAfter=No

# sent_id = GUM_voyage_sydfynske-27
# text = Several rental places also gives you the option of trained guide, which can both provide information about the sights you visit, and make sure you are safe.
7       the     the     DET     DT      Definite=Def|PronType=Art       8       det     8:det   Entity=(109-abstract-new-cf3-2-sgl
8       option  option  NOUN    NN      Number=Sing     5       obj     5:obj   MSeg=opt-ion
9       of      of      ADP     IN      _       11      case    11:case _
10      trained train   VERB    VBN     Tense=Past|VerbForm=Part|Voice=Pass     11      amod    11:amod Entity=(110-person-new-cf6-2-sgl|MSeg=train-ed
11      guide   guide   NOUN    NN      Number=Sing     8       nmod    8:nmod:of|16:nsubj      SpaceAfter=No

so maybe learned as VERB is correct here. However, regardless, there isn't a single instance of learned as an ADJ in the datasets we use to train, so I would never expect the model to get it right.
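
For anyone who wants to repeat this kind of lookup, a rough sketch of counting the UPOS tags a word form receives in CoNLL-U text follows. `upos_counts` is a hypothetical helper, not part of Stanza, and the sample reuses a few of the lines quoted above. The only thing it assumes is the standard ten-column, tab-separated CoNLL-U layout with FORM in column 2 and UPOS in column 4.

```python
from collections import Counter


def upos_counts(conllu_text, form):
    """Count the UPOS tags assigned to `form` (case-insensitive) in CoNLL-U text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        if cols[1].lower() == form.lower():
            counts[cols[3]] += 1
    return counts


# A few token lines taken from the EWT example quoted above.
sample = "\n".join([
    "# text = She is crate trained, potty trained, ...",
    "3\tcrate\tcrate\tNOUN\tNN\tNumber=Sing\t4\tobl:npmod\t4:obl:npmod\t_",
    "4\ttrained\ttrain\tVERB\tVBN\tTense=Past|VerbForm=Part|Voice=Pass\t0\troot\t0:root\tSpaceAfter=No",
    "7\ttrained\ttrain\tVERB\tVBN\tTense=Past|VerbForm=Part|Voice=Pass\t4\tconj\t4:conj:and\tSpaceAfter=No",
])
print(upos_counts(sample, "trained"))
```

Run over a full training file instead of `sample`, the same function would show how often a form such as learned appears under each tag.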

SmartManoj commented 7 months ago

Did you consider pointing to examples of learned instead of trained?

https://www.dictionary.com/browse/learned

AngledLuffa commented 7 months ago

Yes, as I stated earlier, I looked for those examples, and there was not a single example of learned as an ADJ in the training data. I mean, I do understand the meaning of learned that you're going for, but 1) as shown with the trained examples, it's not clear whether the annotation scheme we used would have tagged it as ADJ or treated it like the past participle in "he learned something", and 2) it's immaterial because there are 33 instances of learned as a VERB and 0 as an ADJ, so the statistical models we use will tag it as a VERB no matter what sentence you write.

You seem to care deeply about this particular possible mistagging, so I created an issue where I asked people who know more linguistics than I do what their opinion is:

https://github.com/UniversalDependencies/docs/issues/1004

If they like ADJ we can possibly add a few sentences to the English training data with the appropriate context, but that's not a "today" project, at any rate.

SmartManoj commented 7 months ago

that you're going for

https://www.dictionary.com/browse/learned#:~:text=on%20Thesaurus.com-,adjective,-having%20much%20knowledge

AngledLuffa commented 7 months ago

As I said, I am familiar with that meaning, and it is not in use anywhere in the training data, which makes it a serious problem for a statistical model to be able to predict that meaning. Is there some clarification needed on how that works?
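
To make the point concrete, here is a deliberately simplified sketch: a most-frequent-tag baseline, which is not how Stanza's tagger works (that is a neural model that also looks at context), but it shows the same pressure from the data. The counts are the ones quoted earlier in this thread; `most_frequent_tag` and the `fallback` value are hypothetical.

```python
from collections import Counter

# Counts quoted earlier in the thread for "learned" in the training data.
train_counts = {"learned": Counter({"VERB": 33, "ADJ": 0})}


def most_frequent_tag(word, counts, fallback="X"):
    """Predict the tag seen most often with `word`; return `fallback` if unseen."""
    seen = counts.get(word)
    if not seen or sum(seen.values()) == 0:
        return fallback
    return seen.most_common(1)[0][0]


print(most_frequent_tag("learned", train_counts))
```

A tag with zero training examples can never win under this baseline, and a neural tagger has little more to go on unless the surrounding context resembles contexts where the training data does use ADJ.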

SmartManoj commented 7 months ago

that you're going for

dictionary.com/browse/learned#:~:text=on%20Thesaurus.com-,adjective,-having%20much%20knowledge

I was pointing out that the dictionary lists it as an adjective rather than a verb here.

--

https://www.dictionary.com/browse/trained

For trained, they redirect to the entry for train itself.

--

What do you think about this?