SmartManoj opened this issue 7 months ago
There isn't a lot of labeled data for Tamil, but we can possibly improve the results for Tamil by including a transformer or at least a charlm. Let me investigate that.
The simplest improvement to make was to add a transformer. I chose Google's MuRIL large, as it scored the highest on the dev sets of the UD POS and depparse tasks.
(Edit: you can use it now, with the existing 1.7.0 release, by passing `package="default_accurate"` when building a pipeline.)
If that's not sufficient improvement, we could also look into getting more data and including it in the model's training data.
Right:

```
செய் VERB தவம் PROPN
```

Wrong, with `default_accurate`:

```
செய் PUNCT தவம் PUNCT
```
Code:

```python
import logging

import stanza
from transformers import logging as hf_logging

# Silence transformers and stanza log output
# (aliasing the transformers logger avoids shadowing the stdlib logging module)
hf_logging.set_verbosity_error()
logging.getLogger('stanza').setLevel(logging.ERROR)

# Download and initialize the Tamil model
# stanza.download('ta')
print('Stanza model loading...')
if 1:
    nlp = stanza.Pipeline(lang='ta', package="default_accurate")
else:
    nlp = stanza.Pipeline(lang='ta')
print('Stanza model loaded.')

def do_nlp(text, verbose=False):
    doc = nlp(text)
    # Iterate over the sentences and words to print or collect POS tags
    if verbose:
        print(f'{"POS":<7} | {"WORD":<10}')
    res = []
    for sentence in doc.sentences:
        for word in sentence.words:
            if verbose:
                print(f"{word.pos:7} | {word.text}")
            else:
                res.append(word.pos)
    return ' '.join(res)

print('----------------------')
# Sample text in Tamil
if __name__ == '__main__':
    for i in ('செய்', 'தவம்'):
        print(i, do_nlp(i))
```
கற்ற is ADJ not ADV
Ultimately we would need more data to fix this. Maybe one of the other Tamil POS datasets I mentioned will be suitable
Wait... it shouldn't be tagging things PUNCT. Weirdly I thought that would be improved with some recent changes we made. I can investigate later on
Alright, if you try it again, I set the punct "dropout" for the end of sentences to be significantly higher. I should probably experiment to see if that can just be the default setting for all languages
Got an error:
```
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.7.0.json: 369kB [00:00, 24.6MB/s]
2023-12-10 19:08:17 INFO: Downloading default packages for language: ta (Tamil) ...
2023-12-10 19:08:18 INFO: File exists: C:\Users\smart\stanza_resources\ta\default.zip
2023-12-10 19:08:20 INFO: Finished downloading models and saved to C:\Users\smart\stanza_resources.
Stanza model loading...
2023-12-10 19:08:20 INFO: Checking for updates to resources.json in case models have been updated. Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.7.0.json: 369kB [00:00, 22.4MB/s]
2023-12-10 19:08:21 INFO: Loading these models for language: ta (Tamil):
=====================================
| Processor | Package               |
-------------------------------------
| tokenize  | ttb                   |
| mwt       | ttb                   |
| pos       | ttb_muril-large-cased |
| lemma     | ttb_nocharlm          |
| depparse  | ttb_muril-large-cased |
=====================================
2023-12-10 19:08:21 INFO: Using device: cpu
2023-12-10 19:08:21 INFO: Loading: tokenize
2023-12-10 19:08:22 INFO: Loading: mwt
2023-12-10 19:08:22 INFO: Loading: pos
2023-12-10 19:08:32 INFO: Loading: lemma
2023-12-10 19:08:32 INFO: Loading: depparse
Traceback (most recent call last):
  File "c:\Users\smart\Desktop\p2p\c5.py", line 11, in <module>
    nlp = stanza.Pipeline(lang='ta',package="default_accurate")
  File "C:\Python310\lib\site-packages\stanza\pipeline\core.py", line 304, in __init__
    self.processors[processor_name] = NAME_TO_PROCESSOR_CLASS[processor_name](config=curr_processor_config,
  File "C:\Python310\lib\site-packages\stanza\pipeline\depparse_processor.py", line 30, in __init__
    super().__init__(config, pipeline, device)
  File "C:\Python310\lib\site-packages\stanza\pipeline\processor.py", line 193, in __init__
    self._set_up_model(config, pipeline, device)
  File "C:\Python310\lib\site-packages\stanza\pipeline\depparse_processor.py", line 43, in _set_up_model
    self._trainer = Trainer(args=args, pretrain=self.pretrain, model_file=config['model_path'], device=device, foundation_cache=pipeline.foundation_cache)
  File "C:\Python310\lib\site-packages\stanza\models\depparse\trainer.py", line 34, in __init__
    self.load(model_file, pretrain, args, foundation_cache)
  File "C:\Python310\lib\site-packages\stanza\models\depparse\trainer.py", line 120, in load
    self.model = Parser(self.args, self.vocab, emb_matrix=emb_matrix, foundation_cache=foundation_cache)
  File "C:\Python310\lib\site-packages\stanza\models\depparse\model.py", line 38, in __init__
    self.lemma_emb = nn.Embedding(len(vocab['lemma']), self.args['word_emb_dim'], padding_idx=0)
  File "C:\Python310\lib\site-packages\stanza\models\common\vocab.py", line 228, in __getitem__
    return self._vocabs[key]
KeyError: 'lemma'
```
> I set
Where did you set?
That was a training parameter.
Sorry for the inconvenience with the models. That should be fixed now.
> கற்ற is ADJ not ADV
Now it is shown as VERB. Is there any visualization tool for how it makes that decision?
No visualization tool. However, I will point out that the models all expect context. A single word isn't a great query to give it if there's no surrounding text. I don't know about Tamil, but in English it wouldn't even be possible to correctly tag single words: "tag", "tool", "point", "query" being examples from this comment which would be ambiguous.
E.g. "A learned boy." → `DET VERB NOUN PUNCT`

Here, shouldn't "learned" be ADJ?
```python
import logging

import stanza
from transformers import logging as hf_logging

# Silence transformers and stanza log output
# (aliasing the transformers logger avoids shadowing the stdlib logging module)
hf_logging.set_verbosity_error()
logging.getLogger('stanza').setLevel(logging.ERROR)

# Download and initialize the model
# stanza.download('ta')
print('Stanza model loading...')
lang = 'ta'
lang = 'en'  # the later assignment wins; comment it out to test Tamil
if 1:
    nlp = stanza.Pipeline(lang=lang, package="default_accurate")
else:
    nlp = stanza.Pipeline(lang=lang)
print('Stanza model loaded.')

def do_nlp(text, verbose=False):
    doc = nlp(text)
    # Iterate over the sentences and words to print or collect POS tags
    if verbose:
        print(f'{"POS":<7} | {"WORD":<10}')
    res = []
    for sentence in doc.sentences:
        for word in sentence.words:
            if verbose:
                print(f"{word.pos:7} | {word.text}")
            else:
                res.append(word.pos)
    return ' '.join(res)

print('----------------------')
# Sample text
if __name__ == '__main__':
    words = ('கற்ற சிறுவன்',)
    words = ('A learned boy.',)  # the later assignment wins
    for i in words:
        print(i, do_nlp(i))
```
This one is kinda borderline, and I'll point to some examples of `trained` used as a verb in the EWT and GUM datasets:
```
# sent_id = newsgroup-groups.google.com_misc.consumers_a534e32067078b08_ENG_20060116_030800-0026
# text = They include 120,000 Iranian Revolutionary Guards trained for land and naval asymmetrical warfare.
4	Iranian	Iranian	ADJ	NNP	Degree=Pos	6	amod	6:amod	_
5	Revolutionary	Revolutionary	ADJ	NNP	Degree=Pos	6	amod	6:amod	_
6	Guards	Guard	PROPN	NNPS	Number=Plur	2	obj	2:obj	_
7	trained	train	VERB	VBN	Tense=Past|VerbForm=Part|Voice=Pass	6	acl	6:acl	_
8	for	for	ADP	IN	_	13	case	13:case	_
9	land	land	NOUN	NN	Number=Sing	13	compound	13:compound	_
10	and	and	CCONJ	CC	_	11	cc	11:cc	_
11	naval	naval	ADJ	JJ	Degree=Pos	9	conj	9:conj:and|13:compound	_
12	asymmetrical	asymmetrical	ADJ	JJ	Degree=Pos	13	amod	13:amod	_
13	warfare	warfare	NOUN	NN	Number=Sing	7	obl	7:obl:for	SpaceAfter=No

# sent_id = answers-20111108105225AAAJ9ek_ans-0014
# text = If your cat is not trained to use the litter pan, you may have a problem taking her.
1	If	if	SCONJ	IN	_	6	mark	6:mark	_
2	your	your	PRON	PRP$	Case=Gen|Person=2|Poss=Yes|PronType=Prs	3	nmod:poss	3:nmod:poss	_
3	cat	cat	NOUN	NN	Number=Sing	6	nsubj:pass	6:nsubj:pass|8:nsubj:xsubj	_
4	is	be	AUX	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	6	aux:pass	6:aux:pass	_
5	not	not	PART	RB	_	6	advmod	6:advmod	_
6	trained	train	VERB	VBN	Tense=Past|VerbForm=Part|Voice=Pass	15	advcl	15:advcl:if	_

# sent_id = answers-20111108111031AARG57j_ans-0015
# text = She is crate trained, potty trained, ...
1	She	she	PRON	PRP	Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs	4	nsubj:pass	4:nsubj:pass|7:nsubj:pass|11:nsubj	_
2	is	be	AUX	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	4	aux:pass	4:aux:pass	_
3	crate	crate	NOUN	NN	Number=Sing	4	obl:npmod	4:obl:npmod	_
4	trained	train	VERB	VBN	Tense=Past|VerbForm=Part|Voice=Pass	0	root	0:root	SpaceAfter=No
5	,	,	PUNCT	,	_	7	punct	7:punct	_
6	potty	potty	NOUN	NN	Number=Sing	7	obl:npmod	7:obl:npmod	_
7	trained	train	VERB	VBN	Tense=Past|VerbForm=Part|Voice=Pass	4	conj	4:conj:and	SpaceAfter=No

# sent_id = GUM_voyage_sydfynske-27
# text = Several rental places also gives you the option of trained guide, which can both provide information about the sights you visit, and make sure you are safe.
7	the	the	DET	DT	Definite=Def|PronType=Art	8	det	8:det	Entity=(109-abstract-new-cf3-2-sgl
8	option	option	NOUN	NN	Number=Sing	5	obj	5:obj	MSeg=opt-ion
9	of	of	ADP	IN	_	11	case	11:case	_
10	trained	train	VERB	VBN	Tense=Past|VerbForm=Part|Voice=Pass	11	amod	11:amod	Entity=(110-person-new-cf6-2-sgl|MSeg=train-ed
11	guide	guide	NOUN	NN	Number=Sing	8	nmod	8:nmod:of|16:nsubj	SpaceAfter=No
```
so maybe `learned` as VERB is correct here. However, regardless, there isn't a single instance of `learned` as an ADJ in the datasets we use to train, so I would never expect the model to get it right.
Did you think to point to examples of `learned` instead of `trained`?
Yes, as I stated earlier, I looked for those examples, and there was not a single example of `learned` in the training data. I mean, I do understand the meaning of "learned" that you're going for, but 1) as shown with the `trained` examples, it's not clear the annotation scheme we use would have tagged it as ADJ rather than as the past participle in the sense of "he learned something", and 2) it's immaterial, because there are 33 instances of `learned` as a VERB and 0 as an ADJ, so the statistical models we use will tag it as a VERB no matter what sentence you write.
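Counts like "33 as VERB, 0 as ADJ" can be checked with a few lines over the UD training files, since CoNLL-U puts the lemma in column 3 and the UPOS tag in column 4. A minimal sketch (the helper name is mine, and the embedded sample is just two `trained` lines from the excerpt above; in practice you would read the `.conllu` training file from disk):

```python
from collections import Counter

def count_upos_for_lemma(conllu_text, lemma):
    """Count UPOS tags for a given lemma in CoNLL-U formatted text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip blank lines and comment lines (# sent_id, # text, ...)
        if not line.strip() or line.startswith('#'):
            continue
        cols = line.split('\t')
        # Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
        # Skip multiword-token ranges (1-2) and empty nodes (3.1)
        if len(cols) >= 4 and '-' not in cols[0] and '.' not in cols[0]:
            if cols[2] == lemma:
                counts[cols[3]] += 1
    return counts

# Tiny sample: two of the EWT lines quoted in this thread
sample = (
    "7\ttrained\ttrain\tVERB\tVBN\tTense=Past|VerbForm=Part|Voice=Pass\t6\tacl\t6:acl\t_\n"
    "6\ttrained\ttrain\tVERB\tVBN\tTense=Past|VerbForm=Part|Voice=Pass\t15\tadvcl\t15:advcl:if\t_\n"
)

print(count_upos_for_lemma(sample, 'train'))  # Counter({'VERB': 2})
```

Running the same function over the full EWT and GUM training files with lemma `learn` is how one would reproduce the VERB/ADJ tally above.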
You seem to care deeply about this particular possible mistagging, so I created an issue where I asked people who know more linguistics than I do what their opinion is:
https://github.com/UniversalDependencies/docs/issues/1004
If they like ADJ, we can possibly add a few sentences to the English training data with the appropriate context, but that's not a "today" project, at any rate.
As I said, I am familiar with that meaning, and it is not in use anywhere in the training data, which makes it a serious problem for a statistical model to be able to predict that meaning. Is there some clarification needed on how that works?
> that you're going for

dictionary.com/browse/learned#:~:text=on%20Thesaurus.com-,adjective,-having%20much%20knowledge

I was saying that they list it as an adjective, not as a verb, there.
--
https://www.dictionary.com/browse/trained

For `trained`, they redirect to `train` itself.
--
What do you think about this?
**Describe the bug**
ஊறு

**To Reproduce**
Steps to reproduce the behavior:

**Output:**

**Environment (please complete the following information):**