nytud / hunlp-GATE

Lang_Hungarian - a GATE plugin containing Hungarian NLP tools as GATE processing resources
GNU General Public License v3.0

Incorrect POS tagging #7

Open DavidNemeskey opened 7 years ago

DavidNemeskey commented 7 years ago

In the run below, the pipeline achieves 33% accuracy on verb-noun homonym pairs (only one of the three pairs is tagged correctly):

Józsi nagyot szív.                                                               
A szív egy szerv.                                                                
Megjött a fagy, mi fagy ma meg?                                                  
Jancsi vár a vár előtt.

(I couldn't upload the resulting XML file, because GitHub.)

This configuration was used:

# HU 1. "emToken" Sentence Splitter and Tokenizer (QunToken, native) [Linux]     
hu.nytud.gate.tokenizers.QunTokenCommandLine                                     

# HU 2. "emMorph+emLem" Morphological Analyzer and Lemmatizer (HFST, hfst, native+java)
hu.nytud.gate.morph.HFSTMorphAndLemma                                            

# HU 3. "emTag" POS Tagger and Lemmatizer (PurePOS in magyarlanc3.0, hfst)       
hu.nytud.gate.postaggers.Magyarlanc3POSTaggerLemmatizer

Is it a bug in the disambiguator (which is which component again?), or did I just not invoke the right PR?

sassbalint commented 7 years ago

The disambiguator is "emTag", the third one, so you invoked the right PR. I guess it is not a bug but a matter of the tool's performance. @dlazesz could you please look at this issue?

dlazesz commented 7 years ago

The aforementioned examples on the raw tagger:

Józsi nagyot szív .
Józsi#Józsi#[/N][Nom] nagyot#nagy#[/Adj][Acc] szív#szív#[/N][Nom] .#.#OTHER

A szív egy szerv .
A#a#[/Det|art.Def] szív#szív#[/N][Nom] egy#egy#[/Det|art.NDef] szerv#szerv#[/N][Nom] .#.#OTHER

Megjött a fagy , mi fagy ma meg ?
Megjött#megjön#[/V][Pst.NDef.3Sg] a#a#[/Det|art.Def] fagy#fagy#[/V][Prs.NDef.3Sg] ,#,#OTHER mi#mi#[/N|Pro|Int][Nom] fagy#fagy#[/V][Prs.NDef.3Sg] ma#ma#[/Adv] meg#meg#[/Prev] ?#?#OTHER

Jancsi vár a vár előtt .
Jancsi#Jancsi#[/N][Nom] vár#vár#[/V][Prs.NDef.3Sg] a#a#[/Det|art.Def] vár#vár#[/N][Nom] előtt#előtt#[/Post] .#.#OTHER

The mistagged ones:

  1. Józsi#Józsi#[/N][Nom] nagyot#nagy#[/Adj][Acc] szív#szív#[/N][Nom] .#.#OTHER:

    There is no analysis of "szív" in the training corpus other than [/N][Nom].

  2. Megjött#megjön#[/V][Pst.NDef.3Sg] a#a#[/Det|art.Def] fagy#fagy#[/V][Prs.NDef.3Sg] ,#,#OTHER

    Same goes for "fagy". Only [/V][Prs.NDef.3Sg] is present in the corpus.

The others are good.

This is a limitation of the model: if a word has been seen in the training set, the analyses provided by the morphology are intersected with the analyses seen in training. In our case the intersection contains a single element, which is then chosen.
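
To make the intersection step concrete, here is a minimal sketch in Python. It is not PurePOS or emTag code: the function name, the toy lexicon, and the fallback for an empty intersection are all assumptions made for illustration. The tag strings are taken from the output above.

```python
# Toy illustration of the filtering step described above. All names and data
# here are hypothetical; this is not the actual PurePOS/emTag implementation.

# Tags observed for each word form in a (toy) training corpus.
# Assumption: "vár" was seen with both readings, "szív" and "fagy" with one.
SEEN_IN_TRAINING = {
    "szív": {"[/N][Nom]"},
    "fagy": {"[/V][Prs.NDef.3Sg]"},
    "vár": {"[/N][Nom]", "[/V][Prs.NDef.3Sg]"},
}


def candidate_tags(word, morph_analyses):
    """Return the tags the disambiguator may still choose from."""
    seen = SEEN_IN_TRAINING.get(word)
    if seen is None:
        # Word not in the training corpus: only the morphology constrains it.
        return set(morph_analyses)
    common = set(morph_analyses) & seen
    # Assumption: fall back to the morphology if the intersection is empty.
    return common or set(morph_analyses)


# The morphology offers both readings for these homonyms.
both = {"[/N][Nom]", "[/V][Prs.NDef.3Sg]"}
print(sorted(candidate_tags("szív", both)))  # only the noun reading survives
print(sorted(candidate_tags("vár", both)))   # both readings survive
```

With a single surviving candidate for "szív", the sentence context can no longer influence the choice, which matches the behaviour reported in this issue.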

DavidNemeskey commented 7 years ago

Thanks for the analysis, but ... then what? I don't really see how the presence of these precise words in the training corpus is relevant. As I see it, there are two tag sequences in the output that never occur in Hungarian:

No sane disambiguator would choose these sequences in favour of the (correct) alternatives, irrespective of whether it has seen the tagged words at training time or not.

dlazesz commented 7 years ago

The original model used only the training corpus for learning. The morphology was then used to filter out bad analyses; this also holds for unknown words, that is, words that are known to the morphology but not present in the training corpus. Homonyms that are half known and half unknown (one reading seen in training, the other known only to the morphology) were not directly anticipated, because in that case the easiest solution is to add specific examples to the training corpus, rather than just adding new analyses to the morphology and waiting for a miracle.

So in this sense the model prefers lexical disambiguation to n-gram disambiguation. I think this was an earlier design decision. (Maybe even in HunPOS.)
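
A toy contrast between the two strategies, as a hedged sketch in Python: the transition probabilities are made up for illustration and do not come from any trained model, and `best_tag` is a hypothetical helper, not part of emTag. It only shows why filtering by the lexicon first takes the decision away from the n-gram score.

```python
# Toy contrast between lexical filtering and n-gram (tag bigram) scoring.
# The probabilities are invented for illustration; they are not from any model.

DET, NOUN, VERB = "[/Det|art.Def]", "[/N][Nom]", "[/V][Prs.NDef.3Sg]"

# Hypothetical P(tag | previous tag): a definite article is rarely
# followed directly by a finite verb.
TRANSITION = {
    (DET, NOUN): 0.60,
    (DET, VERB): 0.01,
}


def best_tag(prev_tag, candidates):
    """Pick the candidate with the highest bigram transition score."""
    return max(candidates, key=lambda tag: TRANSITION.get((prev_tag, tag), 0.0))


morph_analyses = {NOUN, VERB}  # what the morphology offers for "fagy"
seen_in_training = {VERB}      # per this thread, only the verb reading was seen

# Scoring every morphological analysis lets the context pick the noun.
print(best_tag(DET, morph_analyses))
# Filtering by the lexicon first leaves one candidate; the score has no say.
print(best_tag(DET, morph_analyses & seen_in_training))
```

The first call selects `[/N][Nom]` and the second `[/V][Prs.NDef.3Sg]`, mirroring the "a fagy" case in the issue.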

The fix you want would involve a complete redesign of the model. If you wish to submit PRs, please use the appropriate bug tracker. But keep in mind that patches should not reintroduce other, already solved problems.

Or, if you have other taggers against which to measure overall performance, please let us know if any of them beats emTag.