Open DavidNemeskey opened 8 years ago
The disambiguator is "emTag", the third one; you invoked the right PR. I guess it is not a bug, but a matter of the tool's performance. @dlazesz, could you please look at this issue?
The aforementioned examples on the raw tagger:

Józsi nagyot szív .
Józsi#Józsi#[/N][Nom] nagyot#nagy#[/Adj][Acc] szív#szív#[/N][Nom] .#.#OTHER

A szív egy szerv .
A#a#[/Det|art.Def] szív#szív#[/N][Nom] egy#egy#[/Det|art.NDef] szerv#szerv#[/N][Nom] .#.#OTHER

Megjött a fagy , mi fagy ma meg ?
Megjött#megjön#[/V][Pst.NDef.3Sg] a#a#[/Det|art.Def] fagy#fagy#[/V][Prs.NDef.3Sg] ,#,#OTHER mi#mi#[/N|Pro|Int][Nom] fagy#fagy#[/V][Prs.NDef.3Sg] ma#ma#[/Adv] meg#meg#[/Prev] ?#?#OTHER

Jancsi vár a vár előtt .
Jancsi#Jancsi#[/N][Nom] vár#vár#[/V][Prs.NDef.3Sg] a#a#[/Det|art.Def] vár#vár#[/N][Nom] előtt#előtt#[/Post] .#.#OTHER
The mistagged ones:
Józsi#Józsi#[/N][Nom] nagyot#nagy#[/Adj][Acc] szív#szív#[/N][Nom] .#.#OTHER:
There is no other analysis of "szív" in the training than [/N][Nom].
Megjött#megjön#[/V][Pst.NDef.3Sg] a#a#[/Det|art.Def] fagy#fagy#[/V][Prs.NDef.3Sg] ,#,#OTHER
Same goes for "fagy". Only [/V][Prs.NDef.3Sg] is present in the corpus.
The others are good.
This is a limitation of the model: if the word has been seen in the training set, the analyses provided by the morphology are intersected with the analyses seen in training. In our case, only one common element remains, so that one is chosen.
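To make the intersection step concrete, here is a minimal sketch of the behaviour described above. It is not emTag's actual code: the function name, the data layout, the analyses listed for "szív", and the fallback for an empty intersection are all assumptions made for illustration.

```python
def filter_analyses(word, morph_analyses, seen_in_training):
    """Restrict the morphology's analyses to those seen with this word in training.

    Hypothetical helper: all names and the data layout are illustrative only.
    """
    seen = seen_in_training.get(word)
    if seen is None:
        # Word never seen in training: keep everything the morphology offers.
        return list(morph_analyses)
    common = [a for a in morph_analyses if a in seen]
    # Assumption: if the intersection is empty, fall back to the morphology.
    return common or list(morph_analyses)


# Only the analyses reported in this thread as present in the training corpus.
SEEN = {"szív": {"[/N][Nom]"}, "fagy": {"[/V][Prs.NDef.3Sg]"}}

# Assumed morphology output for "szív" (noun and verb reading).
MORPH_SZIV = ["[/N][Nom]", "[/V][Prs.NDef.3Sg]"]

print(filter_analyses("szív", MORPH_SZIV, SEEN))
```

With these inputs the verb reading of "szív" is dropped before any n-gram context is consulted, which matches the mistagging shown above.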
Thanks for the analysis, but ... then what? I don't really see how the presence of these precise words in the training corpus was relevant. As I see it, there are two tag sequences in the output, which never occur in Hungarian:
ART VERB
ADJ+ACC NOUN+NOM
No sane disambiguator would choose these sequences in favour of the (correct) alternatives, irrespective of whether it has seen the tagged words at training time or not.
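For anyone who wants to look for these two sequences in tagger output, a rough sketch follows. It assumes the `word#lemma#tag` format shown in the examples above; the helper names and the exact tag patterns matched are my own and only approximate "ART VERB" and "ADJ+ACC NOUN+NOM".

```python
def parse_tagged(line):
    """Split a tagged line into (word, lemma, tag) triples."""
    # rsplit keeps a '#' inside the word form intact; tags are assumed '#'-free.
    return [tuple(tok.rsplit("#", 2)) for tok in line.split()]


def is_suspicious(prev_tag, tag):
    """True for the two tag bigrams named above (approximate patterns)."""
    art_verb = prev_tag.startswith("[/Det|art") and tag.startswith("[/V]")
    adjacc_nounnom = prev_tag == "[/Adj][Acc]" and tag == "[/N][Nom]"
    return art_verb or adjacc_nounnom


def suspicious_bigrams(line):
    """Return the word pairs whose tags form one of the suspicious bigrams."""
    toks = parse_tagged(line)
    return [(a[0], b[0]) for a, b in zip(toks, toks[1:])
            if is_suspicious(a[2], b[2])]


line = "Józsi#Józsi#[/N][Nom] nagyot#nagy#[/Adj][Acc] szív#szív#[/N][Nom] .#.#OTHER"
print(suspicious_bigrams(line))
```

Run over the four example sentences, this flags "nagyot szív" and "a fagy" and nothing in the other two.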
The original model used only the training corpus for learning. The morphology is then used to filter out bad analyses; this also holds for unknown words, that is, words known to the morphology but not present in the training corpus. There was no explicit provision for homonyms that are half known and half unknown, because in that case the easiest solution is to add specific examples to the training corpus, rather than just adding new analyses to the morphology and waiting for a miracle.
So in this sense the model prefers lexical disambiguation to n-gram disambiguation. I think this was an earlier design decision. (Maybe even in HunPOS.)
The fix you want would involve a complete redesign of the model. If you wish to submit PRs, please use the appropriate bug tracker. But keep in mind that patches should not reintroduce problems that have already been solved.
Or if you have other taggers to compare overall performance against, please let us know if any of them beats emTag.
In the run below, the pipeline achieves 33% accuracy on verb-noun homonym pairs:
(I couldn't upload the resulting XML file, because GitHub.)
This configuration was used:
Is it a bug in the disambiguator (which is which component again?), or did I just not invoke the right PR?