Closed by pySilver 10 years ago
I'll look at it.
This is not a bug in Morfologik. Here is the output from parsing your input:
> 'nie'
- nie, conj+qub
- on, ppron3:pl:acc:m2.m3.f.n1.n2.p2.p3:ter:akc.nakc:praep+ppron3:sg:acc:n1.n2:ter:akc.nakc:praep
> 'zabrakło'
- zabraknąć, verb:praet:sg:n1.n2:ter:perf:nonrefl
> 'oczywiście'
- oczywiście, adv:pos+qub
> 'wpadek'
- wpadka, subst:pl:gen:f
> 'największym'
- duży, adj:pl:dat:m1.m2.m3.f.n1.n2.p1.p2.p3:sup+adj:sg:inst:m1.m2.m3.n1.n2:sup+adj:sg:loc:m1.m2.m3.n1.n2:sup
- wielki, adj:pl:dat:m1.m2.m3.f.n1.n2.p1.p2.p3:sup+adj:sg:inst:m1.m2.m3.n1.n2:sup+adj:sg:loc:m1.m2.m3.n1.n2:sup
> 'zaskoczeniem'
- zaskoczenie, subst:sg:inst:n2
- zaskoczyć, ger:sg:inst:n2:perf:aff:refl.nonrefl
> 'okazał'
- okazać, verb:praet:sg:m1.m2.m3:ter:perf:refl.nonrefl
> 'się'
- się, siebie:acc:nakc+siebie:gen:nakc+subst:sg:nom:n2
> 'dla'
- dla, prep:gen
> 'nas'
- my, ppron12:pl:acc:m1.m2.m3.f.n1.n2.p1.p2.p3:pri+ppron12:pl:gen:m1.m2.m3.f.n1.n2.p1.p2.p3:pri+ppron12:pl:loc:m1.m2.m3.f.n1.n2.p1.p2.p3:pri
> 'strój'
- strój, subst:sg:acc:m3+subst:sg:nom:m3
- stroić, verb:impt:sg:sec:imperf:refl.nonrefl
> 'katarzyny'
> 'zielińskiej'
> 'której'
- który, adj:sg:dat:f:pos+adj:sg:gen:f:pos+adj:sg:loc:f:pos
> 'ewidentnie'
- ewidentnie, adv:pos
> 'o'
- o, interj+prep:acc+prep:loc
- ocean, brev:pun
- ojciec, brev:pun
> 'coś'
- coś, qub+subst:sg:acc:n2+subst:sg:gen:n2+subst:sg:nom:n2
> 'chodziło'
- chodzić, verb:praet:sg:n1.n2:ter:imperf:nonrefl
> 'ale'
- ale, conj+qub
> 'wciąż'
- wciąż, adv
> 'nie'
- nie, conj+qub
- on, ppron3:pl:acc:m2.m3.f.n1.n2.p2.p3:ter:akc.nakc:praep+ppron3:sg:acc:n1.n2:ter:akc.nakc:praep
> 'wiemy'
- wiedzieć, verb:fin:pl:pri:imperf:nonrefl+verb:fin:pl:pri:imperf:refl.nonrefl
> 'o'
- o, interj+prep:acc+prep:loc
- ocean, brev:pun
- ojciec, brev:pun
> 'co'
- co, comp+prep:acc+prep:nom+qub
- co, subst:sg:acc:n2
- co, subst:sg:gen:n2
- co, subst:sg:nom:n2
You have to look upstream at where this gets messed up. I've committed the sample code to produce the above to the repository.
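For readers unfamiliar with the tag notation in the output above, here is a minimal Python sketch of how such a tag string can be taken apart. It assumes the usual convention for this tagset: `+` joins alternative readings, `:` separates the fields of one reading, and `.` separates alternative values inside a field. The helper names `parse_tag` and `parts_of_speech` are made up for illustration and are not part of Morfologik.

```python
def parse_tag(tag):
    """Split a tag string into readings -> fields -> alternative values.

    Assumed convention: '+' joins readings, ':' separates fields,
    '.' separates alternative values within one field.
    """
    readings = []
    for alternative in tag.split("+"):
        fields = [field.split(".") for field in alternative.split(":")]
        readings.append(fields)
    return readings


def parts_of_speech(tag):
    """Return the first field (the word class) of every reading."""
    return [reading[0][0] for reading in parse_tag(tag)]


# The tag reported for 'o' in the output above.
print(parts_of_speech("interj+prep:acc+prep:loc"))

# The gender field of 'chodziło' carries two alternative values.
print(parse_tag("verb:praet:sg:n1.n2:ter:imperf:nonrefl")[0][3])
```

Seen this way, the `brev:pun` entries for 'o' are simply additional readings with the word class `brev`, which is why they appear next to the preposition and interjection readings.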
Damn... wait... the ocean/ojciec is there!
o o interj+prep:acc+prep:loc
o ocean brev:pun
o ojciec brev:pun
This seems to be a problem in the latest source forms dictionary. @milekpl can you confirm?
This is a feature of expanding abbreviations. Disambiguation is up to the user.
Marcin
This will look like a bug to most people. I honestly did not expect these to be abbreviations, and I don't think such entries make a lot of sense without punctuation or a disambiguation engine.
In any case, stripping brev:pun entries (by tag) would probably be a sensible thing to do in any downstream project.
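A minimal sketch of that kind of downstream filter, in Python for brevity. It assumes the analyses for a word are already available as (lemma, tag) pairs, as in the output earlier in this thread. The function name `strip_abbreviations` and the data layout are illustrative assumptions, not a Morfologik API.

```python
def strip_abbreviations(analyses):
    """Drop every reading whose word class is 'brev'.

    `analyses` is a list of (lemma, tag) pairs; a tag may join several
    readings with '+'. Lemmas left with no readings are removed entirely.
    """
    kept = []
    for lemma, tag in analyses:
        readings = [r for r in tag.split("+") if r.split(":")[0] != "brev"]
        if readings:
            kept.append((lemma, "+".join(readings)))
    return kept


# Analyses reported for the form 'o' in this thread.
O_ANALYSES = [
    ("o", "interj+prep:acc+prep:loc"),
    ("ocean", "brev:pun"),
    ("ojciec", "brev:pun"),
]

print(strip_abbreviations(O_ANALYSES))
```

With this filter a stemmer would return only the lemma "o" for the form 'o', at the cost of never expanding genuine abbreviations.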
This is also how Morfeusz works (or will work). Put simply, morphological analysis requires disambiguation.
Many people use morfologik-stemming for simple stemming, without full morphological analysis. We don't have a publicly available disambiguation engine either, so this isn't a (realistic) option.
There's a decent (IMHO) disambiguator available in LanguageTool. It is available also online:
http://community.languagetool.org/analysis/analyzeText
And this is how we analyze „Ewidentnie o coś chodziło”:
- SENT_START
Ewidentnie ewidentnie adv:pos
o o prep:acc
coś coś subst:sg:acc:n2
chodziło chodzić verb:praet:sg:n1.n2:ter:imperf:nonrefl
. - SENT_END PARA_END
I wasn't aware of that, pretty cool! Can it be made into (or is it already) a stand-alone component so that people can benefit from it? What algorithm is it based on? I'll look at the sources later on, just being lazy :)
Well, it just requires using LanguageTool as a tagger; I guess only the core module and the Polish module are required (plus the command-line module, to use the --taggeronly switch). The algorithm is the same as for the rules: simple pattern matching and rewriting. Yes, this is hand-crafted.
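To give a feel for what "pattern matching and rewriting" means here, below is a toy Python sketch of one hand-written rule. It is not LanguageTool's actual rule or rule format. The rule is invented for illustration: when a token has preposition readings and the next token has noun readings, keep only the readings whose grammatical case agrees. All function names are hypothetical.

```python
def expand(analyses):
    """Flatten (lemma, 'a+b') pairs into one (lemma, tag) pair per reading."""
    return [(lemma, r) for lemma, tag in analyses for r in tag.split("+")]


def case_of(tag):
    """Return the case field of a prep or subst reading, else None."""
    fields = tag.split(":")
    if fields[0] == "prep" and len(fields) > 1:
        return fields[1]
    if fields[0] == "subst" and len(fields) > 2:
        return fields[2]
    return None


def prep_noun_rule(left, right):
    """Toy rule: preposition + noun must agree in case.

    Pattern: `left` has prep readings and `right` has subst readings that
    share a case. Rewrite: keep only those agreeing readings on both tokens.
    If the pattern does not match, both tokens are returned unchanged.
    """
    prep_cases = {case_of(t) for _, t in left if t.startswith("prep:")}
    noun_cases = {case_of(t) for _, t in right if t.startswith("subst:")}
    shared = prep_cases & noun_cases
    if not shared:
        return left, right
    new_left = [(l, t) for l, t in left
                if t.startswith("prep:") and case_of(t) in shared]
    new_right = [(l, t) for l, t in right
                 if t.startswith("subst:") and case_of(t) in shared]
    return new_left, new_right


# Readings for 'o' and 'coś' as reported earlier in this thread.
o = expand([("o", "interj+prep:acc+prep:loc"),
            ("ocean", "brev:pun"), ("ojciec", "brev:pun")])
cos = expand([("coś", "qub+subst:sg:acc:n2+subst:sg:gen:n2+subst:sg:nom:n2")])

print(prep_noun_rule(o, cos))
```

On this pair the toy rule leaves `o` as `prep:acc` and `coś` as `subst:sg:acc:n2`, which matches the analysis quoted above and also discards the `brev:pun` readings. A real disambiguator needs many such rules and careful ordering.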
For the issue details, please see: https://github.com/monterail/elasticsearch-analysis-morfologik/issues/6