morfologik / morfologik-stemming

Tools for finite state automata construction and dictionary-based morphological dictionaries. Includes Polish stemming dictionary.
BSD 3-Clause "New" or "Revised" License

Analyzer finds tokens that haven't been mentioned in original string #27

Closed by pySilver 10 years ago

pySilver commented 10 years ago

For the issue details, please see: https://github.com/monterail/elasticsearch-analysis-morfologik/issues/6

dweiss commented 10 years ago

I'll look at it.

dweiss commented 10 years ago

This is not a bug in Morfologik. Here is the output from parsing your input:

> 'nie'
  - nie, conj+qub
  - on, ppron3:pl:acc:m2.m3.f.n1.n2.p2.p3:ter:akc.nakc:praep+ppron3:sg:acc:n1.n2:ter:akc.nakc:praep
> 'zabrakło'
  - zabraknąć, verb:praet:sg:n1.n2:ter:perf:nonrefl
> 'oczywiście'
  - oczywiście, adv:pos+qub
> 'wpadek'
  - wpadka, subst:pl:gen:f
> 'największym'
  - duży, adj:pl:dat:m1.m2.m3.f.n1.n2.p1.p2.p3:sup+adj:sg:inst:m1.m2.m3.n1.n2:sup+adj:sg:loc:m1.m2.m3.n1.n2:sup
  - wielki, adj:pl:dat:m1.m2.m3.f.n1.n2.p1.p2.p3:sup+adj:sg:inst:m1.m2.m3.n1.n2:sup+adj:sg:loc:m1.m2.m3.n1.n2:sup
> 'zaskoczeniem'
  - zaskoczenie, subst:sg:inst:n2
  - zaskoczyć, ger:sg:inst:n2:perf:aff:refl.nonrefl
> 'okazał'
  - okazać, verb:praet:sg:m1.m2.m3:ter:perf:refl.nonrefl
> 'się'
  - się, siebie:acc:nakc+siebie:gen:nakc+subst:sg:nom:n2
> 'dla'
  - dla, prep:gen
> 'nas'
  - my, ppron12:pl:acc:m1.m2.m3.f.n1.n2.p1.p2.p3:pri+ppron12:pl:gen:m1.m2.m3.f.n1.n2.p1.p2.p3:pri+ppron12:pl:loc:m1.m2.m3.f.n1.n2.p1.p2.p3:pri
> 'strój'
  - strój, subst:sg:acc:m3+subst:sg:nom:m3
  - stroić, verb:impt:sg:sec:imperf:refl.nonrefl
> 'katarzyny'

> 'zielińskiej'

> 'której'
  - który, adj:sg:dat:f:pos+adj:sg:gen:f:pos+adj:sg:loc:f:pos
> 'ewidentnie'
  - ewidentnie, adv:pos
> 'o'
  - o, interj+prep:acc+prep:loc
  - ocean, brev:pun
  - ojciec, brev:pun
> 'coś'
  - coś, qub+subst:sg:acc:n2+subst:sg:gen:n2+subst:sg:nom:n2
> 'chodziło'
  - chodzić, verb:praet:sg:n1.n2:ter:imperf:nonrefl
> 'ale'
  - ale, conj+qub
> 'wciąż'
  - wciąż, adv
> 'nie'
  - nie, conj+qub
  - on, ppron3:pl:acc:m2.m3.f.n1.n2.p2.p3:ter:akc.nakc:praep+ppron3:sg:acc:n1.n2:ter:akc.nakc:praep
> 'wiemy'
  - wiedzieć, verb:fin:pl:pri:imperf:nonrefl+verb:fin:pl:pri:imperf:refl.nonrefl
> 'o'
  - o, interj+prep:acc+prep:loc
  - ocean, brev:pun
  - ojciec, brev:pun
> 'co'
  - co, comp+prep:acc+prep:nom+qub
  - co, subst:sg:acc:n2
  - co, subst:sg:gen:n2
  - co, subst:sg:nom:n2

You have to look upstream at where this gets messed up. I've committed the sample code to produce the above to the repository.

dweiss commented 10 years ago

Damn... wait... the ocean/ojciec is there!

dweiss commented 10 years ago

o o interj+prep:acc+prep:loc
o ocean brev:pun
o ojciec brev:pun

This seems to be a problem in the latest source forms dictionary. @milekpl can you confirm?

milekpl commented 10 years ago

This is a feature of expanding abbreviations. Disambiguation is up to the user.

Marcin

On 11 Aug 2014 at 15:29, "Dawid Weiss" wrote:

o o interj+prep:acc+prep:loc
o ocean brev:pun
o ojciec brev:pun

This seems to be a problem in the latest source forms dictionary. @milekpl can you confirm?


dweiss commented 10 years ago

This will look like a bug to most people. I honestly didn't expect abbreviations to be included... and I don't think such entries make a lot of sense. Not without punctuation... or a disambiguation engine.

In any case, stripping brev:pun (by tag) would probably be a sensible thing to do in any downstream project.
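
A minimal sketch of that tag-based filtering, written in Python for brevity rather than against the real morfologik-stemming API. The `Analysis` tuple and the `strip_tags` helper are hypothetical names. The sketch assumes only what the output above shows: each reading is a lemma plus a tag string whose alternatives are joined with `+`.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    """One dictionary reading: a lemma and its '+'-joined tag alternatives."""
    lemma: str
    tags: str


def strip_tags(analyses, unwanted=("brev:pun",)):
    """Drop unwanted tag alternatives; drop readings left with no tags."""
    kept = []
    for a in analyses:
        remaining = [t for t in a.tags.split("+") if t not in unwanted]
        if remaining:
            kept.append(Analysis(a.lemma, "+".join(remaining)))
    return kept


# Readings reported above for the token 'o'.
readings_for_o = [
    Analysis("o", "interj+prep:acc+prep:loc"),
    Analysis("ocean", "brev:pun"),
    Analysis("ojciec", "brev:pun"),
]

lemmas = [a.lemma for a in strip_tags(readings_for_o)]
print(lemmas)  # ['o']
```

With a filter like this in front of the token stream, a downstream analyzer would no longer emit "ocean" or "ojciec" for a bare "o", at the cost of never expanding those abbreviations at all.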

milekpl commented 10 years ago

This is also how Morfeusz works (or will work). Simply put, morphological analysis requires disambiguation.

dweiss commented 10 years ago

Many people use morfologik-stemming for simple stemming, without full morphological analysis. We don't have a publicly available disambiguation engine either, so this isn't a (realistic) option.

milekpl commented 10 years ago

There's a decent (IMHO) disambiguator available in LanguageTool. It is also available online:

http://community.languagetool.org/analysis/analyzeText

And this is how we analyze „Ewidentnie o coś chodziło”:

- SENT_START
Ewidentnie ewidentnie adv:pos
o o prep:acc
coś coś subst:sg:acc:n2
chodziło chodzić verb:praet:sg:n1.n2:ter:imperf:nonrefl
. - SENT_END PARA_END

dweiss commented 10 years ago

I wasn't aware of that, pretty cool! Can it be made into (or is it already) a stand-alone component so that people can benefit from it? What algorithm is it based on? I'll look at the sources later on, just being lazy :)

milekpl commented 10 years ago

Well, it just requires using LanguageTool as a tagger; I guess only the core module and the Polish module are required (plus the command line, to use the --taggeronly switch). The algorithm is the same as for the rules: simple pattern matching and rewriting. Yes, this is hand-crafted.
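
To illustrate what "pattern matching and rewriting" can mean here, a toy sketch follows. It is not LanguageTool's rule format or code; the `disambiguate` function and its single rule are hypothetical. The rule assumes that a `brev:pun` reading is only plausible when the next token is a period, and removes it otherwise.

```python
def disambiguate(tokens):
    """tokens: list of (form, [(lemma, tag), ...]) pairs in sentence order.

    Hypothetical hand-written rule: keep a 'brev:pun' reading only when the
    following token is '.', and never remove a token's last reading.
    """
    result = []
    for i, (form, readings) in enumerate(tokens):
        next_form = tokens[i + 1][0] if i + 1 < len(tokens) else None
        if next_form != ".":
            filtered = [r for r in readings if r[1] != "brev:pun"]
            # Fall back to the original readings if the rule would remove all.
            readings = filtered or readings
        result.append((form, readings))
    return result


sentence = [
    ("ewidentnie", [("ewidentnie", "adv:pos")]),
    ("o", [("o", "interj+prep:acc+prep:loc"),
           ("ocean", "brev:pun"),
           ("ojciec", "brev:pun")]),
    ("coś", [("coś", "subst:sg:acc:n2")]),
]

for form, readings in disambiguate(sentence):
    print(form, readings)
```

A real disambiguator needs many such rules and still has to choose among the remaining readings (here `interj`, `prep:acc` and `prep:loc`); this sketch only shows the shape of one context-dependent rewrite.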