morfologik / morfologik-stemming

Tools for finite state automata construction and dictionary-based morphological dictionaries. Includes Polish stemming dictionary.
BSD 3-Clause "New" or "Revised" License

Improve performance in pair-replacements. Fix bug in isNotCapitalized #20

Closed: jaumeortola closed this 10 years ago

jaumeortola commented 10 years ago

Remarks:

milekpl commented 10 years ago

Jaume, regarding wordLen < MIN_WORD_LENGTH: I never get any sensible suggestions for such words, at least for the Polish corpora that I used for testing. The corpora were relatively free of errors, so your mileage may vary, but in my experience we're wasting time generating those suggestions. If you get sensible results, we can add a new property, but I really want to suppress such suggestions anyway.

jaumeortola commented 10 years ago

In Catalan, suggestions for short words are very much needed and, with the frequency data, the results are very good (qeu>que, uan>una...). This may not be true for other languages. What about English? If you type "aer", you expect the suggestions "are" or "air". Of course, the frequency data is indispensable here; that is what it was meant for. So please add a new property...
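
To illustrate why frequency data matters for short words, here is a minimal sketch of ranking candidate corrections by corpus frequency. The class `ShortWordSuggestions`, the method `rank`, and the frequency numbers are hypothetical and are not part of the morfologik API; candidate generation (edit distance, pair replacements) is assumed to happen elsewhere.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a morfologik class.
public class ShortWordSuggestions {

    // Returns the candidates ordered from most to least frequent.
    // Words missing from the frequency table are treated as frequency 0.
    // List.sort is stable, so ties keep their original relative order.
    static List<String> rank(List<String> candidates, Map<String, Integer> frequency) {
        List<String> sorted = new ArrayList<>(candidates);
        sorted.sort(
            Comparator.comparingInt((String w) -> frequency.getOrDefault(w, 0)).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        // Made-up frequencies, for illustration only.
        Map<String, Integer> frequency = Map.of("are", 900, "air", 300, "ear", 100);
        System.out.println(rank(List.of("ear", "are", "air"), frequency));
    }
}
```

Without the frequency table, all three candidates for "aer" are equally close to the typed word, so their order would be arbitrary.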

milekpl commented 10 years ago

Well, my problem concerns short uppercased acronyms, which cannot possibly have any sensible suggestions at all, but we generate them anyway. So maybe short uppercase words should be suppressed? Anyway, let's consider this later on.
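
The compromise floated here (keep suggestions for short words, but skip short all-uppercase words that are probably acronyms) could be sketched roughly as follows. The class `SuggestionGate` and both method names are hypothetical and do not come from morfologik; `minWordLength` stands in for the `MIN_WORD_LENGTH` constant mentioned above, whose actual value is not given in this thread.

```java
// Hypothetical helper, not a morfologik class.
public class SuggestionGate {

    // True if the word contains at least one letter and every letter is uppercase.
    // Digits and punctuation are ignored.
    static boolean isAllUppercase(String word) {
        boolean sawLetter = false;
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            if (Character.isLetter(cp)) {
                if (!Character.isUpperCase(cp)) {
                    return false;
                }
                sawLetter = true;
            }
            i += Character.charCount(cp);
        }
        return sawLetter;
    }

    // Suppress suggestions only for words that are both shorter than
    // minWordLength and entirely uppercase; everything else gets suggestions.
    static boolean shouldSuggest(String word, int minWordLength) {
        int length = word.codePointCount(0, word.length());
        return !(length < minWordLength && isAllUppercase(word));
    }

    public static void main(String[] args) {
        System.out.println(shouldSuggest("USA", 4)); // short acronym
        System.out.println(shouldSuggest("qeu", 4)); // short lowercase typo
    }
}
```

Under this rule a short lowercase typo such as "qeu" still reaches the suggestion generator, while a short acronym such as "USA" does not.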