hunspell / hunspell

The most popular spellchecking library.
http://hunspell.github.io/
GNU Lesser General Public License v2.1

Condition field correction (dot and 'no more characters') #237

Closed phajdan closed 7 years ago

phajdan commented 9 years ago

Hello, I was making a dictionary and ran into a situation where two words with similar endings must share one SFX flag. A plain example:

file.dic

2
work/A
rework/A

file.aff

SFX A Y 2
SFX A 0 less work #work --> workless
SFX A 0 ing ework #work --> working; rework -> reworking

The goal is that the second word (rework) must not be affixed by the first rule, while the first word (work) must be. It would be logical for a dot (.) as the first character of the condition field to permit affixing a word that merely has the same ending, and for a condition without a leading dot to mean "only these characters". Like:

SFX A 0 ing .work #rework -> reworking; but no 'work -> working' affixing.

and

SFX A 0 ing work #work -> working; but no 'rework -> reworking' affixing.

So, is it possible to fix this in the next release?

Original comment by: sspphheerraa

Original Ticket: hunspell/bugs/270

phajdan commented 9 years ago

I do not understand why you insist on having the same class for work and rework.

You can simply say:

file.dic

work/A
rework/B

file.aff

SFX A Y 2
SFX A 0 less #work --> workless
SFX A 0 ing #work --> working; rework -> reworking

SFX B Y 1
SFX B 0 ing #work --> working; rework -> reworking
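
To make the effect of this split concrete, here is a small Python sketch that expands the two dictionary entries using the suffixes from the rules listed above. It is only an illustration: the `RULES` table and the `expand` helper are hypothetical names, each rule is reduced to "append this suffix", and stripping and conditions are ignored.

```python
# Suffixes per flag, transcribed from the example .aff rules (conditions omitted).
RULES = {
    "A": ["less", "ing"],
    "B": ["ing"],
}

def expand(entry: str) -> list[str]:
    """Return the stem plus every suffixed form for a 'stem/FLAGS' entry."""
    stem, _, flags = entry.partition("/")
    forms = [stem]
    for flag in flags:  # single-character flags are assumed here
        for suffix in RULES.get(flag, []):
            forms.append(stem + suffix)
    return forms

for entry in ["work/A", "rework/B"]:
    print(entry, "->", expand(entry))
```

With separate flags, "rework" never sees the "less" rule, so no condition is needed to exclude it.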

Original comment by: swan46

phajdan commented 9 years ago

Yes, I know. "work" and "rework" are just an abstract example in English. The actual words are Russian, and there are a great many mnemonics for each part of speech. With the approach you suggest, the affix file becomes very complicated; it is better to have one flag per grammar mnemonic. Anyway, that is beside the point. IINM, the condition field is a 'regular expression', so maybe there is a way to meet my needs with the regular means (without modifying the sources). The only thing needed is a way to express "no character at all" at a given character position.

Original comment by: sspphheerraa

phajdan commented 9 years ago

If your wish can be satisfied without code modification, then I have no objections.

Some thoughts about building affix and dictionary files: because of language complexity, flags can be any UTF-8 character (at least 256 characters), and flags can even be 2 characters long, which means 256x256 possible flags.

The .aff and .dic files for Hungarian and Finnish, and probably also for Turkish (if any), are not created manually but generated by scripts in perl or awk/shell. They rely on the fact that there are word classes that can use the same group of flags.
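
A generator of this kind can be very small. The sketch below, written in Python rather than perl or awk, maps word classes to flag groups and emits .dic text with the leading word-count line. The class names, the flags, and the `build_dic` helper are all invented for illustration and are not taken from any real dictionary.

```python
# Hypothetical word classes, each mapped to the group of flags it uses.
CLASS_FLAGS = {
    "verb": "AB",
    "noun": "C",
}

def build_dic(words: list[tuple[str, str]]) -> str:
    """Build .dic text: a word-count line, then one 'word/FLAGS' line per word."""
    lines = [str(len(words))]
    for word, word_class in words:
        flags = CLASS_FLAGS.get(word_class, "")
        # Words whose class has no flags are written without a slash.
        lines.append(f"{word}/{flags}" if flags else word)
    return "\n".join(lines) + "\n"

print(build_dic([("work", "verb"), ("rework", "verb"), ("file", "noun")]), end="")
```

The point of the design is that the flag assignment lives in one table, so changing a class's flag group regenerates every affected entry.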

see: http://manpages.ubuntu.com/manpages/dapper/man4/hunspell.4.html

FLAG long
SFX Y1 Y 1
SFX Y1 0 s 1

dictionary example using flags Y1, Z3 and F?

foo/Y1Z3F?
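
For readers unfamiliar with FLAG long, this Python sketch shows how the flag field of such an entry splits into two-character flags. `split_long_flags` is a hypothetical helper written for this example, not a Hunspell function.

```python
def split_long_flags(entry: str) -> tuple[str, list[str]]:
    """Split 'word/FLAGS' into the word and its two-character flags."""
    word, _, flag_field = entry.partition("/")
    if len(flag_field) % 2 != 0:
        raise ValueError(f"odd-length flag field: {flag_field!r}")
    # Take the flag field two characters at a time.
    flags = [flag_field[i:i + 2] for i in range(0, len(flag_field), 2)]
    return word, flags

print(split_long_flags("foo/Y1Z3F?"))
```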

Maybe you should consider the above strategy for Russian as well, which doubtless has more complex affixes than English or German.

Original comment by: swan46

phajdan commented 9 years ago

"If your wish can be satisfied without code modification, then I have no objections."

What metacharacters are allowed in the condition field? I have tried $?.()[]^\ but have not found any that marks where the word must end. But there must be a mechanism that calculates the word length. Maybe two empty brackets with a circumflex, [^], could be used to mark the word end.

I see this comment in affentry.cxx:

// upon entry suffix is 0 length or already matches the end of the word.
// So if the remaining root word has positive length
// and if there are enough chars in root word and added back strip chars
// to meet the number of characters conditions, then test it

Is that the relevant part? How does it work?

Original comment by: sspphheerraa

phajdan commented 9 years ago

http://manpages.ubuntu.com/manpages/dapper/man4/hunspell.4.html

(4) condition.

Zero stripping or affix are indicated by zero. Zero condition is indicated by dot. Condition is a simplified, regular expression-like pattern, which must be met before the affix can be applied. (Dot signs an arbitrary character. Characters in braces sign an arbitrary character from the character subset. Dash hasn’t got special meaning, but circumflex (^) next the first brace sets the complementer character set.)
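
As a rough illustration of the description above, here is a Python sketch that translates a suffix condition into a regular expression anchored at the end of the word. It is an approximation under stated assumptions, not Hunspell's code: the helper names `condition_to_regex` and `suffix_condition_matches` are hypothetical, only the documented constructs (dot, bracketed set, negated set) are handled, and the condition is assumed to be matched against the end of the word only.

```python
import re

def condition_to_regex(condition: str) -> str:
    """Translate a Hunspell-style condition into a regex fragment.

    Handles '.', '[abc]' and '[^abc]'; everything else is a literal.
    """
    out = []
    i = 0
    while i < len(condition):
        ch = condition[i]
        if ch == ".":
            out.append(".")  # arbitrary character
            i += 1
        elif ch == "[":
            end = condition.index("]", i + 1)
            body = condition[i + 1:end]
            negate = body.startswith("^")
            if negate:
                body = body[1:]
            # Dash has no special meaning, so every member is escaped.
            members = "".join(re.escape(c) for c in body)
            out.append("[" + ("^" if negate else "") + members + "]")
            i = end + 1
        else:
            out.append(re.escape(ch))
            i += 1
    return "".join(out)

def suffix_condition_matches(word: str, condition: str) -> bool:
    # Anchored at the end only; nothing constrains what precedes the match.
    pattern = "(?:" + condition_to_regex(condition) + r")\Z"
    return re.search(pattern, word) is not None

for cond in ["work", ".work"]:
    for word in ["work", "rework"]:
        print(cond, word, suffix_condition_matches(word, cond))
```

Under this model, `.work` rejects "work" because five characters are required, but the condition `work` accepts both "work" and "rework": nothing in the syntax can say "no more characters before this point".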

Comment: It is very simple. The intelligence must be in the .aff/.dic generation program.

Original comment by: swan46

sspphheerraa commented 7 years ago

This issue can be closed because it makes no sense. I now understand that 'no more characters' can't be implemented: affix rules have to serve as patterns for words of any length, whereas 'no more characters' would require fixing the length, and that would defeat the purpose of the spellchecker.

dimztimz commented 7 years ago

In the near or far future, more powerful regular expressions (with things like ^ and $, which mean start and end of string) may be supported somewhere in the Hunspell processing. For now, closing this as requested.