[Discussion] More than checking if a word is in the dictionary

saona-raimundo commented 2 years ago

Hi! I was reading about Hunspell and all the information that is packed in this format. I noticed that there are a few tasks that Hunspell tries to solve that might cause problems with the current implementation approach. In particular, I see a problem with the morphological analysis.

Currently, as far as I understood from the code, the list of words is a HashSet, which is really nice to check if a word is in the list.

The problem comes with homonyms. Let's take Hunspell's example of the English word work. The dictionary is

work    [VERB]
work    [NOUN]

For simplicity, let's assume there are no affixes. Then, when querying the word work, a HashSet that only holds the list of admissible words cannot tell us whether work is in the list as a [VERB] or as a [NOUN].
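To make the problem concrete, here is a minimal sketch (not the library's code). The `PartOfSpeech` enum and both helper functions are hypothetical names made up for this illustration. It shows how the two dictionary lines collapse into one element in a set, while a map from word to a list of tags keeps them apart.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical part-of-speech tag, used only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PartOfSpeech {
    Verb,
    Noun,
}

/// A plain set can only answer "is this word present?".
/// The two `work` lines collapse into a single element.
fn build_set(entries: &[(&str, PartOfSpeech)]) -> HashSet<String> {
    entries.iter().map(|(word, _)| word.to_string()).collect()
}

/// A map from word to all of its tags keeps the homonyms apart.
fn build_map(entries: &[(&str, PartOfSpeech)]) -> HashMap<String, Vec<PartOfSpeech>> {
    let mut map: HashMap<String, Vec<PartOfSpeech>> = HashMap::new();
    for (word, pos) in entries {
        map.entry(word.to_string()).or_default().push(*pos);
    }
    map
}
```

With the two `work` lines as input, `build_set` yields a set of length one, whereas `build_map` keeps both `Verb` and `Noun` under the key `work`.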

Have you thought about what data structures you would like to use to cover all features in the README?

tgross35 commented 2 years ago

There's a lot more to go on this project to reach feature parity with Hunspell - which is my eventual goal, but I've been aiming to do some architecting and benchmarking to figure out what works best.

HashSet<String> is quite fast for simple exists/not exists checking (outperformed hunspell in my simple benchmarks) but of course doesn't handle cases where there is any sort of metadata. For that, I have a very rough plan of something like:

use std::collections::HashSet;
use std::rc::{Rc, Weak};

struct Dictionary {
    config: Config,
    wordlist: HashSet<Entry>,
    wordlist_nosuggest: HashSet<Entry>,
    wordlist_forbidden: HashSet<Entry>,
    meta: Vec<Weak<Meta>>,
    // ...
}

struct Entry {
    word: String,
    meta: Vec<Rc<Meta>>
}

struct Meta {
    // Any info relevant to that entry including homonyms & source
}

in which case the hash for Entry will be just the hash of word, so true/false checking (the most common case) should still be very fast. The initial compile of the dictionary may take a bit longer to retain this information, but my goal with the library is to minimize check time instead.
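A rough sketch of how "the hash for Entry is just the hash of word" could look in practice. This is an assumption about one possible implementation, not what the library does: the `part_of_speech` field and the `build_wordlist` helper are invented for the example. Implementing `Borrow<str>` lets the set be queried with a plain `&str`, so the true/false check never needs to build an `Entry`.

```rust
use std::borrow::Borrow;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};
use std::rc::Rc;

#[derive(Debug)]
struct Meta {
    part_of_speech: String, // hypothetical field for this sketch
}

#[derive(Debug)]
struct Entry {
    word: String,
    meta: Vec<Rc<Meta>>,
}

// Equality and hashing look at `word` only, so metadata never
// affects set membership.
impl PartialEq for Entry {
    fn eq(&self, other: &Self) -> bool {
        self.word == other.word
    }
}
impl Eq for Entry {}

impl Hash for Entry {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Hash exactly like `str` so that `Borrow<str>` lookups agree.
        self.word.as_str().hash(state);
    }
}

impl Borrow<str> for Entry {
    fn borrow(&self) -> &str {
        &self.word
    }
}

/// Hypothetical loader: merges homonyms into one entry per word.
fn build_wordlist(raw: &[(&str, &str)]) -> HashSet<Entry> {
    let mut set: HashSet<Entry> = HashSet::new();
    for (word, pos) in raw {
        // Set elements cannot be mutated in place: take, update, reinsert.
        let mut entry = set.take(*word).unwrap_or_else(|| Entry {
            word: word.to_string(),
            meta: Vec::new(),
        });
        entry.meta.push(Rc::new(Meta { part_of_speech: pos.to_string() }));
        set.insert(entry);
    }
    set
}
```

The plain check is then `wordlist.contains("work")`, and `wordlist.get("work")` returns the entry whose `meta` holds one item per homonym. The cost is the take-and-reinsert step at load time, which fits the stated goal of favouring check time over compile time.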

All that being said, I'm open to suggestions if you have any!

tgross35 commented 1 year ago

Relevant manpages: https://linux.die.net/man/3/hunspell

And a good example of function usage (indirectly) from the R docs: https://cran.r-project.org/web/packages/hunspell/vignettes/intro.html

tgross35 commented 1 year ago

Hey @saona-raimundo, do you happen to have access to dictionary files that contain the morphological info required for this?

I'm getting closeish on a fairly large rewrite (#31) that will be able to support these features, but I don't have any reference dictionaries with morph information. The stemming is fine since that's just the root words, but things like part of speech and phonetics need the po:verb ph:f annotations in the source files - I get everything from here https://github.com/wooorm/dictionaries and they don't have it. Not sure if you might know of a better source.

Edit: I think I found the source, seems like they exist here https://github.com/en-wl/wordlist
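For reference, a simplified sketch of pulling annotations like po:verb ph:f out of a dictionary line. It assumes the common layout of a stem, an optional /flags part, then whitespace-separated key:value fields. `ParsedLine` and `parse_line` are hypothetical names, and real dictionary files have more cases (escaped slashes, for instance) that this ignores.

```rust
/// Hypothetical result type for this sketch.
#[derive(Debug, PartialEq)]
struct ParsedLine {
    stem: String,
    flags: Option<String>,
    /// (key, value) pairs such as ("po", "verb").
    morph: Vec<(String, String)>,
}

/// Parse one dictionary line of the assumed form `stem[/flags] key:value ...`.
/// Returns `None` for blank lines.
fn parse_line(line: &str) -> Option<ParsedLine> {
    let mut parts = line.split_whitespace();
    let head = parts.next()?;

    // Anything after the first '/' is treated as the flag string.
    let (stem, flags) = match head.split_once('/') {
        Some((stem, flags)) => (stem, Some(flags.to_string())),
        None => (head, None),
    };

    // Remaining tokens without a ':' are silently skipped here.
    let morph = parts
        .filter_map(|field| field.split_once(':'))
        .map(|(key, value)| (key.to_string(), value.to_string()))
        .collect();

    Some(ParsedLine { stem: stem.to_string(), flags, morph })
}
```

For a line such as `work/X po:verb ph:f` (the `X` flag is made up), this gives the stem `work`, the flag string `X`, and the pairs `po` → `verb` and `ph` → `f`.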

saona-raimundo commented 1 year ago

Glad you found some!

I do not have any, but was thinking about writing my own hunspell dictionaries. I could come up with small examples that could serve as unit tests if you want.

tgross35 commented 1 year ago

With the latest release 0.3.0 I have the function Dictionary.stem(word: &str) working if you'd like to try it out with whatever samples you might want (but it is behind the zspell-unstable flag so it doesn't show up on docs.rs). I'm working on the analyze function, but it's still a todo!() in the latest release (also behind the feature gate for that reason). 0.3.0 is a rewrite and the interfaces changed, so just check the docs if you've been playing with some code already.

I'm definitely not opposed to more tests if you have specific things you'd like this to work for, I can try out anything in the format of this template https://github.com/pluots/zspell/blob/df06c77705e0f18a28d6987e9139d7d1a8a3df95/crates/zspell/tests/files/0_example.test

tgross35 commented 1 year ago

@saona-raimundo it's been a while, but I am happy to say this is finally done! See the entry API which should allow for these things https://docs.rs/zspell/latest/zspell/struct.WordEntry.html.

Some cases of specifying the morph info directly in the dictionary file do not yet work; this is just blocked on a rewrite of the dictionary parser that I am eventually going to do.

I am pretty low on test cases, so if you come across any good examples, let me know. They just need to look like https://github.com/pluots/zspell/blob/72c22cf8d5647a9ccbdea12534c834643f267aef/zspell/tests/suite/stemming_morph.test (feel free to submit a PR directly!). If you try it out, let me know if you run into any issues.

pluots / zspell

[Discussion] More than checking if a word is in the dictionary #27