huonw / spellck

A spell-checker for Rust code.
Apache License 2.0

Spelling corrector #4

Open huonw opened 10 years ago

huonw commented 10 years ago

e.g. http://norvig.com/spell-correct.html

sinistersnare commented 10 years ago

I would love to (at least attempt to) tackle this!

matthieu-m commented 10 years ago

Note: LLVM also implements the Levenshtein edit distance (see here). The main advantage is that, unlike Norvig's implementation, the memory consumption is very small: O(min(m, n)), where m and n are the lengths of the two words.
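
For reference, the O(min(m, n)) bound comes from keeping only a single row of the dynamic-programming table alive. A minimal sketch of that trick in modern Rust (the thread predates Rust 1.0, and this is not LLVM's actual code):

```rust
// Single-row Levenshtein distance: only one row of the DP table is live,
// so memory is O(min(m, n)) once the shorter string is chosen as `b`.
fn lev_distance(a: &str, b: &str) -> usize {
    if a.chars().count() < b.chars().count() {
        return lev_distance(b, a); // make `b` the shorter string
    }
    let b_chars: Vec<char> = b.chars().collect();
    // row[j] = distance between the processed prefix of `a` and b[..j].
    let mut row: Vec<usize> = (0..=b_chars.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut prev_diag = row[0]; // dp[i-1][j-1] before it is overwritten
        row[0] = i + 1;
        for (j, &cb) in b_chars.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            let next = (row[j + 1] + 1)  // delete from `a`
                .min(row[j] + 1)         // insert into `a`
                .min(prev_diag + cost);  // substitute
            prev_diag = row[j + 1];
            row[j + 1] = next;
        }
    }
    row[b_chars.len()]
}
```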

sinistersnare commented 10 years ago

@matthieu-m Norvig does not implement the Levenshtein distance algorithm to determine the edit distance; instead he writes a function that returns all strings one edit away from a word. The LLVM function calculates the Levenshtein distance, much like Rust does in fn lev_distance(&self, &str) -> uint.

(Did I read that correctly?)
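
For concreteness, the edits-generating function in question (Norvig's edits1) translates roughly like this in modern Rust; this is a sketch of the idea, not spellck's code, and is restricted to ASCII letters, which is exactly the limitation discussed later in this thread:

```rust
// All strings one edit away from `word`: deletes, transposes,
// replaces, and inserts, over an ASCII-only alphabet.
fn edits1(word: &str) -> Vec<String> {
    let letters = "abcdefghijklmnopqrstuvwxyz";
    let chars: Vec<char> = word.chars().collect();
    let n = chars.len();
    let mut edits: Vec<String> = Vec::new();

    for i in 0..=n {
        // Deletes: drop the character at position i.
        if i < n {
            let mut d = chars.clone();
            d.remove(i);
            edits.push(d.into_iter().collect());
        }
        // Transposes: swap the characters at positions i and i + 1.
        if i + 1 < n {
            let mut t = chars.clone();
            t.swap(i, i + 1);
            edits.push(t.into_iter().collect());
        }
        for c in letters.chars() {
            // Replaces: substitute c for the character at position i.
            if i < n {
                let mut r = chars.clone();
                r[i] = c;
                edits.push(r.into_iter().collect());
            }
            // Inserts: place c just before position i.
            let mut ins = chars.clone();
            ins.insert(i, c);
            edits.push(ins.into_iter().collect());
        }
    }
    edits
}
```

Like Norvig's Python, the output contains duplicates; he removes them by collecting into a set.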

sinistersnare commented 10 years ago

I have initial work done on the spelling corrector. It uses almost the same algorithm as Norvig's, but I mostly worked from this Java version. Now that that is done, I am going to change the spelling function to use a Vec instead of a HashMap, and to make the code better looking in general.
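
The selection step being described, in the spirit of Norvig's correct(), might look roughly like the following hypothetical sketch, using the edits1 above and a HashMap of corpus counts (the Vec-based layout mentioned is not shown in the thread):

```rust
use std::collections::HashMap;

// Train word frequencies from a corpus: P(word) approximated by counts.
fn train(corpus: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for word in corpus
        .split(|c: char| !c.is_alphabetic())
        .filter(|w| !w.is_empty())
    {
        *counts.entry(word.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

// Pick the most frequent known candidate: the word itself, then words one
// edit away, then words two edits away, falling back to the input.
fn correct(word: &str, counts: &HashMap<String, usize>) -> String {
    if counts.contains_key(word) {
        return word.to_string();
    }
    let known = |cands: &[String]| -> Vec<String> {
        cands.iter().filter(|w| counts.contains_key(*w)).cloned().collect()
    };
    let e1 = edits1(word);
    let mut candidates = known(&e1);
    if candidates.is_empty() {
        let e2: Vec<String> = e1.iter().flat_map(|w| edits1(w)).collect();
        candidates = known(&e2);
    }
    candidates
        .into_iter()
        .max_by_key(|w| counts[w])
        .unwrap_or_else(|| word.to_string())
}
```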

huonw commented 10 years ago

General comments:

sinistersnare commented 10 years ago

@huonw thanks for the great feedback! I'm going to make this more idiomatic. Your feedback is most helpful; if you see anything else (if you even get to look more, with your busy schedule), I would love to hear it.

Thanks for the feedback! Work continuing now!

huonw commented 10 years ago

How do you see handling Unicode? The dictionaries used do not seem to care about Unicode; should we? I think that instead of crashing, we should just ignore it?

What do you mean by this? The /usr/share/dict/words on this computer contains a variety of non-ASCII characters, and there's no reason that this library can't be used with dictionaries of other languages. (At the very least, the plain dictionary look-up to see if a word is misspelt should be maintained, even if no suggestions are made.)
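
That degraded-but-safe behaviour can be sketched like so (the names and shape here are hypothetical, not spellck's actual API; edits1 as above):

```rust
use std::collections::HashSet;

// Membership works for any Unicode word; suggestions are a separate step.
fn is_misspelt(word: &str, dictionary: &HashSet<String>) -> bool {
    !dictionary.contains(word)
}

// Stop-gap: only attempt edit-based suggestions for ASCII words. Non-ASCII
// words still get the membership check above; they just receive no
// candidates, rather than causing a crash.
fn suggestions(word: &str, dictionary: &HashSet<String>) -> Vec<String> {
    if !word.is_ascii() {
        return Vec::new();
    }
    edits1(word)
        .into_iter()
        .filter(|w| dictionary.contains(w))
        .collect()
}
```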

Using Norvig's algorithm for creating suggestions is unreasonable in general, but the non-ASCII characters in English dictionaries (well, the word list on this computer) seem to be quite limited (listed with count), so it would be a reasonable stop-gap hack to extend the list of search characters with those.

A proper solution would require doing more research into these sorts of algorithms (or doing more thinking).

edits returning an Iterator of Strings rather than a Vec is interesting; I have not done much work with custom Rust iterators, so I will see what I can do!

You could also just take a |&str| closure, rather than writing a whole iterator. This closure approach would also allow you to reuse the String allocation, to avoid having to thrash the memory allocator.
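
In today's Rust (the |&str| above is pre-1.0 closure syntax), that suggestion might look something like this hypothetical sketch, where a single String buffer is reused for every candidate:

```rust
// Call `f` once per candidate edit, reusing one buffer throughout,
// instead of allocating a fresh String (or a whole Vec) per edit.
fn for_each_edit<F: FnMut(&str)>(word: &str, mut f: F) {
    let letters = "abcdefghijklmnopqrstuvwxyz";
    let chars: Vec<char> = word.chars().collect();
    let mut buf = String::with_capacity(word.len() + 1);

    for i in 0..chars.len() {
        // Deletes: the word without the character at position i.
        buf.clear();
        buf.extend(chars[..i].iter());
        buf.extend(chars[i + 1..].iter());
        f(&buf);

        // Replaces: each letter substituted for the character at position i.
        for c in letters.chars() {
            buf.clear();
            buf.extend(chars[..i].iter());
            buf.push(c);
            buf.extend(chars[i + 1..].iter());
            f(&buf);
        }
    }
    // Transposes and inserts follow the same clear-fill-call pattern.
}
```

A caller can then test each candidate against the dictionary as it is produced, with no per-candidate allocation.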


The measurements above were created with:

sinistersnare commented 10 years ago

The /usr/share/dict/words on this computer contains a variety of non-ASCII characters

Yeah, sorry about that; I am testing this on Windows, so I just downloaded a list of English words and blindly assumed /usr/share/dict/words would just be English too.

You could also just take a |&str| closure

I will look into that too

Without proper training data the spelling corrector is very likely to give incorrect answers, and for proper training we would need a good source of multilingual training text. The original goal was to just use /usr/share/dict/words, but like I said, that doesn't seem too feasible.

I probably just do not have the theory to do this, so I would like to step back from saying "I'll do this." Of course I want to continue to work on this, but I do not think I will get very far. :(

huonw commented 10 years ago

No, definitely keep working on it; I will accept something that doesn't handle non-ASCII fully (or at all), as long as it doesn't crash and still states that a word is misspelled (i.e. it doesn't need to handle suggestions/corrections).

I was just saying what the 'proper' fix would be long term; having some corrections is better than having none, even if it doesn't handle every case.

Yeah, sorry about that; I am testing this on Windows, so I just downloaded a list of English words and blindly assumed /usr/share/dict/words would just be English too.

It is (mostly) English; the words I listed above may be mainly borrowed from other languages, but most of them are proper English words (other than the names, that is). As this shows, it's wrong to assume that English means ASCII.


However, I think you can make progress and handle the majority of the non-ASCII English borrowed words by adding áâäåçèéêíñóôöûü to your set of letters.
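
As a purely illustrative version of that stop-gap, with the alphabet extended and all indexing done on chars rather than bytes, so multi-byte UTF-8 cannot cause a slicing panic:

```rust
// ASCII letters plus the non-ASCII characters observed in the word list.
const LETTERS: &[char] = &[
    'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
    'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
    'á', 'â', 'ä', 'å', 'ç', 'è', 'é', 'ê', 'í', 'ñ', 'ó', 'ô', 'ö', 'û', 'ü',
];

// One edit kind shown; the others extend the same way. Indexing a Vec<char>
// counts characters, not bytes, so no multi-byte sequence can be split.
fn replacements(word: &str) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut out: Vec<String> = Vec::new();
    for i in 0..chars.len() {
        for &c in LETTERS {
            let mut cand = chars.clone();
            cand[i] = c;
            out.push(cand.into_iter().collect());
        }
    }
    out
}
```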

sinistersnare commented 10 years ago

Ok, do you think you could post the contents of your /usr/share/dict/words as a tar.gz somewhere? I cannot find a suitable Windows equivalent that bundles it into a single file. That might help on the Unicode front, since I am on Windows. I will continue working.

I just added the words from your ix.io post to my words dictionary, so hopefully it'll help a bit with that.

huonw commented 10 years ago

http://ix.io/dsV (~1MB)

huonw commented 10 years ago

Sorry for missing all your pings on IRC, @sinistersnare; it looks like we're just in the wrong timezones (I'm asleep while you're around, and you're asleep/away while I'm around). Feel free to just ask me questions here. :)