retextjs / retext-spell

plugin to check spelling
https://unifiedjs.com
MIT License
73 stars 16 forks source link

Exclude non-words from spell checker #6

Closed localjo closed 8 years ago

localjo commented 8 years ago

This fixes #5

In addition to words in the ignore array, "words" that are just digits will also be excluded.

localjo commented 8 years ago

Oops, got the regex wrong on this. One second. 😬

localjo commented 8 years ago

I'll add tests for it too. :)

localjo commented 8 years ago

I added an option called ignoreDigits that defaults to true, but if someone wants to turn it off they can pass ignoreDigits: false into the options. I added three tests too. @wooorm let me know what you think. :)

localjo commented 8 years ago

Just thinking this should probably exclude decimal numbers too, like 3.14, and probably some other forms. Maybe trying to parse the string withparseInt() would be a good way to determine if it's a valid number or not?

wooorm commented 8 years ago

Maybe if we turn it around, based on word-characters, so first do a check if something looks like a word? I know that Hunspell, by default, treats all characters which have a lower- and upper variant as word-characters, and extra characters can be added by .aff files (there’s some weird stuff in non-English languages)

localjo commented 8 years ago

That's probably a good idea. It might reduce a lot of other non-words being flagged too. Things like 9-5 or 20/20. I'm not so knowledgeable about how this would play out in non-english languages, especially languages that don't have latin scripts. For my needs, that's not relevant, but obviously this code is supposed to work with more than just English. I'll do some research on how to detect "word-characters" and lower/upper variants. Do you have any thoughts on what a good test would be? I don't think [a-zA-Z] is sufficient.

wooorm commented 8 years ago

The proper solution would be to use the, by an affix file defined WORD_CHARS, in addition to upper- and lower-casable characters, I think. I don’t think the current spell checker gives access to that though, so let’s do that later.

For now, I think we should:

I believe hyphens should be check both ways as well, as they can be used in not-so-fast as well, and basically chain any words together from what I gather.

localjo commented 8 years ago

I added code that will try each text node if a complete WordNode fails. I also added tests for that code. Breaking up the words that way also has the added effect of not flagging 3.14, 123^10, 9-5 or 24/7. It does flag 1.5e20, but that might be good enough for now. What do you think?

wooorm commented 8 years ago

Cool! 👍

Now, a word fails if the whole thing, or any text node is unknown. Alternatively, we could check the word node, and if that fails, check every text node on its own. That way, having not-so-faast would get a warning for faast, not for the whole thing.

I have a slight preference for the latter, you?

localjo commented 8 years ago

I tend to think of hyphenated words as a single unit, so I have a preference for the former. For example, if you've got the word "merry-go-round" and it's misspelled as "mary-go-round", you'd want to to have a warning for "mary-go-round" being misspelled, not for "mary". In the situation you mentioned, where someone is sort of inventing a hyphenated word on the fly, I don't think it would be incorrect to get a message that says "not-so-faast" is misspelled, even though technically only the "fasst" portion is misspelled.

So my vote is for hyphenated words to be treated as a single unit.

localjo commented 8 years ago

I do wonder if using a spell checker that has suggestions would change how we want to handle these things. For example, my browser knows that "marry-go-round" should be "merry-go-round", but if we simply break that up into ['marry', 'go', 'round'] we wouldn't catch that. That's probably a separate issue though and one to tackle later.

wooorm commented 8 years ago

Ok! 👍

Aside: There’s also AT&T, which is one word, and B.Sc. though, not sure about those either.

localjo commented 8 years ago

I'm thinking words like AT&T or B.Sc. would either be in the dictionary or added to the ignore array. Adding/ignoring words that aren't in the dictionary is another issue I'd like to discuss, but maybe in a different thread.

localjo commented 8 years ago

@wooorm Are there any remaining blockers to merging this that I can help clear?

wooorm commented 8 years ago

No :) 👍 Did a small refactor to the index.js though!