Closed localjo closed 8 years ago
Oops, got the regex wrong on this. One second. 😬
I'll add tests for it too. :)
I added an option called ignoreDigits
that defaults to true, but if someone wants to turn it off they can pass ignoreDigits: false
into the options. I added three tests too. @wooorm let me know what you think. :)
Just thinking this should probably exclude decimal numbers too, like 3.14
, and probably some other forms. Maybe trying to parse the string withparseInt()
would be a good way to determine if it's a valid number or not?
Maybe if we turn it around, based on word-characters, so first do a check if something looks like a word? I know that Hunspell, by default, treats all characters which have a lower- and upper variant as word-characters, and extra characters can be added by .aff
files (there’s some weird stuff in non-English languages)
That's probably a good idea. It might reduce a lot of other non-words being flagged too. Things like 9-5
or 20/20
. I'm not so knowledgeable about how this would play out in non-english languages, especially languages that don't have latin scripts. For my needs, that's not relevant, but obviously this code is supposed to work with more than just English. I'll do some research on how to detect "word-characters" and lower/upper variants. Do you have any thoughts on what a good test would be? I don't think [a-zA-Z]
is sufficient.
The proper solution would be to use the, by an affix file defined WORD_CHARS, in addition to upper- and lower-casable characters, I think. I don’t think the current spell checker gives access to that though, so let’s do that later.
For now, I think we should:
I believe hyphens should be check both ways as well, as they can be used in not-so-fast
as well, and basically chain any words together from what I gather.
I added code that will try each text node if a complete WordNode
fails. I also added tests for that code. Breaking up the words that way also has the added effect of not flagging 3.14
, 123^10
, 9-5
or 24/7
. It does flag 1.5e20
, but that might be good enough for now. What do you think?
Cool! 👍
Now, a word fails if the whole thing, or any text node is unknown.
Alternatively, we could check the word node, and if that fails, check every text node on its own. That way, having not-so-faast
would get a warning for faast
, not for the whole thing.
I have a slight preference for the latter, you?
I tend to think of hyphenated words as a single unit, so I have a preference for the former. For example, if you've got the word "merry-go-round" and it's misspelled as "mary-go-round", you'd want to to have a warning for "mary-go-round" being misspelled, not for "mary". In the situation you mentioned, where someone is sort of inventing a hyphenated word on the fly, I don't think it would be incorrect to get a message that says "not-so-faast" is misspelled, even though technically only the "fasst" portion is misspelled.
So my vote is for hyphenated words to be treated as a single unit.
I do wonder if using a spell checker that has suggestions would change how we want to handle these things. For example, my browser knows that "marry-go-round" should be "merry-go-round", but if we simply break that up into ['marry', 'go', 'round']
we wouldn't catch that. That's probably a separate issue though and one to tackle later.
Ok! 👍
Aside: There’s also AT&T
, which is one word, and B.Sc.
though, not sure about those either.
I'm thinking words like AT&T
or B.Sc.
would either be in the dictionary or added to the ignore
array. Adding/ignoring words that aren't in the dictionary is another issue I'd like to discuss, but maybe in a different thread.
@wooorm Are there any remaining blockers to merging this that I can help clear?
No :) 👍 Did a small refactor to the index.js
though!
This fixes #5
In addition to words in the
ignore
array, "words" that are just digits will also be excluded.