streetsidesoftware / vscode-spell-checker

A simple source code spell checker for code
https://streetsidesoftware.github.io/vscode-spell-checker/

C/C++: Some misspelled words are not detected #345

Open ambrop72 opened 5 years ago

ambrop72 commented 5 years ago

Misspelled words which are not detected: avalible, handeled, evalulated, deciced, pressent, senting.

Jason3S commented 5 years ago

It detected those words for me.

I searched all the dictionaries, the words were not found. What programming language are you using?

ambrop72 commented 5 years ago

Hi, thanks for looking into this. I'm using C++ (and the official C++ extension). I didn't do any special setup of the spell checker other than which file types are checked and adding some words to the workspace dictionary (definitely not these ones, I will check).

Jason3S commented 5 years ago

It is because most people, while programming in C++, glue words together, e.g. errorhandler. To account for that, the spell checker allows compound words. Each of your examples is made up of multiple valid words: dec°iced, press°ent, han°deled
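
To illustrate why such typos slip through, here is a minimal sketch of unrestricted compound matching. It is not cspell's actual implementation; the function name `is_compound` and the tiny `WORDS` set are assumptions, with the word parts taken from the splits listed above.

```python
# Toy word list (an assumption); the parts mirror the splits named in the thread.
WORDS = {"dec", "iced", "press", "ent", "han", "deled", "error", "handler"}


def is_compound(word, words=WORDS):
    """Return True if `word` is a known word or a concatenation of known words."""
    word = word.lower()
    if word in words:
        return True
    # Try every split point: a known prefix followed by an acceptable remainder.
    for i in range(1, len(word)):
        if word[:i] in words and is_compound(word[i:], words):
            return True
    return False


for candidate in ["errorhandler", "deciced", "pressent", "handeled", "avalible"]:
    print(candidate, is_compound(candidate))
```

With this toy list, errorhandler is accepted as intended, but deciced, pressent, and handeled are accepted too, because each splits into known parts.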

Jason3S commented 5 years ago

I too do not think the way it currently works is ideal.

The plan is to change C/C++ compound matching to match against noun compounds instead of compounds made up of all words. errorcode, resturncode, htmlelement, messagehandler, errormessage would all be the kinds of stuff it would think is correct. This would help with suggestions as well. Things like noun{1,3} or (verb)(noun){,3}.
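
A rough sketch of what matching against noun compounds could look like, assuming every dictionary word carried a part-of-speech tag. The `TAGS` table, the single-letter tag codes, and the function names are hypothetical and only serve to show the noun{1,3} and (verb)(noun){,3} patterns.

```python
import re

# Hypothetical part-of-speech tags: N = noun, V = verb, X = anything else.
TAGS = {
    "error": "N", "code": "N", "return": "N", "html": "N", "element": "N",
    "message": "N", "handler": "N", "get": "V", "dec": "X", "iced": "X",
}

# noun{1,3} or (verb)(noun){,3}, written over the tag letters.
ALLOWED = re.compile(r"N{1,3}|VN{0,3}")


def tag_sequences(word):
    """Yield the tag string for every way to split `word` into known words."""
    if not word:
        yield ""
        return
    for i in range(1, len(word) + 1):
        tag = TAGS.get(word[:i])
        if tag:
            for rest in tag_sequences(word[i:]):
                yield tag + rest


def is_noun_compound(word):
    """Accept `word` only if some split matches one of the allowed patterns."""
    return any(ALLOWED.fullmatch(seq) for seq in tag_sequences(word.lower()))


for candidate in ["errorcode", "htmlelement", "geterrorcode", "deciced"]:
    print(candidate, is_noun_compound(candidate))
```

Under these assumed tags, errorcode and htmlelement pass, while deciced is rejected because neither part is tagged as a noun or verb.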

kkaja123 commented 5 years ago

I am running into this issue during code reviews and it causes quite a bit of grief. Specifically, words like evalute and GetMsgSrollTime (should be GetMsgScrollTime) are not being detected.

Many developers use a naming convention to separate individual words in an identifier name (e.g. camelCase, PascalCase, and snake_case). It would be great if the extension could take advantage of this to check for misspelled words. Unfortunately, it is difficult to predict which naming convention a developer is using. Therefore, an option to control how the extension parses compound words could work well for this issue. The option would have a checklist of common compound word naming conventions (including compound words using all lowercase letters) that the extension would know to treat as compound words. If camelCase is enabled, evalUte would be okay. If camelCase is disabled, evalUte would be treated as one word, "evalute," and would be incorrect. If alllowercase is enabled, evalute would be okay. If alllowercase is disabled, evalute would be incorrect. I think you get the picture.

There's also the case where any naming convention could be used (non-common ones). Considering something like eVaLUtE, if any naming convention is enabled, the extension would behave as it does today (does not detect individual words based on case changes). The extension would interpret that word as eVaL + UtE. I hope it's obvious that nobody would think that eVaLUtE is correctly spelled, since humans follow patterns instead of chaos.

kit1980 commented 4 years ago

I see the same with Python. For example, "singal" is not detected, presumably because it's "sing" + "al".

I thought that the cSpell.allowCompoundWords would control this behavior, but apparently not...

Jason3S commented 4 years ago

@kit1980 you are right, it is because allowCompoundWords is turned on for Python and C/C++.

To turn off allowCompoundWords for a language, you need to override it at the language level:

The following will turn off compound word matching for C/C++ and Python:

    "cSpell.languageSettings": [
        {
            "languageId": "c,cpp,python",
            "allowCompoundWords": false
        }
    ]
jharrang commented 4 years ago

Thanks for the fix! This should really be the default behavior IMHO (or at least the default behavior needs some case-matching refinement). Add me to the list of people who pushed code with typos because of this.

Jason3S commented 4 years ago

My plan is to turn allowCompoundWords off by default. To do that, I have been working on a way to define compoundable words. It is a simple syntax:

error*
*code
+infix+
+msg

* - optional compound
+ - required compound

With this definition valid words are:

error, code, errorcode, errormsg, errorinfixmsg

The following are some of the words that are not allowed:

codemsg, msg
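
The rules above can be sketched as a small checker. This is only one reading of the syntax, not the cspell implementation: a marker on a side allows joining on that side, + additionally requires it, and no marker forbids it. The names `parse_entry` and `is_valid` are made up for illustration.

```python
def parse_entry(entry):
    """Split an entry such as '+infix+' into (word, left_marker, right_marker)."""
    left = entry[:1] if entry[:1] in ("*", "+") else ""
    right = entry[-1:] if entry[-1:] in ("*", "+") else ""
    return entry.strip("*+"), left, right


def is_valid(word, entries):
    """Check `word` against compound entries using the assumed marker rules."""
    rules = [r for r in map(parse_entry, entries) if r[0]]

    def match(rest, prev_right):
        # prev_right is None at the start of the word, otherwise it is the
        # right-hand marker of the part that was matched just before `rest`.
        for part, left, right in rules:
            if not rest.startswith(part):
                continue
            if prev_right is None:
                if left == "+":  # this part must be preceded by another part
                    continue
            elif not (prev_right and left):  # both sides must permit the join
                continue
            tail = rest[len(part):]
            if not tail:
                if right != "+":  # a trailing '+' means something must follow
                    return True
            elif match(tail, right):
                return True
        return False

    return match(word, None)


ENTRIES = ["error*", "*code", "+infix+", "+msg"]
for candidate in ["error", "code", "errorcode", "errormsg", "errorinfixmsg", "codemsg", "msg"]:
    print(candidate, is_valid(candidate, ENTRIES))
```

Under this reading the five valid words listed above are accepted, and codemsg and msg are rejected.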
PEZ commented 2 years ago

Is this the reason why servie isn't correctly checked? In a plain text file:

https://user-images.githubusercontent.com/30010/156350957-74843052-b408-4481-9bae-6f75b3ae7aa9.mp4

PEZ commented 2 years ago

Is this the reason why servie isn't correctly checked?

Yes it was. Sorry for the noise.

Jason3S commented 2 years ago

@PEZ,

You can use the cspell trace command to check.

npx cspell trace --language-id=cpp servie
PEZ commented 2 years ago

Ah. sweet!

mwermelinger commented 1 year ago

The setting for compound words tells us that it might make misspelled words look correct. It would be nice to also tell us the setting can be disabled per language. I was getting frustrated with all the undetected typos in Markdown, like insructions, but disabling compound words for Markdown helps a lot. I'd rather have false positives (flagged correct word) than false negatives (misspelled word not flagged). Is there also a way to disable in code comments, i.e. compound words would only be allowed in code, not in natural language text?

Jason3S commented 1 year ago

@mwermelinger,

allowCompoundWords is now off by default. It has been the cause of many complaints.

I continue to strongly urge not setting allowCompoundWords to true.

I think a better practice is to just add the common compound words to a custom dictionary.

It is possible to define a custom compound dictionary:

cspell.config.yaml

dictionaryDefinitions:
  - name: code-compounds
    description: Custom Dictionary for compound words
    path: ./compound-words.txt
    addWords: true

languageSettings:
  - caseSensitive: false
    languageId: cpp,c,python,javascript
    dictionaries:
      - code-compounds

compound-words.txt

*code*
*error*
*errors*
*help*
+end
begin+
+middle+
array

Only words with * or + will be combined.

mwermelinger commented 1 year ago

Jason, thanks for the reply but I'm afraid I don't understand the approach of having to explicitly list the compound words. How would cSpell accept identifiers like dayTimeUserMessage, unless we add all those (and many other) words to the dictionary? Seems a very labour intensive approach to add words to the dictionary as needed, unless I'm missing some point. Thanks in advance for any clarification.

mwermelinger commented 1 year ago

Forget it. Senior moment: snake and camel case are not considered compound words.

Jason3S commented 1 year ago

snake and camel case are not considered compound words.

Exactly. The spell checker is able to split snake and camel case. It will even handle ERRORcode and ERRORCode. With an identifier like ERRORCode it will try both (ERRORC, ode) and (ERROR, Code). It will handle IFrame, but not iframe.
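
A minimal sketch of this kind of splitting, under the assumption that an uppercase run followed by lowercase letters is ambiguous and both readings are tried. It is not the extension's real algorithm, and `split_identifier` is a made-up name.

```python
import re
from itertools import product


def split_identifier(identifier):
    """Return every candidate word list for a snake_case or camelCase identifier."""
    # Runs of uppercase letters, lowercase letters, or digits; '_' is dropped.
    runs = re.findall(r"[A-Z]+|[a-z]+|[0-9]+", identifier)
    chunks = []
    i = 0
    while i < len(runs):
        run = runs[i]
        nxt = runs[i + 1] if i + 1 < len(runs) else ""
        if run.isupper() and nxt.islower():
            if len(run) == 1:
                options = [[run + nxt]]  # "T" + "ime" -> "Time"
            else:
                # Ambiguous: keep the whole uppercase run together, or let its
                # last letter start the following word.
                options = [[run, nxt], [run[:-1], run[-1] + nxt]]
            i += 2
        else:
            options = [[run]]
            i += 1
        chunks.append(options)
    # Combine one option per chunk into full candidate splits.
    return [sum(combo, []) for combo in product(*chunks)]


for name in ["ERRORCode", "dayTimeUserMessage", "get_msg_scroll_time", "iframe"]:
    print(name, split_identifier(name))
```

A checker built on this would accept an identifier if every word of at least one candidate split is in the dictionary. An all-lowercase name such as iframe has no case boundary, so it stays a single word.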

Using the compound syntax above the following is considered correct:

Not accepted: