languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

Ignoring snake_case words by Hunspell #10137

Open anikitin opened 9 months ago

anikitin commented 9 months ago

I am using LanguageTool to verify API documentation. Some documentation fragments include variable names in snake_case.

It would be very helpful to have a speller option to ignore words in snake case, similar to the one for camel case, e.g. fsa.dict.speller.ignore-snake-case

jaumeortola commented 9 months ago

This answer is valid here.

anikitin commented 9 months ago

@jaumeortola, thanks for your suggestion. I added the following rule to disambiguation.xml:

    <rule id="IGNORE_SNAKE_CASE" name="ignore words in snake_case">
        <pattern>
            <token regexp="yes">\p{Ll}+_\p{Ll}.*</token>
        </pattern>
        <disambig action="ignore_spelling"/>
    </rule>

It mostly works, but I found a strange case where it failed to ignore a snake_case word:

"See the description of "ivr_pin" parameter" --> "Possible spelling mistake found."

at the same time:

"See the description of "new_pin" parameter" --> no matches.

(BTW, "ivrPin" in camelCase is correctly ignored by another rule you suggested)

Any ideas why it is happening?

jaumeortola commented 9 months ago

If you are using English, the problem is the tokenization.

You will need a rule like this one:

    <rule id="IGNORE_SNAKE_CASE" name="ignore words in snake_case">
        <pattern>
            <token regexp="yes">\p{Ll}+</token>
            <token spacebefore="no">_</token>
            <token regexp="yes" spacebefore="no">\p{Ll}.*</token>
        </pattern>
        <disambig action="ignore_spelling"/>
    </rule>

This rule should also work for snake_case_with_multiple_underscores.
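
To illustrate why the single-token pattern missed "ivr_pin" while "new_pin" looked fine, here is a minimal Python sketch. It assumes that the English tokenizer splits at the underscore, which is what the three-token rule above implies. The `tokenize` helper, the tiny `DICTIONARY`, and the ASCII `[a-z]` stand-in for `\p{Ll}` are illustrative assumptions, not LanguageTool code, and the `spacebefore="no"` check is not modelled.

```python
import re

# Hypothetical stand-in for the speller dictionary.
DICTIONARY = {"new", "pin"}

LOWER_WORD = re.compile(r"[a-z]+")            # ASCII approximation of \p{Ll}+
LOWER_START = re.compile(r"[a-z].*")          # ASCII approximation of \p{Ll}.*
SINGLE_TOKEN = re.compile(r"[a-z]+_[a-z].*")  # the original one-token pattern


def tokenize(text):
    """Assumed tokenizer behaviour: every '_' becomes a token of its own."""
    return re.findall(r"[A-Za-z]+|_", text)


def ignored_indexes(tokens, three_token_rule):
    """Indexes of tokens that the chosen rule would exempt from spell checking."""
    ignored = set()
    for i, token in enumerate(tokens):
        if not three_token_rule:
            # One-token pattern: can only match a token that contains '_' itself.
            if SINGLE_TOKEN.fullmatch(token):
                ignored.add(i)
        elif i + 2 < len(tokens):
            # Three-token pattern: word, '_', word.
            first, middle, last = tokens[i:i + 3]
            if LOWER_WORD.fullmatch(first) and middle == "_" and LOWER_START.fullmatch(last):
                ignored.update((i, i + 1, i + 2))
    return ignored


def flagged(text, three_token_rule=False):
    """Word tokens that a plain dictionary lookup would still report."""
    tokens = tokenize(text)
    skip = ignored_indexes(tokens, three_token_rule)
    return [t for i, t in enumerate(tokens)
            if i not in skip and t != "_" and t not in DICTIONARY]


print(flagged("ivr_pin"))                         # ['ivr']
print(flagged("new_pin"))                         # []
print(flagged("ivr_pin", three_token_rule=True))  # []
```

Under these assumptions the one-token pattern never matches anything. "new_pin" only appears to be ignored because "new" and "pin" are both dictionary words, while "ivr" is not.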

anikitin commented 9 months ago

Rocket science! Thanks a lot, @jaumeortola!

Another case I am currently struggling with is ignoring an entire word if it is in backquotes. It helps to avoid false positives about attributes like "ivr". Would you be so kind as to recommend how to define a pattern for this case? Should it look like this?

    <token spacebefore="no">`</token>
    <token regexp="yes">\p{Ll}+</token>
    <token spacebefore="no">`</token>

jaumeortola commented 9 months ago

Maybe this:

    <token>`</token>
    <token spacebefore="no" regexp="yes">\p{Ll}+</token>
    <token spacebefore="no">`</token>

anikitin commented 9 months ago

Thanks again, @jaumeortola. Works like a charm!

anikitin commented 9 months ago

@jaumeortola, sorry to trouble you with questions again. Is there a simple way to exclude from the spell check everything within backticks, including characters that are treated as delimiters, such as "=", "-", and even spaces? I suspect it should be a construction similar to the one you suggested for the underscore. Should I just list all such symbols as individual tokens rather than in the form of a regexp? Thanks!

jaumeortola commented 9 months ago

It will be easier for me to understand the question with an example. Seeing a sentence example, I can tell you how to write the pattern.

anikitin commented 9 months ago

Sure, it is all about some API usage fragments.

For example, I can have an embedded API call fragment within some markdown method description, e.g.

Backticks in markdown indicate a monospace fragment, which in our case is only used to highlight code fragments. These fragments need to be excluded from spell/grammar checks to reduce the number of false positives.

Hope this explanation helps.

Thanks in advance!

anikitin commented 9 months ago

@jaumeortola, any hints here? ^^^ Thanks!

jaumeortola commented 9 months ago

Something like this should cover most cases:

    <rule>
        <pattern>
            <token>`</token>
            <token spacebefore="no" skip="-1"><exception>`</exception><exception scope="next">`</exception> </token>
            <token spacebefore="no">`</token>
        </pattern>
        <disambig action="ignore_spelling"/>
    </rule>

anikitin commented 9 months ago

Hmm, @jaumeortola, this doesn't work for me. For the line "Use `pid=1` parameter", the result is "Possible spelling mistake found."

    <rule id="IGNORE_WORDS_IN_BACKQUOTES" name="ignore words within backquote characters">
        <pattern>
            <token>`</token>
            <token spacebefore="no" skip="-1"><exception>`</exception><exception scope="next">`</exception> </token>
            <token spacebefore="no">`</token>
        </pattern>
        <disambig action="ignore_spelling"/>
    </rule>

jaumeortola commented 9 months ago

Some elements of the syntax don't work in the disambiguation file (because we have never needed them there). The only quick solution is to write one rule for each pattern with a fixed number of tokens:

    <rulegroup id="IGNORE_WORDS_IN_BACKQUOTES" name="ignore words within backquote characters">
        <rule>
            <pattern>
                <token>`</token>
                <token spacebefore="no"><exception>`</exception></token>
                <token spacebefore="no">`</token>
            </pattern>
            <disambig action="immunize"/>
        </rule>
        <rule>
            <pattern>
                <token>`</token>
                <token spacebefore="no"><exception>`</exception></token>
                <token><exception>`</exception></token>
                <token spacebefore="no">`</token>
            </pattern>
            <disambig action="immunize"/>
        </rule>
        <rule>
            <pattern>
                <token>`</token>
                <token spacebefore="no"><exception>`</exception></token>
                <token><exception>`</exception></token>
                <token><exception>`</exception></token>
                <token spacebefore="no">`</token>
            </pattern>
            <disambig action="immunize"/>
        </rule>
        <rule>
            <pattern>
                <token>`</token>
                <token spacebefore="no"><exception>`</exception></token>
                <token><exception>`</exception></token>
                <token><exception>`</exception></token>
                <token><exception>`</exception></token>
                <token spacebefore="no">`</token>
            </pattern>
            <disambig action="immunize"/>
        </rule>
    </rulegroup>
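
Because each rule covers exactly one fragment length, longer code fragments need more rules of the same shape. A small Python sketch that generates such a rule group might save some typing. The function name `backquote_rulegroup` and its `max_inner` parameter are hypothetical; the emitted XML simply mirrors the rules above and has not been checked against LanguageTool itself.

```python
import xml.etree.ElementTree as ET


def backquote_rulegroup(max_inner=4):
    """Build one rule per number of tokens (1..max_inner) between the backquotes."""
    first = '            <token spacebefore="no"><exception>`</exception></token>'
    other = '            <token><exception>`</exception></token>'
    lines = ['<rulegroup id="IGNORE_WORDS_IN_BACKQUOTES" '
             'name="ignore words within backquote characters">']
    for inner in range(1, max_inner + 1):
        lines += ['    <rule>',
                  '        <pattern>',
                  '            <token>`</token>',
                  first]
        # Every further inner token may be preceded by a space.
        lines += [other] * (inner - 1)
        lines += ['            <token spacebefore="no">`</token>',
                  '        </pattern>',
                  '        <disambig action="immunize"/>',
                  '    </rule>']
    lines.append('</rulegroup>')
    return "\n".join(lines)


xml_text = backquote_rulegroup(4)
ET.fromstring(xml_text)  # raises ParseError if the generated XML is malformed
print(xml_text)
```

With `max_inner=4` the output has the same four rules as the group above; a larger value extends the coverage to longer fragments at the cost of a bigger rule file.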