inbo / checklist

An R package for checking R packages and R code
https://inbo.github.io/checklist
GNU General Public License v3.0
16 stars 2 forks source link

add spell checking support #90

Closed ThierryO closed 1 year ago

ThierryO commented 2 years ago

Great job! In the code I added some suggestions, mainly in the vignet, and here also some cases with unwanted outcome that you might want to address:

Thanks for the feedback

  • In this line and this line, I get multiple times the remark `+` not found in the dictionary or wordlist for nl-BE referring at each single character of the line

This is a bug due to the fact that + is a special character in regular expressions. I'll fix that.

  • generally, the use of 'quotes' or 'single quotes' ('' around 1 or multiple words) and / to give different options (both as xxx/yyy or xxx / yyy / zzz) is not appreciated (up to you to choose if these should be allowed) (often used in vignet handleiding.Rmd in dhcurve if you need an example)

Use double quotes instead of single quotes. Removing all single quotes before spell checking will lead to other problems. E.g. words like doesn't, 's avonds.

xxx / yyy / zzz seems to be the proper way to write this. I'll remove any / that is surrounded by whitespace. So xxx / yyy / zzz will be checked as xxx yyy zzz, whereas xxx/yyy/zzz will remain xxx/yyy/zzz.

  • the com that is reported as spelling error that I mentioned earlier, is the one on this line (and one 3 lines later, but not the others in the document)

URL's in markdown format [x](y) are replaced with the link text x. So [GitHub.com](https://github.com) is changed into GitHub.com. Plain text URLs are removed. GitHub.com is technically not an URL. https://github.com is. You should write this either as [GitHub](https://github.com) or https://github.com.

ElsLommelen commented 2 years ago

Concerning setup_project(): I noticed it only adds a checklist.yml file, while setup_source() adds a folder .github with github actions as well. Will setup_project() be elaborated to have the same functionality (but with the option to choose which checks have to be done automatically)?

And more generally: will there be an option to add the spelling check to the github actions (both for projects and packages)?

ThierryO commented 2 years ago

setup_project() is not finished yet. It should at least be able to set the same things as setup_source(). Likewise there will be a check_project() instead of check_source().

Adding check_spelling() to the GHA is on the to do list. It will be a part of check_package() (mandatory check) and check_project() (optional check).

ElsLommelen commented 2 years ago

Another consideration while using it: would it be possible to put the documents in the Markers tab in alphabetic order in such way that they are also ordered by folder (similar to lintr markers)? As the markers point to the .Rd files and not to the function documentation itself, it is honestly already a pain in the ass to make corrections (e.g. put function and variable names between backticks, as I didn't do this years ago when I made the package...), and a constant switch between .Rd files in the man folder and .Rmd files elsewhere in the package, makes it even harder.

And considering this: are you sure you want to make it mandatory for packages? It may refrain people from using the checks altogether, or they may just put all the markers in the dictionary to get rid of this nuisance (by which it is useless). ;-)

(Maybe some more comments will follow in the next hours of days, I just take advantage of this open PR as long as it exists. ;-) )

ElsLommelen commented 2 years ago

An issue that may be typical to the Dutch language: the use of hyphens. Some examples to illustrate:

I'm not sure this can be solved on a general level?

ThierryO commented 2 years ago

Another consideration while using it: would it be possible to put the documents in the Markers tab in alphabetic order in such way that they are also ordered by folder (similar to lintr markers)?

Currently I order them based on the number of issues per file. Have the most problematic file first seem convenient to me. I can order them along the relevant path.

As the markers point to the .Rd files and not to the function documentation itself, it is honestly already a pain in the ass to make corrections (e.g. put function and variable names between backticks, as I didn't do this years ago when I made the package...), and a constant switch between .Rd files in the man folder and .Rmd files elsewhere in the package, makes it even harder.

Yes, that would be a nice thing to have. However is it much harder to do. Although the .Rd file states which file to edit, in pratice you still might need to edit another file. E.g. when you use @inheritParms or @template. The problem might even be outside your package. E.g. when you use @inheritParms from a different package.

So don't expect this to be fixed in this PR. Maybe in a future version.

And considering this: are you sure you want to make it mandatory for packages? It may refrain people from using the checks altogether, or they may just put all the markers in the dictionary to get rid of this nuisance (by which it is useless). ;-)

Why would some not want to correct the spelling in their package? And that same person is OK with check_lintr()? I'd expect more opposition against check_lintr() than against check_spelling(). If check_spelling() is a lot of work to fix, then either they use a ton of specialised words or their texts contain lots of typos.

(Maybe some more comments will follow in the next hours of days, I just take advantage of this open PR as long as it exists. ;-) )

ThierryO commented 2 years ago

An issue that may be typical to the Dutch language: the use of hyphens. Some examples to illustrate:

  • zomer- en wintereik gives a spelling error for zomer-
  • html-bestand or `.html`-bestand I'm not sure this can be solved on a general level?

Note that `.html`-bestand is viewed as -bestand by the spell checker.

I think it is only safe to the replace the hyphen in case it is in the middle of the word (html-bestand becomes two words html and bestand). Or when the word starts with a hyphen (-bestand is checked a bestand). I'm reluctant to change that for words ending with an hyphen. They might not be a correct word on themselves. In such case, add them to the custom dictionary.

ThierryO commented 2 years ago

Removing hyphen seems not a good idea. The check would fail to detect words were a mandatory hyphen is missing.

ElsLommelen commented 2 years ago

Yes, that would be a nice thing to have. However is it much harder to do. Although the .Rd file states which file to edit, in pratice you still might need to edit another file. E.g. when you use @inheritParms or @template. The problem might even be outside your package. E.g. when you use @inheritParms from a different package.

So don't expect this to be fixed in this PR. Maybe in a future version.

I can imagine, but for this, ordering by folder and filename (instead of number of issues), i.e. same order as the explorer window, already helps a bit to systematically solve the errors in the files (instead of having to swop from one folder to the other).

Why would some not want to correct the spelling in their package? And that same person is OK with check_lintr()?

There was a collegue who was hesitating on the spelling check. Lintr checks can be largely solved with the styler package, and is is already in the checklist CI for quite some time, so current users already did this, but spelling is a new issue to deal with.

If check_spelling() is a lot of work to fix, then either they use a ton of specialised words or their texts contain lots of typos.

Just try it on an old package that has never been tested before. For dhcurve I got hundreds of remarks:

So even without specialised words, it can ask a lot of work to fix all spelling issues if you do it for the first time, and unfortunately it has to be done by hand.

To me it is fine to have the spell check mandatory in a package, but be aware that it all together (lintr, spell check) becomes quite an effort to add to an existing package. (The advantage to do it anyway, is to have the regular CMD checks and unittests checked during CI.)

ElsLommelen commented 2 years ago

I put function and variable names in the functions documentation into back ticks, and rebuilt the documentation to have the back ticks appear in the .Rd-files, but they are not ignored by check_spelling(), despite the spelling vignette states they should do so. Any idea what could go wrong? An example here.

ThierryO commented 2 years ago

I put function and variable names in the functions documentation into back ticks, and rebuilt the documentation to have the back ticks appear in the .Rd-files, but they are not ignored by check_spelling(), despite the spelling vignette states they should do so. Any idea what could go wrong? An example here.

Add Roxygen: list(markdown = TRUE) to the DESCRIPTION and redocument the package. The backticks will be rendered as \code{}. Then they should be ignored.

ElsLommelen commented 2 years ago

Thanks! This indeed solves the problem.

ElsLommelen commented 2 years ago

xxx / yyy / zzz seems to be the proper way to write this. I'll remove any / that is surrounded by whitespace. So xxx / yyy / zzz will be checked as xxx yyy zzz, whereas xxx/yyy/zzz will remain xxx/yyy/zzz.

Could it be that this issue is only solved for .Rmd files and not for .Rd files (yet)?

ElsLommelen commented 2 years ago

An issue that may be typical to the Dutch language: the use of hyphens. Some examples to illustrate:

  • zomer- en wintereik gives a spelling error for zomer-
  • html-bestand or `.html`-bestand I'm not sure this can be solved on a general level?

Note that `.html`-bestand is viewed as -bestand by the spell checker.

I think it is only safe to the replace the hyphen in case it is in the middle of the word (html-bestand becomes two words html and bestand). Or when the word starts with a hyphen (-bestand is checked a bestand). I'm reluctant to change that for words ending with an hyphen. They might not be a correct word on themselves. In such case, add them to the custom dictionary.

In case of zomer- en wintereik, I 'd prefer to add zomer- en wintereik to to the custom dictionary instead of zomer-, because the latter is not a correct word on itself (like you mention). Is it an option to implement that, or is it too complicated? (It can be useful for expressions as well, but the use is maybe too limited to do the effort?)

(Of course I don't mind adding the word manually to the dictionary file, but for now adding this is not preventing from marking zomer- as an error.)

ThierryO commented 2 years ago

xxx / yyy / zzz seems to be the proper way to write this. I'll remove any / that is surrounded by whitespace. So xxx / yyy / zzz will be checked as xxx yyy zzz, whereas xxx/yyy/zzz will remain xxx/yyy/zzz.

Could it be that this issue is only solved for .Rmd files and not for .Rd files (yet)?

fixed in abb87f7

ThierryO commented 2 years ago

An issue that may be typical to the Dutch language: the use of hyphens. Some examples to illustrate:

  • zomer- en wintereik gives a spelling error for zomer-
  • html-bestand or `.html`-bestand I'm not sure this can be solved on a general level?

Note that `.html`-bestand is viewed as -bestand by the spell checker. I think it is only safe to the replace the hyphen in case it is in the middle of the word (html-bestand becomes two words html and bestand). Or when the word starts with a hyphen (-bestand is checked a bestand). I'm reluctant to change that for words ending with an hyphen. They might not be a correct word on themselves. In such case, add them to the custom dictionary.

In case of zomer- en wintereik, I 'd prefer to add zomer- en wintereik to to the custom dictionary instead of zomer-, because the latter is not a correct word on itself (like you mention). Is it an option to implement that, or is it too complicated? (It can be useful for expressions as well, but the use is maybe too limited to do the effort?)

(Of course I don't mind adding the word manually to the dictionary file, but for now adding this is not preventing from marking zomer- as an error.)

I don't think that would work. hunspell:hunspell() does the spell checking. It considers zomer- en wintereik as three separate words.