ropensci / spelling

Tools for Spell Checking in R
https://docs.ropensci.org/spelling

Support alphanumeric and hyphenated words #45

Open nuno-agostinho opened 4 years ago

nuno-agostinho commented 4 years ago

I am using the following words in my package:

After inserting these words in inst/WORDLIST and running spelling::spell_check_package(), the function reports that the words seq, st, nd and EIF are misspelled.

As a workaround, my WORDLIST currently lists the fragments seq, st, nd and EIF so that the spell checker does not flag them, but I would prefer to list the full words. Thanks.

jmbarbone commented 3 years ago

I ran into the same issue with ordinal indicators. It looks like the problem is in the hunspell parser, which drops digits and splits on hyphens before any word is compared against the word list:

hunspell::hunspell_parse(c("1st", "RNA-seq", "EIF4G1"))
#> [[1]]
#> [1] "st"
#> 
#> [[2]]
#> [1] "RNA" "seq"
#> 
#> [[3]]
#> [1] "EIF" "G"

Created on 2021-02-06 by the reprex package (v0.3.0)
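
To find out ahead of time which entries of a word list will be broken up like this, one could compare each word with what the parser returns for it. The following is only a sketch: find_split_words() is a hypothetical helper, not part of spelling or hunspell, and its parse argument is an assumption added so that another tokenizer can be swapped in.

```r
# Hypothetical helper (not part of spelling or hunspell): report the words
# that a tokenizer does not return as a single, unchanged token.
# `parse` must return a list with one character vector of tokens per word;
# by default it is hunspell::hunspell_parse, as used in the reprex above.
find_split_words <- function(words, parse = hunspell::hunspell_parse) {
  tokens <- parse(words)
  # A word is intact only if it comes back as exactly itself
  intact <- vapply(
    seq_along(words),
    function(i) identical(tokens[[i]], words[[i]]),
    logical(1)
  )
  # Keep only the broken words, named by the original spelling
  stats::setNames(tokens[!intact], words[!intact])
}

# Only runs when hunspell is installed
if (requireNamespace("hunspell", quietly = TRUE)) {
  print(find_split_words(c("1st", "RNA-seq", "EIF4G1", "spelling")))
}
```

Every word this returns would currently need its fragments in inst/WORDLIST instead of the full word.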

jmbarbone commented 3 years ago

Implementing a pre-filter that removes the ignored words right before the parse step here could work:

https://github.com/ropensci/spelling/blob/a2b5f29856b6a067e33d45e29ae3aa4b88ed6176/R/check-files.R#L118-L123

It feels more like a quick fix, because it splits each line with strsplit(), drops the ignored words, and then paste()s the rest back together before the text is sent to the actual parsing function.

ignore_words <- c("1st", "RNA-seq", "EIF4G1")

lines <- c(
  "This is the 1st line.  It has first written in it.",
  "The second has RNA-seq inside. But does not use RNAseq -- without the '-'",
  "EIF4G1 but not EIF4G1fdsadf is used",
  "This line's words are fine!"
)

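pre_filter_plain <- function(lines, ignore = character()) {
  # Split on any character that is neither alphanumeric nor punctuation
  # (in practice whitespace), so "RNA-seq" and "line." stay single tokens
  word_list <- strsplit(lines, "([^-[:alnum:][:punct:]])")

  vapply(
    word_list,
    function(i) {
      # Drop tokens that exactly match an ignored word, then re-join
      paste(i[!i %in% ignore], collapse = " ")
    },
    character(1)
  )
}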

pre_filter_plain(lines, ignore_words)
#> [1] "This is the line.  It has first written in it."                   
#> [2] "The second has inside. But does not use RNAseq -- without the '-'"
#> [3] "but not EIF4G1fdsadf is used"                                     
#> [4] "This line's words are fine!"

Created on 2021-02-06 by the reprex package (v0.3.0)
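
One limitation of the split-and-paste approach is that a token only matches when it is exactly an ignored word, so "1st," or "RNA-seq." with punctuation attached would slip through. A possible variant, sketched below with base R only, blanks out matches in place with gregexpr() and regmatches<-. The function name pre_filter_tokens() and the token pattern (runs of letters and digits, optionally joined by single hyphens) are assumptions made for illustration, not anything spelling or hunspell defines.

```r
# Hypothetical variant of pre_filter_plain(): remove ignored words in place
# instead of splitting and re-pasting each line.
pre_filter_tokens <- function(lines, ignore = character()) {
  # Assumed token shape: alphanumeric runs, optionally joined by single hyphens
  pattern <- "[[:alnum:]]+(-[[:alnum:]]+)*"
  matches <- gregexpr(pattern, lines)
  tokens <- regmatches(lines, matches)

  # Replace each token that is an ignored word with an empty string;
  # all other text, including punctuation and spacing, is left untouched
  regmatches(lines, matches) <- lapply(tokens, function(tok) {
    tok[tok %in% ignore] <- ""
    tok
  })
  lines
}

pre_filter_tokens("The 1st, 2nd and RNA-seq.", c("1st", "RNA-seq", "EIF4G1"))
#> [1] "The , 2nd and ."
```

Because whole tokens are compared, EIF4G1fdsadf is still left in the text for the spell checker to flag, as in the original example.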