ropensci / hunspell

High-Performance Stemmer, Tokenizer, and Spell Checker for R
https://docs.ropensci.org/hunspell

Example spelling "mistakes" #54

Open statzhero opened 1 year ago

statzhero commented 1 year ago

I don't understand the document and PDF examples. These words, while not in the dictionary, don't seem like mistakes. Would they get corrected?

 [1] "auth"              "CORBA"             "cpu"              
 [4] "cran"              "cron"              "css"              
 [7] "csv"               "CTRL"              "DCOM"             
[10] "de"                "dec"               "decompositions"   
[13] "dir"               "DOM"               "DSL"              
[16] "eol"               "ESC"               "facto"            
[19] "grDevices"         "httpuv"            "ignorable"        
[22] "interoperable"     "JRI"               "js"               
[25] "json"              "jsonlite"          "knitr"            
[28] "md"                "memcached"         "mydata"           
[31] "myfile"            "NaN"               "nondegenerateness"
[34] "OAuth"             "ocpu"              "opencpu"          
[37] "OpenCPU"           "pandoc"            "pb"               
[40] "php"               "png"               "prescripted"      
[43] "priori"            "protobuf"          "rApache"          
[46] "rda"               "rds"               "reproducibility"  
[49] "Reproducibility"   "RinRuby"           "RInside"          
[52] "rlm"               "rmd"               "rnorm"            
[55] "rnw"               "RPC"               "RProtoBuf"        
[58] "rpy"               "Rserve"            "RStudio"          
[61] "saveRDS"           "scalability"       "scalable"         
[64] "schemas"           "se"                "sep"              
[67] "SIGINT"            "STATA"             "stateful"         
[70] "Stateful"          "statefulness"      "stdout"           
[73] "STDOUT"            "suboptimal"        "svg"              
[76] "sweave"            "tex"               "texi"             
[79] "tmp"               "toJSON"            "urlencoded"       
[82] "www"               "xyz"

jeroen commented 1 year ago

You can review and whitelist them using a custom wordlist.
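
A minimal sketch of what that could look like with hunspell itself, assuming the default `en_US` dictionary. The sample text and the `accepted` vector are made up for illustration. `ignore` is an argument of `hunspell()` that defaults to the bundled `en_stats` list, so the accepted terms are appended to it here.

```r
library(hunspell)

# Hypothetical input text and hypothetical list of accepted terms
txt <- "We used knitr and jsonlite to build a scalable OpenCPU app."
accepted <- c("knitr", "jsonlite", "OpenCPU")

# Words hunspell does not recognise with the default settings
flagged_default <- hunspell(txt)[[1]]

# Same check, with the accepted terms added to the ignore list
flagged_custom <- hunspell(txt, ignore = c(en_stats, accepted))[[1]]

print(flagged_default)
print(flagged_custom)
```

Anything still listed in `flagged_custom` is a word that is neither in the dictionary nor in the ignore list, so it is what remains to be reviewed by hand.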

statzhero commented 1 year ago

Thank you -- to clarify, what is the default behavior in this example: would hunspell correct these words to a high-probability match, or leave them as they are?

jeroen commented 1 year ago

hunspell does not correct anything. It is a low-level tool for finding unknown words that could be mistakes. You can build on this information, e.g. with the spelling package, to filter the results against your own wordlist of accepted words.
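
To illustrate the point, the sketch below keeps checking, suggesting, and filtering as three separate steps. The input words and the `wordlist` vector are illustrative assumptions. The `setdiff()` step is only a base-R stand-in for the kind of filtering a package such as spelling provides, not its actual implementation.

```r
library(hunspell)

# Hypothetical input words
words <- c("reproducibility", "stateful", "langauge")

# hunspell_check() only reports whether each word is in the dictionary
known <- hunspell_check(words)
unknown <- words[!known]

# Suggestions are a separate, explicit call; the input is never modified
suggestions <- hunspell_suggest(unknown)

# Hypothetical wordlist of accepted terms, removed from the unknown words
wordlist <- c("reproducibility", "stateful")
to_review <- setdiff(unknown, wordlist)

print(to_review)
```

Whether a suggestion is ever applied is left entirely to the caller: `suggestions` is just a list of candidate replacements, one element per unknown word.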