topepo / Cubist

An R package for fitting Quinlan's Cubist regression model
http://topepo.github.io/Cubist

Escaping special characters and umlaut parsing fails #23

Open philipp-baumann opened 5 years ago

philipp-baumann commented 5 years ago

Hi Max,

Some colleagues observed that caret::train() with method = "cubist" errors when factor values in the predictors contain certain special characters. The error traces back to Cubist::cubist().

I really like Cubist because of its speed and its straightforward way of interpreting results. Thanks a lot for the energy you invested in this nice and clean R port!

I thought I'd have a look into the issue to figure out a possible solution. Below are some tests with standard ASCII characters, some of which have special roles in Rulequest Cubist, and with a non-ASCII umlaut, to diagnose the errors, followed by a suggestion for resolving part of the issue:

################################################################################
## Description: Special characters are not correctly escaped for values in 
##   Cubist data file
################################################################################

library("mlbench")
library("Cubist")
#> Warning: package 'Cubist' was built under R version 3.4.4
#> Loading required package: lattice
library("tidyverse")
#> ── Attaching packages ──────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
#> ✔ tibble  1.4.2     ✔ dplyr   0.7.8
#> ✔ tidyr   0.8.2     ✔ stringr 1.3.1
#> ✔ readr   1.1.1     ✔ forcats 0.2.0
#> Warning: package 'ggplot2' was built under R version 3.4.4
#> Warning: package 'tibble' was built under R version 3.4.3
#> Warning: package 'tidyr' was built under R version 3.4.4
#> Warning: package 'purrr' was built under R version 3.4.4
#> Warning: package 'dplyr' was built under R version 3.4.4
#> Warning: package 'stringr' was built under R version 3.4.4
#> ── Conflicts ─────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
# Example data set
data(BostonHousing)

# Test with only 2 factor predictors
boston_housing <- as_tibble(BostonHousing[, c("crim", "zn", "medv")])
# Convert numeric `crim` and `zn` to factors
boston_housing <- boston_housing %>%
  mutate(
    zn = as.factor(zn),
    crim = as.factor(crim)
  )
#> Warning: package 'bindrcpp' was built under R version 3.4.4

## See https://www.rulequest.com/cubist-unix.html for exceptions:
# "What's in a name?
## Special characters (comma, colon, period, vertical bar `|') can appear in
## names and values, but must be prefixed by the escape character `\'. 
## For example, the name "Filch, Grabbit, and Co." would be written as `Filch\,
## Grabbit\, and Co\.'. (However, it is not necessary to escape colons in times 
## and periods in numbers.)"

# Test (1) ASCII, no umlaut ----------------------------------------------------

# Recode factor levels
(boston_housing_chars <- boston_housing %>%
  mutate(
    # Recode respective level of first value
    crim = recode(crim, `0.00632` = "a@_$",  .default = levels(crim)),
    zn = recode(zn, `18` = "a@_$?", .default = levels(zn))
  ) %>%
  rename(`zn|d` = zn))
#> # A tibble: 506 x 3
#>    crim    `zn|d`  medv
#>    <fct>   <fct>  <dbl>
#>  1 a@_$    a@_$?   24.0
#>  2 0.02731 0       21.6
#>  3 0.02729 0       34.7
#>  4 0.03237 0       33.4
#>  5 0.06905 0       36.2
#>  6 0.02985 0       28.7
#>  7 0.08829 12.5    22.9
#>  8 0.14455 12.5    27.1
#>  9 0.21124 12.5    16.5
#> 10 0.17004 12.5    18.9
#> # ... with 496 more rows

# Fine, works
(mod_housing_chars <- 
  cubist(x = boston_housing_chars[, -c(3)], y = boston_housing_chars$medv,
    committees = 10))
#> 
#> Call:
#> cubist.default(x = boston_housing_chars[, -c(3)], y
#>  = boston_housing_chars$medv, committees = 10)
#> 
#> Number of samples: 506 
#> Number of predictors: 2 
#> 
#> Number of committees: 10 
#> Number of rules per committee: 31, 28, 28, 26, 27, 25, 27, 24, 30, 24
# Test (2) Umlaut "ä" ----------------------------------------------------------
# Recode factor levels
(boston_housing_umlaut <- boston_housing %>%
  mutate(
    # Recode respective level of first value
    crim = recode(crim, `0.00632` = "a@_$ä",  .default = levels(crim)),
    zn = recode(zn, `18` = "a@_$?", .default = levels(zn))
  ) %>%
  rename(`zn|d` = zn))
#> # A tibble: 506 x 3
#>    crim    `zn|d`  medv
#>    <fct>   <fct>  <dbl>
#>  1 a@_$ä   a@_$?   24.0
#>  2 0.02731 0       21.6
#>  3 0.02729 0       34.7
#>  4 0.03237 0       33.4
#>  5 0.06905 0       36.2
#>  6 0.02985 0       28.7
#>  7 0.08829 12.5    22.9
#>  8 0.14455 12.5    27.1
#>  9 0.21124 12.5    16.5
#> 10 0.17004 12.5    18.9
#> # ... with 496 more rows
# Error
mod_housing_umlaut <- 
  cubist(x = boston_housing_umlaut[, -c(3)], y = boston_housing_umlaut$medv,
    committees = 10)
#> cubist code called exit with value 1
#> Error in strsplit(tmp, "\"")[[1]]: subscript out of bounds
# Test (3) comma "," -----------------------------------------------------------
# Recode factor levels
(boston_housing_comma <- boston_housing %>%
  mutate(
    # Recode respective level of first value
    crim = recode(crim, `0.00632` = "a@_$,", .default = levels(crim)),
    zn = recode(zn, `18` = "a@_$?", .default = levels(zn))
  ) %>%
  rename(`zn|d` = zn))
#> # A tibble: 506 x 3
#>    crim    `zn|d`  medv
#>    <fct>   <fct>  <dbl>
#>  1 a@_$,   a@_$?   24.0
#>  2 0.02731 0       21.6
#>  3 0.02729 0       34.7
#>  4 0.03237 0       33.4
#>  5 0.06905 0       36.2
#>  6 0.02985 0       28.7
#>  7 0.08829 12.5    22.9
#>  8 0.14455 12.5    27.1
#>  9 0.21124 12.5    16.5
#> 10 0.17004 12.5    18.9
#> # ... with 496 more rows
# Error
(mod_housing_comma <- 
  cubist(x = boston_housing_comma[, -c(3)], y = boston_housing_comma$medv,
    committees = 10))
#> cubist code called exit with value 1
#> Error in strsplit(tmp, "\"")[[1]]: subscript out of bounds
# Test (4) period "." ----------------------------------------------------------
# Recode factor levels
(boston_housing_period <- boston_housing %>%
  mutate(
    # Recode respective level of first value
    crim = recode(crim, `0.00632` = "a@_$.",  .default = levels(crim)),
    zn = recode(zn, `18` = "a@_$?", .default = levels(zn))
  ) %>%
  rename(`zn|d` = zn))
#> # A tibble: 506 x 3
#>    crim    `zn|d`  medv
#>    <fct>   <fct>  <dbl>
#>  1 a@_$.   a@_$?   24.0
#>  2 0.02731 0       21.6
#>  3 0.02729 0       34.7
#>  4 0.03237 0       33.4
#>  5 0.06905 0       36.2
#>  6 0.02985 0       28.7
#>  7 0.08829 12.5    22.9
#>  8 0.14455 12.5    27.1
#>  9 0.21124 12.5    16.5
#> 10 0.17004 12.5    18.9
#> # ... with 496 more rows
# Works
(mod_housing_period <- 
  cubist(x = boston_housing_period[, -c(3)], y = boston_housing_period$medv,
    committees = 10))
#> 
#> Call:
#> cubist.default(x = boston_housing_period[, -c(3)], y
#>  = boston_housing_period$medv, committees = 10)
#> 
#> Number of samples: 506 
#> Number of predictors: 2 
#> 
#> Number of committees: 10 
#> Number of rules per committee: 31, 28, 28, 26, 27, 25, 27, 24, 30, 24
# Test (5) colon ":" -----------------------------------------------------------
# Recode factor levels
(boston_housing_colon <- boston_housing %>%
  mutate(
    # Recode respective level of first value
    crim = recode(crim, `0.00632` = "a@_$:",  .default = levels(crim)),
    zn = recode(zn, `18` = "a@_$?", .default = levels(zn))
  ) %>%
  rename(`zn|d` = zn))
#> # A tibble: 506 x 3
#>    crim    `zn|d`  medv
#>    <fct>   <fct>  <dbl>
#>  1 a@_$:   a@_$?   24.0
#>  2 0.02731 0       21.6
#>  3 0.02729 0       34.7
#>  4 0.03237 0       33.4
#>  5 0.06905 0       36.2
#>  6 0.02985 0       28.7
#>  7 0.08829 12.5    22.9
#>  8 0.14455 12.5    27.1
#>  9 0.21124 12.5    16.5
#> 10 0.17004 12.5    18.9
#> # ... with 496 more rows
# Error
(mod_housing_colon <- 
  cubist(x = boston_housing_colon[, -c(3)], y = boston_housing_colon$medv,
    committees = 10))
#> cubist code called exit with value 1
#> Error in strsplit(tmp, "\"")[[1]]: subscript out of bounds
# Test (6) vertical bar "|" ----------------------------------------------------
# Recode factor levels
(boston_housing_bar <- boston_housing %>%
  mutate(
    # Recode respective level of first value
    crim = recode(crim, `0.00632` = "a@_$|",  .default = levels(crim)),
    zn = recode(zn, `18` = "a@_$?", .default = levels(zn))
  ) %>%
  rename(`zn|d` = zn))
#> # A tibble: 506 x 3
#>    crim    `zn|d`  medv
#>    <fct>   <fct>  <dbl>
#>  1 a@_$|   a@_$?   24.0
#>  2 0.02731 0       21.6
#>  3 0.02729 0       34.7
#>  4 0.03237 0       33.4
#>  5 0.06905 0       36.2
#>  6 0.02985 0       28.7
#>  7 0.08829 12.5    22.9
#>  8 0.14455 12.5    27.1
#>  9 0.21124 12.5    16.5
#> 10 0.17004 12.5    18.9
#> # ... with 496 more rows
# Error
(mod_housing_bar <- 
  cubist(x = boston_housing_bar[, -c(3)], y = boston_housing_bar$medv,
    committees = 10))
#> cubist code called exit with value 1
#> Error in strsplit(tmp, "\"")[[1]]: subscript out of bounds
# Test (7) semicolon ";" -------------------------------------------------------
# Recode factor levels
(boston_housing_semicol <- boston_housing %>%
  mutate(
    # Recode respective level of first value
    crim = recode(crim, `0.00632` = "a@_$;",  .default = levels(crim)),
    zn = recode(zn, `18` = "a@_$?", .default = levels(zn))
  ) %>%
  rename(`zn|d` = zn))
#> # A tibble: 506 x 3
#>    crim    `zn|d`  medv
#>    <fct>   <fct>  <dbl>
#>  1 a@_$;   a@_$?   24.0
#>  2 0.02731 0       21.6
#>  3 0.02729 0       34.7
#>  4 0.03237 0       33.4
#>  5 0.06905 0       36.2
#>  6 0.02985 0       28.7
#>  7 0.08829 12.5    22.9
#>  8 0.14455 12.5    27.1
#>  9 0.21124 12.5    16.5
#> 10 0.17004 12.5    18.9
#> # ... with 496 more rows
# Error
(mod_housing_semicol <- 
  cubist(x = boston_housing_semicol[, -c(3)], y = boston_housing_semicol$medv,
    committees = 10))
#> cubist code called exit with value 1
#> Error in strsplit(tmp, "\"")[[1]]: subscript out of bounds
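
The seven tests above differ only in the character appended to the first `crim` level, so they could be condensed into a loop. The following is a minimal sketch and not part of the original reprex: `probe_chars()`, `make_data`, and `fit` are hypothetical names, the `zn` recoding and the `zn|d` rename are left out for brevity, and the Cubist call only runs if both packages are installed.

```r
# Hypothetical helper: fit once per candidate character and record whether
# the fit succeeded or raised an R error.
probe_chars <- function(chars, make_data, fit) {
  vapply(chars, function(ch) {
    tryCatch({
      fit(make_data(ch))
      "ok"
    }, error = function(e) "error")
  }, character(1))
}

if (requireNamespace("Cubist", quietly = TRUE) &&
    requireNamespace("mlbench", quietly = TRUE)) {
  data("BostonHousing", package = "mlbench")

  # Simplified version of the setup above: `crim` and `zn` become factors and
  # the level of the first `crim` value is replaced by "a@_$" plus `ch`.
  make_data <- function(ch) {
    crim <- as.character(BostonHousing$crim)
    crim[crim == crim[1]] <- paste0("a@_$", ch)
    x <- data.frame(crim = factor(crim), zn = factor(BostonHousing$zn))
    list(x = x, y = BostonHousing$medv)
  }
  fit <- function(d) Cubist::cubist(x = d$x, y = d$y, committees = 10)

  # "\u00e4" is the umlaut "ä", written as an escape for portability.
  print(probe_chars(c("", "\u00e4", ",", ".", ":", "|", ";"), make_data, fit))
}
```

The result is a named character vector with one entry per tested character, which makes it easier to compare package versions or candidate fixes side by side.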

Based on the errors above, escaping does not work for the following characters: ",", ":", ";", "|", and "ä". However, "." works fine. I was quite surprised, because according to the Rulequest info page (https://www.rulequest.com/cubist-unix.html), escaping should work for comma, colon, period, and vertical bar.

My guess is that "." is not a problem because C Cubist parses the values in the data file correctly due to separation by comma, and escaping has no effect. Here is the output from the current escaping helper:

Cubist:::escapes(",:.|;ä")
#> [1] "\\,\\\\\\:\\.\\\\\\|\\\\\\;\xc3\\\xa4"
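
To see where the runs of backslashes in this output come from, the two substitution steps of the current helper (quoted further down in this issue) can be applied one at a time to a single colon. This is only a sketch of what the R side produces; how the C code then reads the result is not covered here.

```r
x <- "a:b"

# Step 1: the loop over `chars` prefixes ":" with one backslash. With
# fixed = TRUE the replacement is taken literally.
step1 <- gsub(":", paste("\\", ":", sep = ""), x, fixed = TRUE)

# Step 2: the catch-all pattern then escapes every remaining character that is
# not alphanumeric, "^", or whitespace, including the backslash from step 1.
step2 <- gsub("([^[:alnum:]^[:space:]])", "\\\\\\1", step1, useBytes = TRUE)

writeLines(step1)  # a\:b
writeLines(step2)  # a\\\:b
```

So after both steps the colon is preceded by three backslashes rather than the single escape character that the Rulequest page describes. The comma and period are only touched by step 2 and therefore get a single backslash.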

Here is the session info output:

devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.4.2 (2017-09-28)
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  tz       Europe/Zurich               
#>  date     2019-01-20
#> Packages -----------------------------------------------------------------
#>  package    * version date       source         
#>  assertthat   0.2.0   2017-04-11 CRAN (R 3.4.2) 
#>  backports    1.1.2   2017-12-13 cran (@1.1.2)  
#>  base       * 3.4.2   2017-10-04 local          
#>  bindr        0.1.1   2018-03-13 CRAN (R 3.4.4) 
#>  bindrcpp     0.2.2   2018-03-29 CRAN (R 3.4.4) 
#>  broom        0.4.5   2018-07-03 cran (@0.4.5)  
#>  cellranger   1.1.0   2016-07-27 CRAN (R 3.4.2) 
#>  cli          1.0.0   2017-11-05 CRAN (R 3.4.2) 
#>  colorspace   1.3-2   2016-12-14 CRAN (R 3.4.2) 
#>  compiler     3.4.2   2017-10-04 local          
#>  crayon       1.3.4   2017-09-16 CRAN (R 3.4.2) 
#>  Cubist     * 0.2.2   2018-05-21 CRAN (R 3.4.4) 
#>  datasets   * 3.4.2   2017-10-04 local          
#>  devtools     1.13.4  2017-11-09 CRAN (R 3.4.2) 
#>  digest       0.6.18  2018-10-10 cran (@0.6.18) 
#>  dplyr      * 0.7.8   2018-11-10 cran (@0.7.8)  
#>  evaluate     0.10.1  2017-06-24 CRAN (R 3.4.2) 
#>  forcats    * 0.2.0   2017-01-23 CRAN (R 3.4.2) 
#>  foreign      0.8-69  2017-06-21 CRAN (R 3.4.2) 
#>  ggplot2    * 3.1.0   2018-10-25 cran (@3.1.0)  
#>  glue         1.3.0   2018-07-17 cran (@1.3.0)  
#>  graphics   * 3.4.2   2017-10-04 local          
#>  grDevices  * 3.4.2   2017-10-04 local          
#>  grid         3.4.2   2017-10-04 local          
#>  gtable       0.2.0   2016-02-26 CRAN (R 3.4.2) 
#>  haven        1.1.0   2017-07-09 CRAN (R 3.4.2) 
#>  hms          0.4.0   2017-11-23 CRAN (R 3.4.3) 
#>  htmltools    0.3.6   2017-04-28 CRAN (R 3.4.2) 
#>  httr         1.3.1   2017-08-20 CRAN (R 3.4.1) 
#>  jsonlite     1.5     2017-06-01 CRAN (R 3.4.2) 
#>  knitr        1.20    2018-02-20 CRAN (R 3.4.3) 
#>  lattice    * 0.20-35 2017-03-25 CRAN (R 3.4.2) 
#>  lazyeval     0.2.1   2017-10-29 CRAN (R 3.4.2) 
#>  lubridate    1.7.4   2018-04-11 cran (@1.7.4)  
#>  magrittr     1.5     2014-11-22 CRAN (R 3.4.2) 
#>  memoise      1.1.0   2017-04-21 CRAN (R 3.4.2) 
#>  methods    * 3.4.2   2017-10-04 local          
#>  mlbench    * 2.1-1   2012-07-10 CRAN (R 3.4.0) 
#>  mnormt       1.5-5   2016-10-15 CRAN (R 3.4.0) 
#>  modelr       0.1.2   2018-05-11 cran (@0.1.2)  
#>  munsell      0.5.0   2018-06-12 cran (@0.5.0)  
#>  nlme         3.1-131 2017-02-06 CRAN (R 3.4.2) 
#>  parallel     3.4.2   2017-10-04 local          
#>  pillar       1.1.0   2018-01-14 cran (@1.1.0)  
#>  pkgconfig    2.0.2   2018-08-16 cran (@2.0.2)  
#>  plyr         1.8.4   2016-06-08 CRAN (R 3.4.2) 
#>  psych        1.8.4   2018-05-06 cran (@1.8.4)  
#>  purrr      * 0.2.5   2018-05-29 cran (@0.2.5)  
#>  R6           2.3.0   2018-10-04 cran (@2.3.0)  
#>  Rcpp         1.0.0   2018-11-07 cran (@1.0.0)  
#>  readr      * 1.1.1   2017-05-16 CRAN (R 3.4.2) 
#>  readxl       1.0.0   2017-04-18 CRAN (R 3.4.2) 
#>  reshape2     1.4.3   2017-12-11 cran (@1.4.3)  
#>  rlang        0.3.0.1 2018-10-25 cran (@0.3.0.1)
#>  rmarkdown    1.10    2018-06-11 CRAN (R 3.4.4) 
#>  rprojroot    1.2     2017-01-16 CRAN (R 3.4.2) 
#>  rstudioapi   0.8     2018-10-02 cran (@0.8)    
#>  rvest        0.3.2   2016-06-17 CRAN (R 3.4.2) 
#>  scales       1.0.0   2018-08-09 cran (@1.0.0)  
#>  stats      * 3.4.2   2017-10-04 local          
#>  stringi      1.2.4   2018-07-20 cran (@1.2.4)  
#>  stringr    * 1.3.1   2018-05-10 cran (@1.3.1)  
#>  tibble     * 1.4.2   2018-01-22 cran (@1.4.2)  
#>  tidyr      * 0.8.2   2018-10-28 cran (@0.8.2)  
#>  tidyselect   0.2.5   2018-10-11 cran (@0.2.5)  
#>  tidyverse  * 1.2.1   2017-11-14 CRAN (R 3.4.2) 
#>  tools        3.4.2   2017-10-04 local          
#>  utils      * 3.4.2   2017-10-04 local          
#>  withr        2.1.2   2018-03-15 cran (@2.1.2)  
#>  xml2         1.1.1   2017-01-24 CRAN (R 3.4.2) 
#>  yaml         2.1.15  2017-12-01 CRAN (R 3.4.3)

I made a commit in the forked repo that fixes part of the issue:

# Current escapes function
# https://github.com/topepo/Cubist/blob/master/R/makeNamesFile.R
escapes <- function(x, chars = c(":", ";", "|")) {
  for (i in chars)
    x <- gsub(i, paste("\\", i, sep = ""), x, fixed = TRUE)
  gsub("([^[:alnum:]^[:space:]])", "\\\\\\1", x, useBytes = TRUE)  
}
# Modified escapes function
escapes <- function(x, pattern = "([,:|])") {
  gsub(pattern,  "\\\\\\1", x, useBytes = TRUE)
}
escapes(",:.|;ä")
#> [1] "\\,\\:.\\|;ä"

The new escapes() helper only escapes ",", ":", and "|". This resolves the issues with umlaut parsing (no fixed = TRUE in gsub()), and Cubist now works when factor variables contain umlauts. The change also lets Cubist::cubist() fit successfully when factor values contain a semicolon ";", but unfortunately not for the remaining special characters: I was not able to figure out how to get escaping of ",", ":", and "|" working.
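
The `\xc3\\\xa4` at the end of the old helper's output suggests that a backslash was inserted between the two UTF-8 bytes of "ä". Below is a small base R sketch of why that yields an invalid string, and why a pattern restricted to ASCII characters leaves the bytes alone. It illustrates the R side only and assumes UTF-8 encoding.

```r
# "ä", written as an escape for portability and forced to UTF-8.
umlaut <- enc2utf8("\u00e4")
print(charToRaw(umlaut))  # the two UTF-8 bytes c3 a4

# Byte sequence implied by the old output: c3, backslash (5c), a4.
split_bytes <- rawToChar(as.raw(c(0xc3, 0x5c, 0xa4)))
print(validUTF8(umlaut))       # TRUE
print(validUTF8(split_bytes))  # FALSE: c3 is no longer followed by a continuation byte

# The modified pattern matches only ",", ":" and "|", so the umlaut's bytes
# pass through unchanged even with useBytes = TRUE.
escaped <- gsub("([,:|])", "\\\\\\1", umlaut, useBytes = TRUE)
print(identical(charToRaw(escaped), charToRaw(umlaut)))  # TRUE
```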

I have no experience in C (yet). Do you have any ideas why escaping fails for these reserved Cubist characters? Or is it an issue in the original Rulequest source code and escaping for values is not supported, despite being mentioned in the Rulequest overview webpage? Maybe this is also locale specific and depends on the encoding conversion between C Cubist files and R objects.

It would be great to fully support escaping, because these special characters are quite common. If there is no easy solution, I think it would be helpful to include checks in Cubist::cubist() and let it error with an informative message when these characters occur in factor or character columns of the predictor data frame.
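
Such a check could look roughly like the sketch below. `check_reserved_chars()` is a hypothetical name and not part of the Cubist package, and the default set of reserved characters is an assumption based on the tests above (the characters that still fail with the modified helper).

```r
# Hypothetical pre-flight check: stop with an informative message if any
# factor or character column contains a reserved character.
check_reserved_chars <- function(x, reserved = c(",", ":", "|")) {
  is_cat <- vapply(x, function(col) is.factor(col) || is.character(col),
                   logical(1))
  bad <- character(0)
  for (nm in names(x)[is_cat]) {
    # Factor levels or distinct character values of this column.
    vals <- if (is.factor(x[[nm]])) levels(x[[nm]]) else unique(x[[nm]])
    hit <- vapply(reserved,
                  function(ch) any(grepl(ch, vals, fixed = TRUE)),
                  logical(1))
    if (any(hit)) {
      bad <- c(bad, sprintf("%s (%s)", nm, paste(reserved[hit], collapse = " ")))
    }
  }
  if (length(bad) > 0) {
    stop("Reserved Cubist characters found in the values of: ",
         paste(bad, collapse = "; "), call. = FALSE)
  }
  invisible(x)
}

# Usage: the comma in a level of `a` triggers the error, which is caught
# here so that only its message is shown.
df <- data.frame(a = factor(c("x", "y,z")), b = 1:2)
tryCatch(check_reserved_chars(df),
         error = function(e) message(conditionMessage(e)))
```

Matching with fixed = TRUE avoids having to escape "|" as a regular expression, and checking factor levels instead of every value keeps the scan cheap on large data frames.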

Thanks for your help, looking forward to your insight into this issue.

Cheers, Philipp

topepo commented 3 years ago

> Or is it an issue in the original Rulequest source code and escaping for values is not supported, despite being mentioned in the Rulequest overview webpage?

Yeah, it's their original limitation.

Do you want to PR or should I just make the change?

philipp-baumann commented 1 year ago

Hi @topepo, sorry, I only just saw this; it's been a long time. If you have time to make the change, that would be great! Otherwise, if this is an original limitation, I guess a note in the README would also suffice. Thanks for making this package. Cheers