topepo / C5.0

An R package for fitting Quinlan's C5.0 classification model
https://topepo.github.io/C5.0/

Missing character predictors cause model build to fail #39

Open jlries61 opened 3 years ago

jlries61 commented 3 years ago

Consider the following command sequence:

library(C50)
library(dplyr)

indataName <- "bankmkt_part1m10.csv"
target <- "y"
NPART <- 20
keep <- c()
exclude <- c()

# Build the keep list automatically: start from `keep` (or from all columns
# when `keep` is empty), drop `exclude`, and match names case-insensitively.
mkkeep <- function(dataset, keep, exclude) {
  varnames <- names(dataset)
  # Map upper-cased names back to the original column names
  orgnames <- list()
  for (varname in varnames) orgnames[toupper(varname)] <- varname
  if (is.null(keep)) keep <- orgnames
  KEEP <- toupper(keep)
  EXCLUDE <- toupper(exclude)
  KEEP <- setdiff(KEEP, EXCLUDE)
  KEEP <- intersect(KEEP, names(orgnames))
  return(unlist(orgnames[KEEP]))
}

indata <- read.csv(indataName)
indata[, target] <- factor(indata[, target])

for (part in 1:NPART) exclude <- c(exclude, paste0("SAMPLE", part))
exclude <- c(exclude, target)
keep <- mkkeep(indata, keep, exclude)
form <- as.formula(paste(target, "~", paste(keep, collapse = "+")))
model <- C5.0(formula = form, data = indata, trials = 1)
summary(model)

The model build fails with the following message: "c50 code called exit with value 1".

summary(model) produces the following:


Call:
C5.0.formula(formula = form, data = indata, trials = 1, control
 = C5.0Control(subset = FALSE, winnow = TRUE, noGlobalPruning = FALSE))

C5.0 [Release 2.07 GPL Edition]     Sat Jul 31 11:52:01 2021
-------------------------------

*** line 7 of `undefined.names': missing name or value before `,'

Error limit exceeded

The value of model$names is:

[1] "| Generated using R version 4.0.5 (2021-03-31)\n| on Sat Jul 31 11:59:19 2021\noutcome.\n\noutcome: 0,1.\nage: continuous.\njob: management,technician,entrepreneur,blue-collar,unknown,retired,admin.,services,,self-employed,unemployed,housemaid,student.\nmarital: ,single,married,divorced.\neducation: tertiary,secondary,unknown,primary,.\ndefault: continuous.\nbalance: continuous.\nhousing: yes,,no.\nloan: continuous.\ncontact: unknown,cellular,telephone.\nday: continuous.\nmonth: may,,jun,jul,aug,oct,nov,dec,jan,feb,mar,apr,sep.\nduration: continuous.\ncampaign: continuous.\npdays: continuous.\nprevious: continuous.\npoutcome: unknown,failure,other,success.\n"
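The empty entries in the names string above (for example `services,,self-employed`, `marital: ,single` and `primary,.`) suggest that missing values were read in as empty strings and ended up as attribute values. Below is a minimal sketch for checking which columns are affected. It uses a small made-up data frame (`csv_text`, `toy`) as a stand-in for the attached CSV, and it assumes the character columns are treated as factors:

```r
# Toy stand-in for the attached CSV (hypothetical values, not the real data).
# The blank fields in rows 2 and 3 play the role of the missing predictors.
csv_text <- "age,job,marital,y
30,management,single,0
41,,married,1
55,retired,,0"

# With the default na.strings = "NA", blank fields in character columns
# are kept as the empty string "" rather than being converted to NA.
toy <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Flag every factor column that has "" among its levels.
has_empty_level <- vapply(
  toy,
  function(col) is.factor(col) && "" %in% levels(col),
  logical(1)
)
empty_cols <- names(toy)[has_empty_level]
print(empty_cols)
```

Running the same check on `indata` (after converting its character columns to factors) should list the predictors whose level lists contain an empty name.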

Change the value of indataName to "bankmkt_part1.csv" (which has no missing values) and the model is built normally. A zip file containing the R script and the two datasets is attached: c50bug.zip
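A possible workaround, not confirmed by the package maintainers, is to have `read.csv` turn blank fields into `NA` through its `na.strings` argument, so that no empty-string level is created in the first place. This is a minimal sketch on made-up data (`csv_text` and `fixed` are illustrative names); whether it avoids the failure on the attached dataset is an assumption that would need to be tested:

```r
# Toy stand-in for the attached CSV (hypothetical values, not the real data).
csv_text <- "age,job,marital,y
30,management,single,0
41,,married,1
55,retired,,0"

# Treat blank fields as NA, in addition to the literal string "NA".
fixed <- read.csv(
  text = csv_text,
  na.strings = c("", "NA"),
  stringsAsFactors = TRUE
)

# The empty-string level is gone; the gaps are now proper missing values.
print(levels(fixed$job))
print(colSums(is.na(fixed)))
```

In the script above this would amount to calling `read.csv(indataName, na.strings = c("", "NA"))` and leaving the rest unchanged.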