tidymodels / tune

Tools for tidy parameter tuning
https://tune.tidymodels.org

Nearly flawless code marred by failure to recognize tune_grid #429

Closed readyready15728 closed 3 years ago

readyready15728 commented 3 years ago

The problem

I'm using tidymodels and associated R packages on the SMS Spam Collection dataset from Kaggle. Specifically, I am using these packages to distinguish "ham" (legitimate, regular) SMS messages from their spam counterparts. The problem is a non-fatal but highly undesirable warning when evaluating performance on the test set. The message is:

Warning message:
No tuning parameters have been detected, performance will be evaluated using the resamples with no tuning. Did you want to [tune()] parameters?

The warning has persisted despite making sure to adhere to the tutorial material in Supervised Machine Learning for Text Analysis in R (chapter 7 specifically) as well as an article written by one Rebecca Barter. Both sources appear to be in accord. I have also followed the advice in an answer to the same problem on Stack Overflow, namely, running devtools::install_github('tidymodels/tune') and trying again. The warning still persisted.

Having said all of that, I was at least able to get figures for accuracy and AUC-ROC for the test set. Both are through the roof, over 0.99 each. I am blown away by what the R community has created.

Reproducible example

Hopefully the way I'm presenting it is viable. What I've done is create a branch of my repository, tidymodels-bug, that is "frozen" as it is right now so I can make further changes on master if I desire without affecting your response to how things are exactly now. You will find it here:

https://github.com/readyready15728/sms-spam/tree/tidymodels-bug

I believe I've put set.seed(42) wherever necessary, although I doubt the RNG is to blame. In lieu of a README.md, which I generally like to add when I believe a project is substantially complete, I will give you a brief description here. The original dataset, sms.csv, has only ~13% spam, so I wrote a script called balance.R to make them about 50/50 in a new file called sms-balanced.csv. learn.R then takes over, first performing cross validation with the training set, saving the fit to speed things up later (or alternatively loading an existing fit), then evaluating the performance on the test set in a similar fashion, also using the same saving mechanism.
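For context, the rebalancing step described above could be sketched as follows. This is a hypothetical reconstruction, not the actual balance.R from the repo; the column name class and the levels "ham"/"spam" are assumptions, and downsampling is only one way to reach an approximately 50/50 split.

```r
# Hypothetical sketch of a balance.R-style rebalancing step; the real script
# is in the linked repo. Assumes a `class` column with levels "ham"/"spam".
library(dplyr)
library(readr)

set.seed(42)

sms <- read_csv('sms.csv')

# Downsample the majority class ("ham") to the size of the minority class
n_spam <- sum(sms$class == 'spam')

sms_balanced <- sms %>%
  group_by(class) %>%
  slice_sample(n = n_spam) %>%
  ungroup()

write_csv(sms_balanced, 'sms-balanced.csv')
```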

The training set evaluation works fine. It's the test-set evaluation where the warning occurs. I checked both of my sources thoroughly to make sure I was doing things right, and I don't think I made an error. I can at least see accuracy and AUC-ROC, and they are highly satisfactory, almost perfect even, but I want the full set of metrics specified towards the very beginning, without the warning.

To assist further, I have the version of every library installed here as a CSV:

package,version
abind,1.4-5
AmesHousing,0.0.4
anytime,0.3.9
askpass,1.1
assertthat,0.2.1
backports,1.2.1
base64enc,0.1-3
BH,1.75.0-0
BiocManager,1.30.16
BiocVersion,3.13.1
bit,4.0.4
bit64,4.0.5
bitops,1.0-7
blob,1.2.2
brew,1.0-6
brio,1.1.2
broom,0.7.9
cachem,1.0.6
callr,3.7.0
car,3.0-11
carData,3.0-4
caret,6.0-89
caTools,1.18.2
cellranger,1.1.0
checkmate,2.0.0
chron,2.3-56
class,7.3-19
classInt,0.4-3
cli,3.1.0
clipr,0.7.1
coda,0.19-4
coefplot,1.2.7
colorspace,2.0-2
commonmark,1.7
compute.es,0.2-5
conflicted,1.0.4
conquer,1.0.2
corrplot,0.90
cpp11,0.4.1
crayon,1.4.2
credentials,1.3.1
crosstalk,1.1.1
curl,4.3.2
data.table,1.14.0
DBI,1.1.1
dbplyr,2.1.1
desc,1.3.0
devtools,2.4.2
dials,0.0.10
DiceDesign,1.9
diffobj,0.3.4
digest,0.6.28
discrim,0.1.3
distributional,0.2.2
DMwR,0.4.1
dplyr,1.0.7
drat,0.2.1
dslabs,0.7.4
dtplyr,1.1.0
dummies,1.5.6
dygraphs,1.1.1.6
e1071,1.7-9
ellipsis,0.3.2
evaluate,0.14
fable,0.3.1
fabletools,0.3.1
fansi,0.5.0
farver,2.1.0
fastmap,1.1.0
feasts,0.2.2
forcats,0.5.1
foreach,1.5.1
forecast,8.15
fpp3,0.4.0
fracdiff,1.5-1
fs,1.5.0
furrr,0.2.3
future,1.23.0
future.apply,1.8.1
gapminder,0.3.0
gargle,1.2.0
generics,0.1.1
gert,1.4.1
GGally,2.1.2
ggfortify,0.4.12
ggplot2,3.3.5
ggtext,0.1.1
ggtheme,0.1.0
ggthemes,4.2.4
gh,1.3.0
gitcreds,0.1.1
glmnet,4.1-2
globals,0.14.0
glue,1.5.0
gmp,0.6-2
goftest,1.2-2
googledrive,2.0.0
googlesheets4,1.0.0
gower,0.2.2
GPfit,1.0-8
gplots,3.1.1
gridExtra,2.3
gridtext,0.1.4
gtable,0.3.0
gtools,3.9.2
h2o,3.32.1.3
hardhat,0.1.6
haven,2.4.3
here,1.0.1
highr,0.9
hms,1.1.0
htmltools,0.5.2
htmlwidgets,1.5.4
httr,1.4.2
ids,1.0.1
imputeTS,3.2
infer,1.0.0
InformationValue,1.2.3
ini,0.3.1
inline,0.3.19
ipred,0.9-12
isoband,0.2.5
iterators,1.0.13
jpeg,0.1-9
jquerylib,0.1.4
jsonlite,1.7.2
kernlab,0.9-29
knitr,1.34
labeling,0.4.2
Lahman,9.0-0
later,1.3.0
lava,1.6.10
lazyeval,0.2.2
leaflet,2.0.4.1
leaflet.providers,1.9.0
lhs,1.1.3
lifecycle,1.0.1
listenv,0.8.0
lme4,1.1-27.1
lmtest,0.9-38
loo,2.4.1
lubridate,1.8.0
magrittr,2.0.1
mapproj,1.2.7
maps,3.3.0
maptools,1.1-2
markdown,1.1
MatrixModels,0.5-0
matrixStats,0.61.0
memoise,2.0.0
Metrics,0.1.4
mime,0.11
minqa,1.2.4
modeldata,0.1.1
ModelMetrics,1.2.2.2
modelr,0.1.8
munsell,0.5.0
naivebayes,0.9.7
nloptr,1.2.2.2
nortest,1.0-4
numbers,0.8-2
numDeriv,2016.8-1.1
nycflights13,1.0.2
olsrr,0.5.3
openssl,1.4.5
openxlsx,4.2.4
parallelly,1.28.1
parsedate,1.2.1
parsnip,0.1.7.9001
patchwork,1.1.1
pbkrtest,0.5.1
pillar,1.6.4
pkgbuild,1.2.0
pkgconfig,2.0.3
pkgload,1.2.2
plotly,4.9.4.1
plyr,1.8.6
png,0.1-7
praise,1.0.0
prettyunits,1.1.1
pROC,1.18.0
processx,3.5.2
prodlim,2019.11.13
progress,1.2.2
progressr,0.9.0
promises,1.2.0.1
proxy,0.4-26
ps,1.6.0
purrr,0.3.4
quadprog,1.5-8
quantmod,0.4.18
quantreg,5.86
R6,2.5.1
randomForest,4.6-14
ranger,0.13.1
rappdirs,0.3.3
raster,3.4-13
rattle,5.4.0
rcmdcheck,1.3.3
RColorBrewer,1.1-2
Rcpp,1.0.7
RcppArmadillo,0.10.6.0.0
RcppEigen,0.3.3.9.1
RcppParallel,5.1.4
RCurl,1.98-1.5
readr,2.0.1
readxl,1.3.1
recipes,0.1.17.9000
rematch,1.0.1
rematch2,2.1.2
remotes,2.4.0
reprex,2.0.1
reshape,0.8.8
reshape2,1.4.4
reticulate,1.22
rio,0.5.27
rjags,4-12
rlang,0.4.12
rmarkdown,2.11
Rmpfr,0.8-4
ROCR,1.0-11
ROSE,0.0-4
roxygen2,7.1.2
rpart.plot,3.1.0
rprojroot,2.0.2
rsample,0.1.1
rstudioapi,0.13
runjags,2.2.0-2
rversions,2.1.1
rvest,1.0.1
scales,1.1.1
selectr,0.4-2
sessioninfo,1.1.1
shape,1.4.6
slider,0.2.2
SnowballC,0.7.0
sp,1.4-5
SparseM,1.81
SQUAREM,2021.1
StanHeaders,2.21.0-7
stinepack,1.4
stringi,1.7.4
stringr,1.4.0
survival,3.2-13
sys,3.4
testthat,3.0.4
textrecipes,0.4.1
tibble,3.1.6
tidymodels,0.1.4
tidyr,1.1.4
tidyselect,1.1.1
tidyverse,1.3.1
timeDate,3043.102
tinytex,0.33
tokenizers,0.2.1
tseries,0.10-48
tsibble,1.0.1
tsibbledata,0.3.0
TTR,0.24.2
tune,0.1.6.9001
tzdb,0.1.2
units,0.7-2
urca,1.3-0
useful,1.2.6
usethis,2.0.1
utf8,1.2.2
uuid,0.1-4
V8,3.4.2
vctrs,0.3.8
vip,0.3.2
viridis,0.6.1
viridisLite,0.4.0
visdat,0.5.3
vroom,1.5.5
waldo,0.3.1
warp,0.2.0
whisker,0.4
withr,2.4.2
wk,0.5.0
workflows,0.2.4.9000
workflowsets,0.1.0
xfun,0.26
XML,3.99-0.8
xml2,1.3.2
xopen,1.0.0
xts,0.12.1
yaml,2.2.1
yardstick,0.0.8
zip,2.2.0
zoo,1.8-9
xgboost,1.5.0.1
base,4.1.0
boot,1.3-28
class,7.3-19
cluster,2.1.2
codetools,0.2-18
compiler,4.1.0
datasets,4.1.0
foreign,0.8-81
graphics,4.1.0
grDevices,4.1.0
grid,4.1.0
KernSmooth,2.23-20
lattice,0.20-44
MASS,7.3-54
Matrix,1.3-3
methods,4.1.0
mgcv,1.8-35
nlme,3.1-152
nnet,7.3-16
parallel,4.1.0
rpart,4.1-15
spatial,7.3-14
splines,4.1.0
stats,4.1.0
stats4,4.1.0
survival,3.2-11
tcltk,4.1.0
tools,4.1.0
utils,4.1.0

The code used to create the above printout may be useful in the future:

https://github.com/readyready15728/get-all-r-packages-and-versions
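A minimal way to produce a CSV like the one above is sketched below using only base R; the linked repo's actual script may work differently, and the output filename here is made up.

```r
# Dump installed package names and versions to CSV using base R only.
# installed.packages() returns a matrix with "Package" and "Version" columns.
pkgs <- as.data.frame(installed.packages()[, c('Package', 'Version')],
                      stringsAsFactors = FALSE)
names(pkgs) <- c('package', 'version')
write.csv(pkgs, 'package-versions.csv', row.names = FALSE)
```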

EmilHvitfeldt commented 3 years ago

Hello @readyready15728,

I'm gonna assume that your error comes from tune_grid() here https://github.com/readyready15728/sms-spam/blob/1781e89f4ed5051247ab85a0b05af78b7d892626/learn.R#L83-L89

You are getting a warning because you intended to tune penalty() and max_tokens() as noted in your final_grid. But you didn't specify that in your parsnip/recipe object. You need to use tune() as a placeholder for the values you are trying to tune.

The following code specifies max_tokens to be tuned. svm_rbf() doesn't have a penalty argument, so you can drop that from your final_grid.

sms_recipe <- recipe(class ~ text, data=sms_training) %>% 
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens=tune()) %>%
  step_tfidf(text)

# Create SVM specification 
svm_specification <- svm_rbf() %>%
  set_mode('classification') %>%
  set_engine('kernlab')

# Create new workflow for CV
svm_workflow <- workflow() %>%
  add_recipe(sms_recipe) %>%
  add_model(svm_specification)

readyready15728 commented 3 years ago

I want to be clear about where to insert the suggested code. I understand dropping penalty, but attempting to use tune() with step_tokenfilter before the training-set evaluation results in the following error:

[1] "Evaluating performance on training set:"
x Fold01: internal: Error: Can't subset columns that don't exist.
✖ Colu...
x Fold02: internal: Error: Can't subset columns that don't exist.
✖ Colu...
x Fold03: internal: Error: Can't subset columns that don't exist.
✖ Colu...
x Fold04: internal: Error: Can't subset columns that don't exist.
✖ Colu...
x Fold05: internal: Error: Can't subset columns that don't exist.
✖ Colu...
x Fold06: internal: Error: Can't subset columns that don't exist.
✖ Colu...
x Fold07: internal: Error: Can't subset columns that don't exist.
✖ Colu...
x Fold08: internal: Error: Can't subset columns that don't exist.
✖ Colu...
x Fold09: internal: Error: Can't subset columns that don't exist.
✖ Colu...
x Fold10: internal: Error: Can't subset columns that don't exist.
✖ Colu...

EmilHvitfeldt commented 3 years ago

Sorry about the confusion. Ideally you want to have it after you do fit_resamples() and before tune_grid(). Somewhere around line 82 looks good.
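Once the tune()-placeholder recipe is in place, the tuning call itself might look like the sketch below. The names sms_folds, sms_metrics, and the grid range are assumptions for illustration, not taken from learn.R.

```r
# Sketch of the tuning step that uses the updated workflow; `sms_folds`,
# `sms_metrics`, and the max_tokens range are hypothetical.
library(tune)
library(dials)

final_grid <- grid_regular(max_tokens(range = c(500, 2000)), levels = 4)

svm_tuning <- tune_grid(
  svm_workflow,          # workflow whose recipe contains tune() placeholders
  resamples = sms_folds, # e.g. vfold_cv(sms_training, v = 10)
  grid = final_grid,
  metrics = sms_metrics
)

# Inspect results and pick the best value of max_tokens
show_best(svm_tuning, metric = 'roc_auc')
```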

Once you get the hang of this, I would recommend you take a look at the themis package: https://themis.tidymodels.org/. This package contains recipe steps that help you deal with imbalanced data. This way you can do the adjustment inside the resampling instead of outside. There is a step_rose() as well.
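As a hedged sketch of that suggestion: rebalancing can move inside the recipe via a themis step, so it is applied within each resample rather than to the raw data up front. step_downsample() is used here for concreteness; step_rose() is an alternative.

```r
# Sketch: rebalance inside the recipe with themis instead of a separate
# balance.R, so each resample is balanced independently.
library(themis)

sms_recipe <- recipe(class ~ text, data = sms_training) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = tune()) %>%
  step_tfidf(text) %>%
  # Downsample the majority class within each resample;
  # step_rose(class) would do synthetic oversampling instead
  step_downsample(class)
```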

readyready15728 commented 3 years ago

Well, I tried implementing that strategy but ran into another cryptic error. I'm not sure if it was because I put the code into a function to adhere to the DRY principle, which would be very weird. In any case, I've realized I don't really need tuning anyway, so I'm going to close the issue and perhaps revisit it another day. Sorry, just got worn out.

github-actions[bot] commented 2 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.