Hello @readyready15728,

I'm gonna assume that your error comes from `tune_grid()` here: https://github.com/readyready15728/sms-spam/blob/1781e89f4ed5051247ab85a0b05af78b7d892626/learn.R#L83-L89

You are getting a warning because you intended to tune `penalty()` and `max_tokens()`, as noted in your `final_grid`, but you didn't specify that in your parsnip/recipe object. You need to use `tune()` as a placeholder for the values you are trying to tune.

The following code will specify `max_tokens` to be tuned. `svm_rbf()` doesn't have a `penalty` argument, so you can drop that from your `final_grid`.
```r
# Recipe with a tune() placeholder for max_tokens
sms_recipe <- recipe(class ~ text, data=sms_training) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens=tune()) %>%
  step_tfidf(text)

# Create SVM specification
svm_specification <- svm_rbf() %>%
  set_mode('classification') %>%
  set_engine('kernlab')

# Create new workflow for CV
svm_workflow <- workflow() %>%
  add_recipe(sms_recipe) %>%
  add_model(svm_specification)
```
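With that workflow in place, the tuning call itself might look like the following sketch. The fold object `sms_folds`, the grid range, and the result name are assumptions for illustration, not taken from the repo:

```r
# Assumed 10-fold CV on the training data; learn.R may construct this differently
sms_folds <- vfold_cv(sms_training, v = 10)

# Grid over max_tokens only; penalty is dropped since svm_rbf() has no such argument
final_grid <- grid_regular(max_tokens(range = c(500L, 5000L)), levels = 5)

svm_tune_results <- tune_grid(
  svm_workflow,
  resamples = sms_folds,
  grid = final_grid
)
```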
I want to be clear about where to insert the suggested code. I get throwing out `penalty`, but attempting to use `tune()` with `step_tokenfilter()` before the training set evaluation results in the following error:
[1] "Evaluating performance on training set:"
x Fold01: internal: Error: Can't subset columns that don't exist.
✖ Colu...
x Fold02: internal: Error: Can't subset columns that don't exist.
✖ Colu...
x Fold03: internal: Error: Can't subset columns that don't exist.
✖ Colu...
x Fold04: internal: Error: Can't subset columns that don't exist.
✖ Colu...
x Fold05: internal: Error: Can't subset columns that don't exist.
✖ Colu...
x Fold06: internal: Error: Can't subset columns that don't exist.
✖ Colu...
x Fold07: internal: Error: Can't subset columns that don't exist.
✖ Colu...
x Fold08: internal: Error: Can't subset columns that don't exist.
✖ Colu...
x Fold09: internal: Error: Can't subset columns that don't exist.
✖ Colu...
x Fold10: internal: Error: Can't subset columns that don't exist.
✖ Colu...
Sorry about the confusion. Ideally you want to have it after you do `fit_resamples()` and before `tune_grid()`. Somewhere around line 82 looks good.
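To spell out the ordering, here is a sketch reusing the names from the snippets above. The fixed `max_tokens=1000` baseline recipe is an assumption about how learn.R is structured; the key point is that `fit_resamples()` cannot handle `tune()` placeholders, so the training-set evaluation needs a fully specified recipe:

```r
# A fully specified recipe for the training-set evaluation;
# fit_resamples() errors when a workflow still contains tune()
baseline_recipe <- recipe(class ~ text, data=sms_training) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens=1000) %>%
  step_tfidf(text)

baseline_workflow <- workflow() %>%
  add_recipe(baseline_recipe) %>%
  add_model(svm_specification)

# 1. Training-set evaluation first (around line 82 of learn.R)
cv_results <- fit_resamples(baseline_workflow, resamples = sms_folds)

# 2. Then the tune() workflow goes through tune_grid()
svm_tune_results <- tune_grid(svm_workflow, resamples = sms_folds, grid = final_grid)
```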
Once you get the hang of this, I would recommend you take a look at the themis package: https://themis.tidymodels.org/. This package contains recipe steps that help you deal with imbalanced data. This way you can do the adjustment inside the resampling instead of outside. There is a `step_rose()` as well; see the sketch below.
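For instance, a minimal sketch of how `step_rose()` could slot into the recipe suggested above. It comes after `step_tfidf()` because ROSE needs numeric predictors:

```r
library(themis)

# Same recipe as above, with ROSE oversampling done inside each resample
sms_recipe <- recipe(class ~ text, data=sms_training) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens=tune()) %>%
  step_tfidf(text) %>%
  # step_rose() requires numeric predictors, hence its placement after step_tfidf()
  step_rose(class)
```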
Well, I tried implementing that strategy but ran into another cryptic error. I'm not sure if it was because I put the stuff into a function to adhere to the DRY principle, which would be very weird, but I've realized I don't really need tuning anyway so I'm going to close the issue and perhaps revisit it another day. Sorry, just got worn out.
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
The problem
I'm using `tidymodels` and associated R packages on the SMS Spam Collection dataset from Kaggle. Specifically, I am using these packages to distinguish "ham" (legit, regular) SMS messages from their spam counterparts. The problem is a non-fatal but highly undesirable error encountered when evaluating performance on the test set. The error message is:

The error has persisted despite making sure to adhere to the tutorial material in Supervised Machine Learning for Text Analysis in R (chapter 7 specifically), as well as an article written by one Rebecca Barter. Both sources appear to be in accord. I have also followed the advice in an answer to the same problem on Stack Overflow, namely running `devtools::install_github('tidymodels/tune')` and trying again. The error still persisted.

Having said all of that, I was at least able to get figures for accuracy and AUC-ROC for the test set. Both are through the roof, over 0.99 each. I am blown away by what the R community has created.
Reproducible example
Hopefully the way I'm presenting it is viable. What I've done is create a branch of my repository, `tidymodels-bug`, that is "frozen" as it is right now, so I can make further changes on `master` if I desire without affecting your response to how things are exactly now. You will find it here: https://github.com/readyready15728/sms-spam/tree/tidymodels-bug

I believe I've put `set.seed(42)` wherever necessary, although I doubt the RNG is to blame. In lieu of a `README.md`, which I generally like to add when I believe a project is substantially complete, I will give you a brief description here. The original dataset, `sms.csv`, has only ~13% spam, so I wrote a script called `balance.R` to make the classes about 50/50 in a new file called `sms-balanced.csv`. `learn.R` then takes over, first performing cross-validation with the training set, saving the fit to speed things up later (or alternatively loading an existing fit), then evaluating performance on the test set in a similar fashion, also using the same saving mechanism (a sketch of that pattern follows below).

The training set evaluation works fine. It's the testing evaluation where the error occurs. I made sure to check both of my sources thoroughly to make sure I was doing things right, and I don't think I made an error. I can at least see accuracy and AUC-ROC, and they are highly satisfactory, almost perfect even, but I want the full set of metrics specified toward the very beginning without the error.
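The save/load mechanism mentioned above might look something like this minimal sketch using base R serialization; the file name, object names, and exact structure are assumptions for illustration, not taken from `learn.R`:

```r
cv_fit_path <- 'cv-results.rds'  # hypothetical file name

if (file.exists(cv_fit_path)) {
  # Reuse a previously saved fit to speed things up
  cv_results <- readRDS(cv_fit_path)
} else {
  cv_results <- fit_resamples(svm_workflow, resamples = sms_folds)
  saveRDS(cv_results, cv_fit_path)
}
```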
To assist further, I have the version of every library installed here as a CSV:
The code used to create the above printout may be useful in the future:
https://github.com/readyready15728/get-all-r-packages-and-versions
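For reference, the same idea can be expressed in a few lines of base R (a sketch; the linked script may differ):

```r
# Collect every installed package with its version and write to CSV
pkgs <- as.data.frame(installed.packages()[, c('Package', 'Version')])
write.csv(pkgs, 'r-package-versions.csv', row.names=FALSE)
```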