topepo / caret

caret (Classification And Regression Training): an R package that contains miscellaneous functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

Compatibility of write_csv/read_csv vs write.csv/read.csv and perhaps broader tidyverse compatibility with caret learning_curve_dat #1333

Open BenJCQuah opened 1 year ago

BenJCQuah commented 1 year ago

Hi, Thanks for your great caret package.

I am new to ML code and to R in general. I would like to add a learning curve to part of my pipeline using learning_curve_dat().

I am finding that learning_curve_dat() seems to fail when I use tidyverse code to pipe or manipulate the data.

As an example, please compare write_csv/read_csv with write.csv/read.csv below.

The first code section, which uses write_csv/read_csv, doesn't work (at least for me),

but the second, which is identical apart from the read and write calls, does work with write.csv/read.csv.

I am also finding that other tidyverse code seems to fail in the same way (please let me know if additional examples are required).

Is this a known issue, or am I doing something else wrong?

Thanks!

Ben

library(caret)
library(pander)
library(pastecs)
library(catboost)
library(randomForest)
library(dplyr)
library(tidyverse)

USING write_csv/read_csv

set.seed(1412)
class_dat <- twoClassSim(1000)

write_csv(class_dat, "class_dat.csv")
class_data <- read_csv("class_dat.csv")
class_data$Class <- factor(as.character(class_data$Class))
levels(class_data$Class)

sapply(class_data, class)

set.seed(29510)
rf_data <- learning_curve_dat(dat = class_data,
                              outcome = "Class",
                              test_prop = 1/4,
                              ## train arguments
                              method = "rf",
                              metric = "Kappa",
                              trControl = trainControl(## 10-fold CV
                                                       method = "repeatedcv",
                                                       number = 10,
                                                       ## a single repeat
                                                       repeats = 1))

This is the error I get

Error in createDataPartition(dat[, outcome], p = 1 - test_prop, list = FALSE) : y must have at least 2 data points
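A likely explanation for this error, offered as a sketch rather than a confirmed diagnosis: read_csv() returns a tibble, and single-column subsetting with `[, "col"]` keeps a tibble as a one-column tibble instead of dropping it to a vector. The error message shows `dat[, outcome]` being passed to createDataPartition(), so with a tibble that argument would have length 1. The toy data below is hypothetical and only stands in for class_data. Converting with as.data.frame() before calling learning_curve_dat() is an assumed workaround, not something taken from the caret documentation.

```r
library(tibble)

# Hypothetical toy data standing in for the result of read_csv()
df  <- data.frame(Linear01 = 1:6,
                  Class = factor(rep(c("Class1", "Class2"), 3)))
tbl <- as_tibble(df)

# A base data frame drops a single selected column to a plain vector
y_df <- df[, "Class"]
is.factor(y_df)       # the outcome is a factor vector
length(y_df)          # one element per row

# A tibble keeps the selection as a one-column tibble, whose length
# is the number of columns rather than the number of rows
y_tbl <- tbl[, "Class"]
is.data.frame(y_tbl)
length(y_tbl)

# Assumed workaround: convert to a base data frame first, then pass
# the converted object as `dat` to learning_curve_dat()
fixed   <- as.data.frame(tbl)
y_fixed <- fixed[, "Class"]
length(y_fixed)
```

If this is the cause, adding `class_data <- as.data.frame(class_data)` after the read_csv() call should let the first snippet run unchanged.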

USING write.csv/read.csv

set.seed(1412)
class_dat <- twoClassSim(1000)

write.csv(class_dat, "class_dat.csv")
class_data <- read.csv("class_dat.csv")

class_data$Class <- factor(as.character(class_data$Class))
levels(class_data$Class)

sapply(class_data, class)

set.seed(510)
rf_data <- learning_curve_dat(dat = class_data,
                              outcome = "Class",
                              test_prop = 1/4,
                              ## train arguments
                              method = "rf",
                              metric = "Kappa",
                              trControl = trainControl(## 10-fold CV
                                                       method = "repeatedcv",
                                                       number = 10,
                                                       ## a single repeat
                                                       repeats = 1))

RUNS FINE
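One caveat when comparing the two round trips: write.csv() writes row names by default, and read.csv() then reads them back as an extra column named `X`. In the second snippet this would probably mean the row index is silently treated as a predictor, so the two data sets are not quite identical. The sketch below uses a small hypothetical data frame and temporary files to show the difference; passing `row.names = FALSE` is one way to avoid it.

```r
# Hypothetical toy data; column names are illustrative only
df <- data.frame(Linear01 = c(0.5, 1.5, 2.5),
                 Class = c("Class1", "Class2", "Class1"))

with_rn <- tempfile(fileext = ".csv")
no_rn   <- tempfile(fileext = ".csv")

# Default behaviour: row names are written as an unnamed first column
write.csv(df, with_rn)
cols_with_rn <- names(read.csv(with_rn))
cols_with_rn

# With row.names = FALSE only the original columns are written
write.csv(df, no_rn, row.names = FALSE)
cols_no_rn <- names(read.csv(no_rn))
cols_no_rn
```

Checking `names(class_data)` after read.csv() in the issue's second snippet would confirm whether an `X` column is present there.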