topepo / caret

caret (Classification And Regression Training): an R package that contains miscellaneous functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

Getting NAs error in train() with text data #1285

Open Nehagupta90 opened 2 years ago

Nehagupta90 commented 2 years ago

Hello everyone, I have data with a text feature (DESCRIPTION) and an output variable (TYPE, a factor with three levels). I preprocess the text feature and then run train(), but it gives me this error:

Something is wrong; all the Accuracy metric values are missing:
    logLoss         AUC          prAUC        Accuracy       Kappa
 Min.   : NA   Min.   : NA   Min.   : NA   Min.   : NA   Min.   : NA
 1st Qu.: NA   1st Qu.: NA   1st Qu.: NA   1st Qu.: NA   1st Qu.: NA
 Median : NA   Median : NA   Median : NA   Median : NA   Median : NA
 Mean   :NaN   Mean   :NaN   Mean   :NaN   Mean   :NaN   Mean   :NaN
 3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA
 Max.   : NA   Max.   : NA   Max.   : NA   Max.   : NA   Max.   : NA

library(dplyr)
library(caret)
library(tm)
library(text2vec)

d <- read.csv("SONAR_RULES.csv", stringsAsFactors = TRUE)
d$DESCRIPTION <- as.character(d$DESCRIPTION)

d <- d[, !names(d) %in% c("REMEDIATION_GAP_MULT", "REMEDIATION_FUNCTION", "REMEDIATION_BASE_EFFORT")]

########### preprocessing of train data start here

review_corpus <- VCorpus( VectorSource(train$DESCRIPTION))

review_corpus <- review_corpus %>%
  tm_map(content_transformer(tolower)) %>%      # lowercase
  tm_map(removeNumbers) %>%                     # remove numerical characters
  tm_map(removeWords, stopwords("english")) %>% # remove stopwords (and, the, am)
  tm_map(removePunctuation) %>%                 # remove punctuation marks
  tm_map(stemDocument) %>%                      # stem words (e.g. from walking to walk)
  tm_map(stripWhitespace)                       # strip double white space

train_dtm <- DocumentTermMatrix(review_corpus)

freq <- findFreqTerms(train_dtm, 30)
length(freq)

train_dtm <- train_dtm[, freq]
train_dtm$ncol

bernoulli_conv <- function(x) {
  x <- factor(ifelse(x > 0, 1, 0), levels = c(0, 1), labels = c("Absent", "Present"))
  return(x)
}

# convert the document-term matrix

train_x<- apply(train_dtm, 2, bernoulli_conv)

# create the target variable

train_label <- train$TYPE

cv.folds <- createMultiFolds(train$TYPE, k = 10, times = 3)

ctrl <- trainControl(method = "cv", number = 3, index = cv.folds, classProbs = TRUE, summaryFunction = multiClassSummary)

set.seed(30218)

m <- train(y = train_label, x = train_x, method = "rf", metric = "Accuracy",
           preProc = c("center", "scale", "nzv"),
           trControl = ctrl)

######### When I run train, it gives me error ########
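One thing that may be worth checking in the code above, shown here as a minimal sketch on a made-up toy matrix (`toy_counts` and `toy_x` are hypothetical names, not the real data): according to `?apply`, results are coerced with `as.vector` before the dimensions are set, so a function that returns a factor produces a character matrix rather than a matrix of factors. If `train_x` ends up as character in the same way, the "center" and "scale" steps would have nothing numeric to work with, which might be related to the missing metric values.

```r
# Toy stand-in for the document-term matrix (hypothetical data, not SONAR_RULES.csv)
toy_counts <- matrix(c(0, 2, 1, 0, 0, 3), nrow = 3,
                     dimnames = list(NULL, c("term_a", "term_b")))

# Same conversion function as in the post
bernoulli_conv <- function(x) {
  factor(ifelse(x > 0, 1, 0), levels = c(0, 1), labels = c("Absent", "Present"))
}

# apply() coerces the factor results to a character matrix
toy_x <- apply(toy_counts, 2, bernoulli_conv)

str(toy_x)           # inspect the type and dimensions of the result
is.numeric(toy_x)    # FALSE
is.character(toy_x)  # TRUE
```

Running `str(train_x)` on the real data would show whether the same coercion happens there.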

sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

Random number generation:
 RNG:     L'Ecuyer-CMRG
 Normal:  Inversion
 Sample:  Rejection

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base