Open trevb11 opened 12 months ago
hi when you post issues please use triple backticks to format code properly, which makes it easier to read.
```r
> test.accuracies.list <- list()
> for (prediction in all.predictions.list){
+ count_ones <- sum(all.predictions.list[[prediction]] == 1)
+ predicted.rate <- count_ones / nrow(all.predictions.list[[prediction]])
+ test.accuracy <- predicted.rate / true.rate
+ test.accuracies.list[[prediction]] <- test.accuracy
+
+ }
Error in all.predictions.list[[prediction]] :
attempt to select less than one element in get1index <real>
becomes
```r
> test.accuracies.list <- list()
> for (prediction in all.predictions.list){
+ count_ones <- sum(all.predictions.list[[prediction]] == 1)
+ predicted.rate <- count_ones / nrow(all.predictions.list[[prediction]])
+ test.accuracy <- predicted.rate / true.rate
+ test.accuracies.list[[prediction]] <- test.accuracy
+
+ }
Error in all.predictions.list[[prediction]] :
attempt to select less than one element in get1index <real>
the error message says "attempt to select less than one element in get1index seq_along(all.predictions.list)
?
why are you doing sum(all.predictions.list[[prediction]] == 1)
?
accuracy rate should be computed by mean(predictions == labels)
Did you read my blog? https://tdhock.github.io/blog/2023/comparing-ml-frameworks/
I'm reading the blog now it's helpful. Making the necessary changes.
hey @tdhock can you please help?
please re-read https://stackoverflow.com/help/minimal-reproducible-example
@EngineerDanny could use some help with computing the test.accuracies I need if you have the time (3 values for cv.glmnet folds) (3 for featureless folds)
Toby said to use
test.accuracy = mean(predictions == labels)
but I'm not sure what this means.
predictions should be a vector of predicted classes (one for each data point/row in the test set), and labels should be a vector of label/true class values. Please read https://tdhock.github.io/blog/2023/comparing-ml-frameworks/ in particular the relevant part for computing accuracy is below,
pred.list <- list(
cv_glmnet=factor(predict(
glmnet.model, one.data.split$test$input.mat, type="class")),
featureless=rep(most.freq.label, one.data.split$test$set.obs),
"1nn"=class::knn(
one.data.split$train$input.mat,
one.data.split$test$input.mat,
one.data.split$train$output.vec))
for(algorithm in names(pred.list)){
pred.vec <- pred.list[[algorithm]]
is.correct <- pred.vec == one.data.split$test$output.vec
accuracy.percent <- 100*mean(is.correct)
accuracy.dt.list[[paste(
data.name, test.fold, algorithm
)]] <- data.table(
data.name, test.fold, algorithm, accuracy.percent)
}
@tdhock Taking another try at a minimal reproducible example. Everything is pretty much working, but I am struggling to generate the featureless vectors for each fold in 1:n.folds. I think you need the whole code to run it... The error I'm having is in the block
featureless.list <- list()
for(fold in 1:n.folds){
featureless <- rep(most.frequent.class, test.set$Fold == fold)
featureless.list[[fold]] <- featureless
}
I don't believe the test.set$Fold == fold is correct. I will come by office hours tomorrow morning as well. Whole script...
setwd("~/Users/jared/ASD_CSV_FILES")
library(MASS)
library(stats)
library(data.table)
library(readxl)
library(dplyr)
file.list <- list()
list_of_file_names <- c("2016_17_NSCH_Topical_Implicate_CAHMI_DRCv2.csv", "2018-2019_NSCH_Topical_CAHMI_DRCv2.csv")
for(name in list_of_file_names){
file_path <- file.path(getwd(), name)
file.list[[name]] <-data.table::fread(file_path)
}
#make a data frame of codes and their category
COMORBIDITY <- c("K2Q34A", "K2Q34B", "K2Q36A", "K2Q36B", "K2Q60A", "K2Q60B", "K2Q37A", "K2Q37B",
"K2Q30A", "K2Q30B", "K2Q31A", "K2Q31B")
RESIDENCE <- c("LIVEUSA_MO", "K11Q43R")
DEMOGRAPHIC <- c("SC_RACER", "SC_HISPANIC_R", "HOUSE_GEN", "A1_GRADE", "A2_GRADE", "HHLANGUAGE", "SC_AGE_YEARS", "SC_SEX")
outputs <- c("K2Q35A", "K2Q35B")
codes <- c(COMORBIDITY, RESIDENCE, DEMOGRAPHIC, outputs)
descriptions <- c("Behavior Problems", "Behavior Problems Currently", "Developmental Delay", "Developmental Delay Currently",
"Intellectual Disability", "Intellectual Disability Currently",
"Speech Disorder", "Speech Disorder Currently", "Learning Disability", "Learning Disability Currently",
"ADD/ADHD", "ADD/ADHD Currently",
"How Long Living in the United States - Years", "How Many Times Moved to New Address", "Race of Selected Child, Recode",
"Hispanic Origin of Selected Child, Recode", "Parental Nativity", "Adult 1 - Highest Completed Year of School", "Adult 2 - Highest Completed Year of School",
"Primary HH Language", "Age of Seleced Child - In Years", "Sex of Selected Child", "Autism ASD", "Autism ASD Currently")
code_dictionary <- setNames(descriptions, codes)
categories <- rep(c("COMORBIDITY", "RESIDENCE", "DEMOGRAPHIC"), c(length(COMORBIDITY), length(RESIDENCE), length(DEMOGRAPHIC)))
#creating learning.codes table
learning.codes <- data.table()
for(file_name in names(file.list)){
subset_data <- file.list[[file_name]][, ..codes]
new_data <- data.table(matrix(NA, nrow = nrow(file.list[[file_name]]), ncol = length(codes)))
setnames(new_data, codes)
new_data[, (codes) := subset_data]
learning.codes <- rbindlist(list(learning.codes, new_data), fill = TRUE)
}
#one-hot columns
one_hotted_data <- copy(learning.codes)
variable.list <- list()
#subset if don't want to one-hot certain columns
columns.to.encode <- codes[!codes %in% "SC_AGE_YEARS"]
for(col.name in columns.to.encode) {
unique_values <- unique(one_hotted_data[[col.name]])
for(value in unique_values) {
if(!is.na(value)) {
new.col.name <- paste(col.name, value, sep = "_")
variable.list[[new.col.name]] <- new.col.name
one_hotted_data[, (new.col.name) := as.integer(one_hotted_data[[col.name]] == value)]
}
}
}
#add a year column
one_hotted_data[, Year := ifelse(.I <= nrow(file.list[[1]]), "2016 - 2017", "2018-2019")]
#assigning folds
OneFold <- function(data, k, test_fold) {
n_rows <- nrow(data)
set.seed(123)
data[, Fold := sample(rep(1:k, l = .N))]
data[, Set := ifelse(Fold == test_fold, "test", "train")]
return(data)
}
# Example usage
install.packages('glmnet')
library(glmnet)
n.folds <- 3
main.autism.output.column.name <- "K2Q35A_1"
autism.outputs.list <- c("K2Q35A_2", "K2Q35A_1", "K2Q35A_99", "K2Q35B_95", "K2Q35B_1", "K2Q35B_99", "K2Q35B_2")
input.columns <- one_hotted_data[, !c(codes, autism.outputs.list, "Year", "Fold", "Set"), with = FALSE]
output.column <- one_hotted_data[["K2Q35A_1"]]
#How do I get a featureless model
glm.model.list <- list()
for(test_fold in 1:n.folds){
folded_data <- OneFold(one_hotted_data, n.folds, test_fold)
train.set <- folded_data[i = Set == "train"]
test.set <- folded_data[i = Set == "test"]
train.inputs <- train.set[, names(input.columns), with = FALSE]
test.inputs <- test.set[,names(input.columns), with = FALSE]
test.output <- test.set[, "K2Q35A_1", with = FALSE]
train.output <- train.set[, "K2Q35A_1", with = FALSE]
train.double.input <- as.matrix(train.inputs)
train.double.output <- as.matrix(train.output)
test.double.input <- as.matrix(test.inputs)
test.double.output <- as.matrix(test.o)
my.model.for.fold.k <- cv.glmnet(train.double.input, train.double.output, family = "multinomial")
glm.model.list[[test_fold]] <- my.model.for.fold.k
}
#predictions
library(glmnet)
n.algorithms <- 2
n.predictions <- n.folds * n.algorithms
glm.predictions.list <- list()
for(fold in 1:n.folds){
glm.prediction <- predict(glm.model.list[[fold]], test.double.input, type = "class")
glm.predictions.list[[fold]] <- glm.prediction
}
#Featureless Data
most.frequent.class <- train.output[, .(count=.N), by = main.autism.output.column.name][order(-count)][[main.autism.output.column.name]][1]
featureless.list <- list()
for(fold in 1:n.folds){
featureless <- rep(most.frequent.class, test.set$Fold == fold)
featureless.list[[fold]] <- featureless
}
pred.list <- list(glm.predictions.list, featureless.list)
#compute test accuracies
algorithms.list <- c("cv.glmnet", "featureless")
accuracy.dt.list <- list()
for(fold in 1:n.folds){
for(algorithm in algorithms.list){
pred.vec <- pred.list[[algorithm]]
is.correct <- pred.vec == test.double.output
accuracy.percent <- 100*mean(is.correct)
accuracy.dt.list[[paste(fold, algorithm
)]] <- data.table(fold, algorithm, accuracy.percent)
}
}
accuracy.dt <- data.table::rbindlist(accuracy.dt.list)
data.name <- zip
library(ggplot2)
ggplot()+
geom_point(aes(
accuracy.percent, algorithm),
shape=1,
data=accuracy.dt)+
facet_grid(. ~ data.name, labeller=label_both, scales="free")
What error/output are you getting?
The following is not a minimal reproducible example, because n.folds
and most.frequent.class
are not defined.
featureless.list <- list()
for(fold in 1:n.folds){
featureless <- rep(most.frequent.class, test.set$Fold == fold)
featureless.list[[fold]] <- featureless
}
You can tell if it is a MRE by restarting your R (or using reprex::reprex
), running the MRE, and if it gives the same error/output/issue are your larger code, then it is an MRE. If it gives different error/output/issue, then it is not useful (you need to provide more details).
putting the whole script is not minimal (there are lots of parts of the code which are not relevant to the problem you are observing)
What is the difference between labels and featureless then?I understand labels are the true class values but what is the featureless data then? Just a vector entirely composed of the major class? All zeros?Thanks,Trevor.On Dec 6, 2023, at 1:38 PM, Toby Dylan Hocking @.**> wrote: predictions should be a vector of predicted classes (one for each data point/row in the test set), and labels should be a vector of label/true class values. Please read https://tdhock.github.io/blog/2023/comparing-ml-frameworks/ in particular the relevant part for computing accuracy is below, pred.list <- list( cv_glmnet=factor(predict( glmnet.model, one.data.split$test$input.mat, type="class")), featureless=rep(most.freq.label, one.data.split$test$set.obs), "1nn"=class::knn( one.data.split$train$input.mat, one.data.split$test$input.mat, one.data.split$train$output.vec)) for(algorithm in names(pred.list)){ pred.vec <- pred.list[[algorithm]] is.correct <- pred.vec == one.data.split$test$output.vec accuracy.percent <- 100mean(is.correct) accuracy.dt.list[[paste( data.name, test.fold, algorithm )]] <- data.table( data.name, test.fold, algorithm, accuracy.percent) }
—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you authored the thread.Message ID: @.***>
featureless is a model which always predicts the most frequent class in train set labels.
Intended to fill test.accuracies.list with 6 test accuracy calculations to then make a datatable and plot.