trevb11 / ASD3-Coding

Commentary on ASD3 Coding Related Project
1 stars 2 forks source link

Test Accuracies #3

Open trevb11 opened 9 months ago

trevb11 commented 9 months ago
> test.accuracies.list <- list()
> for (prediction in all.predictions.list){
+   count_ones <- sum(all.predictions.list[[prediction]] == 1)
+   predicted.rate <- count_ones / nrow(all.predictions.list[[prediction]])
+   test.accuracy <- predicted.rate / true.rate
+   test.accuracies.list[[prediction]] <- test.accuracy
+ 
+ }
Error in all.predictions.list[[prediction]] : 
  attempt to select less than one element in get1index <real>

Intended to fill test.accuracies.list with 6 test accuracy calculations to then make a datatable and plot.

tdhock commented 9 months ago

hi when you post issues please use triple backticks to format code properly, which makes it easier to read.

```r
> test.accuracies.list <- list()
> for (prediction in all.predictions.list){
+   count_ones <- sum(all.predictions.list[[prediction]] == 1)
+   predicted.rate <- count_ones / nrow(all.predictions.list[[prediction]])
+   test.accuracy <- predicted.rate / true.rate
+   test.accuracies.list[[prediction]] <- test.accuracy
+ 
+ }
Error in all.predictions.list[[prediction]] : 
  attempt to select less than one element in get1index <real>

becomes 
```r
> test.accuracies.list <- list()
> for (prediction in all.predictions.list){
+   count_ones <- sum(all.predictions.list[[prediction]] == 1)
+   predicted.rate <- count_ones / nrow(all.predictions.list[[prediction]])
+   test.accuracy <- predicted.rate / true.rate
+   test.accuracies.list[[prediction]] <- test.accuracy
+ 
+ }
Error in all.predictions.list[[prediction]] : 
  attempt to select less than one element in get1index <real>
tdhock commented 9 months ago

the error message says "attempt to select less than one element in get1index " which seems to indicate that all.predictions.list is a vector of real numbers, and prediction is length zero? did you mean to do seq_along(all.predictions.list)? why are you doing sum(all.predictions.list[[prediction]] == 1) ? accuracy rate should be computed by mean(predictions == labels) Did you read my blog? https://tdhock.github.io/blog/2023/comparing-ml-frameworks/

trevb11 commented 9 months ago

I'm reading the blog now it's helpful. Making the necessary changes.

tdhock commented 9 months ago

hey @tdhock can you please help?

tdhock commented 9 months ago

please re-read https://stackoverflow.com/help/minimal-reproducible-example

trevb11 commented 9 months ago

@EngineerDanny could use some help with computing the test.accuracies I need if you have the time (3 values for cv.glmnet folds) (3 for featureless folds)

Toby said to use

 test.accuracy = mean(predictions == labels)

but I'm not sure what this means.

tdhock commented 9 months ago

predictions should be a vector of predicted classes (one for each data point/row in the test set), and labels should be a vector of label/true class values. Please read https://tdhock.github.io/blog/2023/comparing-ml-frameworks/ in particular the relevant part for computing accuracy is below,

   pred.list <- list(
      cv_glmnet=factor(predict(
        glmnet.model, one.data.split$test$input.mat, type="class")),
      featureless=rep(most.freq.label, one.data.split$test$set.obs),
      "1nn"=class::knn(
        one.data.split$train$input.mat,
        one.data.split$test$input.mat,
        one.data.split$train$output.vec))
    for(algorithm in names(pred.list)){
      pred.vec <- pred.list[[algorithm]]
      is.correct <- pred.vec == one.data.split$test$output.vec
      accuracy.percent <- 100*mean(is.correct)
      accuracy.dt.list[[paste(
        data.name, test.fold, algorithm
      )]] <- data.table(
        data.name, test.fold, algorithm, accuracy.percent)
    }
trevb11 commented 9 months ago

@tdhock Taking another try at a minimal reproducible example. Everything is pretty much working, but I am struggling to generate the featureless vectors for each fold in 1:n.folds. I think you need the whole code to run it... The error I'm having is in the block

featureless.list <- list()
for(fold in 1:n.folds){
  featureless <- rep(most.frequent.class, test.set$Fold == fold)
  featureless.list[[fold]] <- featureless
}

I don't believe the test.set$Fold == fold is correct. I will come by office hours tomorrow morning as well. Whole script...

setwd("~/Users/jared/ASD_CSV_FILES")
library(MASS)
library(stats)
library(data.table)
library(readxl)
library(dplyr)

file.list <- list()
list_of_file_names <- c("2016_17_NSCH_Topical_Implicate_CAHMI_DRCv2.csv", "2018-2019_NSCH_Topical_CAHMI_DRCv2.csv")
for(name in list_of_file_names){
    file_path <- file.path(getwd(), name)
    file.list[[name]] <-data.table::fread(file_path)
}

#make a data frame of codes and their category
COMORBIDITY <- c("K2Q34A", "K2Q34B", "K2Q36A", "K2Q36B", "K2Q60A", "K2Q60B", "K2Q37A", "K2Q37B",
                 "K2Q30A", "K2Q30B", "K2Q31A", "K2Q31B")

RESIDENCE <- c("LIVEUSA_MO", "K11Q43R")

DEMOGRAPHIC <- c("SC_RACER", "SC_HISPANIC_R", "HOUSE_GEN", "A1_GRADE", "A2_GRADE", "HHLANGUAGE", "SC_AGE_YEARS", "SC_SEX")

outputs <- c("K2Q35A", "K2Q35B")

codes <- c(COMORBIDITY, RESIDENCE, DEMOGRAPHIC, outputs)

descriptions <- c("Behavior Problems", "Behavior Problems Currently", "Developmental Delay", "Developmental Delay Currently",
                    "Intellectual Disability", "Intellectual Disability Currently",
                  "Speech Disorder", "Speech Disorder Currently", "Learning Disability", "Learning Disability Currently",
                  "ADD/ADHD", "ADD/ADHD Currently",
                  "How Long Living in the United States - Years", "How Many Times Moved to New Address", "Race of Selected Child, Recode",
                  "Hispanic Origin of Selected Child, Recode", "Parental Nativity", "Adult 1 - Highest Completed Year of School", "Adult 2 - Highest Completed Year of School",
                  "Primary HH Language", "Age of Seleced Child - In Years", "Sex of Selected Child", "Autism ASD", "Autism ASD Currently")

code_dictionary <- setNames(descriptions, codes)
categories <- rep(c("COMORBIDITY", "RESIDENCE", "DEMOGRAPHIC"), c(length(COMORBIDITY), length(RESIDENCE), length(DEMOGRAPHIC)))

#creating learning.codes table
learning.codes <- data.table()
for(file_name in names(file.list)){
  subset_data <- file.list[[file_name]][, ..codes]

  new_data <- data.table(matrix(NA, nrow = nrow(file.list[[file_name]]), ncol = length(codes)))
                         setnames(new_data, codes)

  new_data[, (codes) := subset_data]

  learning.codes <- rbindlist(list(learning.codes, new_data), fill = TRUE)

}

#one-hot columns
one_hotted_data <- copy(learning.codes)
variable.list <- list()

#subset if don't want to one-hot certain columns

columns.to.encode <- codes[!codes %in% "SC_AGE_YEARS"]
for(col.name in columns.to.encode) {

  unique_values <- unique(one_hotted_data[[col.name]])

  for(value in unique_values) {

    if(!is.na(value)) {

      new.col.name <- paste(col.name, value, sep = "_")

      variable.list[[new.col.name]] <- new.col.name

      one_hotted_data[, (new.col.name) := as.integer(one_hotted_data[[col.name]] == value)]
    }
  }
}

#add a year column

one_hotted_data[, Year := ifelse(.I <= nrow(file.list[[1]]), "2016 - 2017", "2018-2019")]

#assigning folds

OneFold <- function(data, k, test_fold) {
  n_rows <- nrow(data)

  set.seed(123)

  data[, Fold := sample(rep(1:k, l = .N))]

  data[, Set := ifelse(Fold == test_fold, "test", "train")]

  return(data)

}
# Example usage
install.packages('glmnet')
library(glmnet)

n.folds <- 3
main.autism.output.column.name <- "K2Q35A_1"
autism.outputs.list <- c("K2Q35A_2", "K2Q35A_1", "K2Q35A_99", "K2Q35B_95", "K2Q35B_1", "K2Q35B_99", "K2Q35B_2")
input.columns <- one_hotted_data[, !c(codes, autism.outputs.list, "Year", "Fold", "Set"), with = FALSE]
output.column <- one_hotted_data[["K2Q35A_1"]]

#How do I get a featureless model
glm.model.list <- list()
for(test_fold in 1:n.folds){
  folded_data <- OneFold(one_hotted_data, n.folds, test_fold)
  train.set <- folded_data[i = Set == "train"]
  test.set <- folded_data[i = Set == "test"]
  train.inputs <- train.set[, names(input.columns), with = FALSE]
  test.inputs <- test.set[,names(input.columns), with = FALSE]
  test.output <- test.set[, "K2Q35A_1", with = FALSE]
  train.output <- train.set[, "K2Q35A_1", with = FALSE]
  train.double.input <- as.matrix(train.inputs)
  train.double.output <- as.matrix(train.output)
  test.double.input <- as.matrix(test.inputs)
  test.double.output <- as.matrix(test.o)
  my.model.for.fold.k <- cv.glmnet(train.double.input, train.double.output, family = "multinomial")
  glm.model.list[[test_fold]] <- my.model.for.fold.k
}

#predictions
library(glmnet)
n.algorithms <- 2
n.predictions <- n.folds * n.algorithms
glm.predictions.list <- list()
for(fold in 1:n.folds){
  glm.prediction <- predict(glm.model.list[[fold]], test.double.input, type = "class")
  glm.predictions.list[[fold]] <- glm.prediction
}

#Featureless Data
most.frequent.class <- train.output[, .(count=.N), by = main.autism.output.column.name][order(-count)][[main.autism.output.column.name]][1]
featureless.list <- list()
for(fold in 1:n.folds){
  featureless <- rep(most.frequent.class, test.set$Fold == fold)
  featureless.list[[fold]] <- featureless
}

pred.list <- list(glm.predictions.list, featureless.list)

#compute test accuracies
algorithms.list <- c("cv.glmnet", "featureless")
accuracy.dt.list <- list()
for(fold in 1:n.folds){
  for(algorithm in algorithms.list){
    pred.vec <- pred.list[[algorithm]]
    is.correct <- pred.vec == test.double.output
    accuracy.percent <- 100*mean(is.correct)
    accuracy.dt.list[[paste(fold, algorithm
    )]] <- data.table(fold, algorithm, accuracy.percent)
  }
}

accuracy.dt <- data.table::rbindlist(accuracy.dt.list)
data.name <- zip
library(ggplot2)
ggplot()+
  geom_point(aes(
    accuracy.percent, algorithm),
    shape=1,
    data=accuracy.dt)+
  facet_grid(. ~ data.name, labeller=label_both, scales="free")
tdhock commented 9 months ago

What error/output are you getting? The following is not a minimal reproducible example, because n.folds and most.frequent.class are not defined.

featureless.list <- list()
for(fold in 1:n.folds){
  featureless <- rep(most.frequent.class, test.set$Fold == fold)
  featureless.list[[fold]] <- featureless
}

You can tell if it is a MRE by restarting your R (or using reprex::reprex), running the MRE, and if it gives the same error/output/issue are your larger code, then it is an MRE. If it gives different error/output/issue, then it is not useful (you need to provide more details).

tdhock commented 9 months ago

putting the whole script is not minimal (there are lots of parts of the code which are not relevant to the problem you are observing)

trevb11 commented 9 months ago

What is the difference between labels and featureless then?I understand labels are the true class values but what is the featureless data then? Just a vector entirely composed of the major class? All zeros?Thanks,Trevor.On Dec 6, 2023, at 1:38 PM, Toby Dylan Hocking @.**> wrote: predictions should be a vector of predicted classes (one for each data point/row in the test set), and labels should be a vector of label/true class values. Please read https://tdhock.github.io/blog/2023/comparing-ml-frameworks/ in particular the relevant part for computing accuracy is below, pred.list <- list( cv_glmnet=factor(predict( glmnet.model, one.data.split$test$input.mat, type="class")), featureless=rep(most.freq.label, one.data.split$test$set.obs), "1nn"=class::knn( one.data.split$train$input.mat, one.data.split$test$input.mat, one.data.split$train$output.vec)) for(algorithm in names(pred.list)){ pred.vec <- pred.list[[algorithm]] is.correct <- pred.vec == one.data.split$test$output.vec accuracy.percent <- 100mean(is.correct) accuracy.dt.list[[paste( data.name, test.fold, algorithm )]] <- data.table( data.name, test.fold, algorithm, accuracy.percent) }

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you authored the thread.Message ID: @.***>

tdhock commented 9 months ago

featureless is a model which always predicts the most frequent class in train set labels.