trevb11 / ASD3-Coding

Commentary on ASD3 Coding Related Project
1 stars 2 forks source link

Test Accuracies #3

Open trevb11 opened 12 months ago

trevb11 commented 12 months ago
> test.accuracies.list <- list()
> for (prediction in all.predictions.list){
+   count_ones <- sum(all.predictions.list[[prediction]] == 1)
+   predicted.rate <- count_ones / nrow(all.predictions.list[[prediction]])
+   test.accuracy <- predicted.rate / true.rate
+   test.accuracies.list[[prediction]] <- test.accuracy
+ 
+ }
Error in all.predictions.list[[prediction]] : 
  attempt to select less than one element in get1index <real>

Intended to fill test.accuracies.list with 6 test accuracy calculations to then make a datatable and plot.

tdhock commented 12 months ago

hi when you post issues please use triple backticks to format code properly, which makes it easier to read.

```r
> test.accuracies.list <- list()
> for (prediction in all.predictions.list){
+   count_ones <- sum(all.predictions.list[[prediction]] == 1)
+   predicted.rate <- count_ones / nrow(all.predictions.list[[prediction]])
+   test.accuracy <- predicted.rate / true.rate
+   test.accuracies.list[[prediction]] <- test.accuracy
+ 
+ }
Error in all.predictions.list[[prediction]] : 
  attempt to select less than one element in get1index <real>

becomes 
```r
> test.accuracies.list <- list()
> for (prediction in all.predictions.list){
+   count_ones <- sum(all.predictions.list[[prediction]] == 1)
+   predicted.rate <- count_ones / nrow(all.predictions.list[[prediction]])
+   test.accuracy <- predicted.rate / true.rate
+   test.accuracies.list[[prediction]] <- test.accuracy
+ 
+ }
Error in all.predictions.list[[prediction]] : 
  attempt to select less than one element in get1index <real>
tdhock commented 12 months ago

the error message says "attempt to select less than one element in get1index " which seems to indicate that all.predictions.list is a vector of real numbers, and prediction is length zero? did you mean to do seq_along(all.predictions.list)? why are you doing sum(all.predictions.list[[prediction]] == 1) ? accuracy rate should be computed by mean(predictions == labels) Did you read my blog? https://tdhock.github.io/blog/2023/comparing-ml-frameworks/

trevb11 commented 12 months ago

I'm reading the blog now it's helpful. Making the necessary changes.

tdhock commented 12 months ago

hey @tdhock can you please help?

tdhock commented 12 months ago

please re-read https://stackoverflow.com/help/minimal-reproducible-example

trevb11 commented 12 months ago

@EngineerDanny could use some help with computing the test.accuracies I need if you have the time (3 values for cv.glmnet folds) (3 for featureless folds)

Toby said to use

 test.accuracy = mean(predictions == labels)

but I'm not sure what this means.

tdhock commented 12 months ago

predictions should be a vector of predicted classes (one for each data point/row in the test set), and labels should be a vector of label/true class values. Please read https://tdhock.github.io/blog/2023/comparing-ml-frameworks/ in particular the relevant part for computing accuracy is below,

   pred.list <- list(
      cv_glmnet=factor(predict(
        glmnet.model, one.data.split$test$input.mat, type="class")),
      featureless=rep(most.freq.label, one.data.split$test$set.obs),
      "1nn"=class::knn(
        one.data.split$train$input.mat,
        one.data.split$test$input.mat,
        one.data.split$train$output.vec))
    for(algorithm in names(pred.list)){
      pred.vec <- pred.list[[algorithm]]
      is.correct <- pred.vec == one.data.split$test$output.vec
      accuracy.percent <- 100*mean(is.correct)
      accuracy.dt.list[[paste(
        data.name, test.fold, algorithm
      )]] <- data.table(
        data.name, test.fold, algorithm, accuracy.percent)
    }
trevb11 commented 12 months ago

@tdhock Taking another try at a minimal reproducible example. Everything is pretty much working, but I am struggling to generate the featureless vectors for each fold in 1:n.folds. I think you need the whole code to run it... The error I'm having is in the block

featureless.list <- list()
for(fold in 1:n.folds){
  featureless <- rep(most.frequent.class, test.set$Fold == fold)
  featureless.list[[fold]] <- featureless
}

I don't believe the test.set$Fold == fold is correct. I will come by office hours tomorrow morning as well. Whole script...

setwd("~/Users/jared/ASD_CSV_FILES")
library(MASS)
library(stats)
library(data.table)
library(readxl)
library(dplyr)

file.list <- list()
list_of_file_names <- c("2016_17_NSCH_Topical_Implicate_CAHMI_DRCv2.csv", "2018-2019_NSCH_Topical_CAHMI_DRCv2.csv")
for(name in list_of_file_names){
    file_path <- file.path(getwd(), name)
    file.list[[name]] <-data.table::fread(file_path)
}

#make a data frame of codes and their category
COMORBIDITY <- c("K2Q34A", "K2Q34B", "K2Q36A", "K2Q36B", "K2Q60A", "K2Q60B", "K2Q37A", "K2Q37B",
                 "K2Q30A", "K2Q30B", "K2Q31A", "K2Q31B")

RESIDENCE <- c("LIVEUSA_MO", "K11Q43R")

DEMOGRAPHIC <- c("SC_RACER", "SC_HISPANIC_R", "HOUSE_GEN", "A1_GRADE", "A2_GRADE", "HHLANGUAGE", "SC_AGE_YEARS", "SC_SEX")

outputs <- c("K2Q35A", "K2Q35B")

codes <- c(COMORBIDITY, RESIDENCE, DEMOGRAPHIC, outputs)

descriptions <- c("Behavior Problems", "Behavior Problems Currently", "Developmental Delay", "Developmental Delay Currently",
                    "Intellectual Disability", "Intellectual Disability Currently",
                  "Speech Disorder", "Speech Disorder Currently", "Learning Disability", "Learning Disability Currently",
                  "ADD/ADHD", "ADD/ADHD Currently",
                  "How Long Living in the United States - Years", "How Many Times Moved to New Address", "Race of Selected Child, Recode",
                  "Hispanic Origin of Selected Child, Recode", "Parental Nativity", "Adult 1 - Highest Completed Year of School", "Adult 2 - Highest Completed Year of School",
                  "Primary HH Language", "Age of Seleced Child - In Years", "Sex of Selected Child", "Autism ASD", "Autism ASD Currently")

code_dictionary <- setNames(descriptions, codes)
categories <- rep(c("COMORBIDITY", "RESIDENCE", "DEMOGRAPHIC"), c(length(COMORBIDITY), length(RESIDENCE), length(DEMOGRAPHIC)))

#creating learning.codes table
learning.codes <- data.table()
for(file_name in names(file.list)){
  subset_data <- file.list[[file_name]][, ..codes]

  new_data <- data.table(matrix(NA, nrow = nrow(file.list[[file_name]]), ncol = length(codes)))
                         setnames(new_data, codes)

  new_data[, (codes) := subset_data]

  learning.codes <- rbindlist(list(learning.codes, new_data), fill = TRUE)

}

#one-hot columns
one_hotted_data <- copy(learning.codes)
variable.list <- list()

#subset if don't want to one-hot certain columns

columns.to.encode <- codes[!codes %in% "SC_AGE_YEARS"]
for(col.name in columns.to.encode) {

  unique_values <- unique(one_hotted_data[[col.name]])

  for(value in unique_values) {

    if(!is.na(value)) {

      new.col.name <- paste(col.name, value, sep = "_")

      variable.list[[new.col.name]] <- new.col.name

      one_hotted_data[, (new.col.name) := as.integer(one_hotted_data[[col.name]] == value)]
    }
  }
}

#add a year column

one_hotted_data[, Year := ifelse(.I <= nrow(file.list[[1]]), "2016 - 2017", "2018-2019")]

#assigning folds

OneFold <- function(data, k, test_fold) {
  n_rows <- nrow(data)

  set.seed(123)

  data[, Fold := sample(rep(1:k, l = .N))]

  data[, Set := ifelse(Fold == test_fold, "test", "train")]

  return(data)

}
# Example usage
install.packages('glmnet')
library(glmnet)

n.folds <- 3
main.autism.output.column.name <- "K2Q35A_1"
autism.outputs.list <- c("K2Q35A_2", "K2Q35A_1", "K2Q35A_99", "K2Q35B_95", "K2Q35B_1", "K2Q35B_99", "K2Q35B_2")
input.columns <- one_hotted_data[, !c(codes, autism.outputs.list, "Year", "Fold", "Set"), with = FALSE]
output.column <- one_hotted_data[["K2Q35A_1"]]

#How do I get a featureless model
glm.model.list <- list()
for(test_fold in 1:n.folds){
  folded_data <- OneFold(one_hotted_data, n.folds, test_fold)
  train.set <- folded_data[i = Set == "train"]
  test.set <- folded_data[i = Set == "test"]
  train.inputs <- train.set[, names(input.columns), with = FALSE]
  test.inputs <- test.set[,names(input.columns), with = FALSE]
  test.output <- test.set[, "K2Q35A_1", with = FALSE]
  train.output <- train.set[, "K2Q35A_1", with = FALSE]
  train.double.input <- as.matrix(train.inputs)
  train.double.output <- as.matrix(train.output)
  test.double.input <- as.matrix(test.inputs)
  test.double.output <- as.matrix(test.o)
  my.model.for.fold.k <- cv.glmnet(train.double.input, train.double.output, family = "multinomial")
  glm.model.list[[test_fold]] <- my.model.for.fold.k
}

#predictions
library(glmnet)
n.algorithms <- 2
n.predictions <- n.folds * n.algorithms
glm.predictions.list <- list()
for(fold in 1:n.folds){
  glm.prediction <- predict(glm.model.list[[fold]], test.double.input, type = "class")
  glm.predictions.list[[fold]] <- glm.prediction
}

#Featureless Data
most.frequent.class <- train.output[, .(count=.N), by = main.autism.output.column.name][order(-count)][[main.autism.output.column.name]][1]
featureless.list <- list()
for(fold in 1:n.folds){
  featureless <- rep(most.frequent.class, test.set$Fold == fold)
  featureless.list[[fold]] <- featureless
}

pred.list <- list(glm.predictions.list, featureless.list)

#compute test accuracies
algorithms.list <- c("cv.glmnet", "featureless")
accuracy.dt.list <- list()
for(fold in 1:n.folds){
  for(algorithm in algorithms.list){
    pred.vec <- pred.list[[algorithm]]
    is.correct <- pred.vec == test.double.output
    accuracy.percent <- 100*mean(is.correct)
    accuracy.dt.list[[paste(fold, algorithm
    )]] <- data.table(fold, algorithm, accuracy.percent)
  }
}

accuracy.dt <- data.table::rbindlist(accuracy.dt.list)
data.name <- zip
library(ggplot2)
ggplot()+
  geom_point(aes(
    accuracy.percent, algorithm),
    shape=1,
    data=accuracy.dt)+
  facet_grid(. ~ data.name, labeller=label_both, scales="free")
tdhock commented 12 months ago

What error/output are you getting? The following is not a minimal reproducible example, because n.folds and most.frequent.class are not defined.

featureless.list <- list()
for(fold in 1:n.folds){
  featureless <- rep(most.frequent.class, test.set$Fold == fold)
  featureless.list[[fold]] <- featureless
}

You can tell if it is a MRE by restarting your R (or using reprex::reprex), running the MRE, and if it gives the same error/output/issue are your larger code, then it is an MRE. If it gives different error/output/issue, then it is not useful (you need to provide more details).

tdhock commented 12 months ago

putting the whole script is not minimal (there are lots of parts of the code which are not relevant to the problem you are observing)

trevb11 commented 12 months ago

What is the difference between labels and featureless then?I understand labels are the true class values but what is the featureless data then? Just a vector entirely composed of the major class? All zeros?Thanks,Trevor.On Dec 6, 2023, at 1:38 PM, Toby Dylan Hocking @.**> wrote: predictions should be a vector of predicted classes (one for each data point/row in the test set), and labels should be a vector of label/true class values. Please read https://tdhock.github.io/blog/2023/comparing-ml-frameworks/ in particular the relevant part for computing accuracy is below, pred.list <- list( cv_glmnet=factor(predict( glmnet.model, one.data.split$test$input.mat, type="class")), featureless=rep(most.freq.label, one.data.split$test$set.obs), "1nn"=class::knn( one.data.split$train$input.mat, one.data.split$test$input.mat, one.data.split$train$output.vec)) for(algorithm in names(pred.list)){ pred.vec <- pred.list[[algorithm]] is.correct <- pred.vec == one.data.split$test$output.vec accuracy.percent <- 100mean(is.correct) accuracy.dt.list[[paste( data.name, test.fold, algorithm )]] <- data.table( data.name, test.fold, algorithm, accuracy.percent) }

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you authored the thread.Message ID: @.***>

tdhock commented 12 months ago

featureless is a model which always predicts the most frequent class in train set labels.