signaturescience / skater

SKATE R Utilities
https://signaturescience.github.io/skater/
Other
9 stars 5 forks source link

Reciprocal RMSE + 0.5 on categorical data #42

Closed stephenturner closed 3 years ago

stephenturner commented 3 years ago

Fixes #38. @vpnagraj please review.

This modifies the confusion_matrix() function to return a new list item, recip_rmse. This requires that the classes for Predicted and Target have >1 numeric classes* and exactly one non-numeric class. E.g., "1", "2", "3", and "Unrelated". If there are zero nonnumeric classes, or if there are >1 nonnumeric classes the returned recip_rmse object will be a string describing why this was not calculated.

* = "numeric classes": I recognize that when the numeric c(1, 2, 3, NA) is turned into the char c("1", "2", "3", "Unrelated"), what I'm calling "numeric classes" aren't actually numeric. The procedure here tries to coerce something like c("1", "2", "3", "Unrelated") into a numeric (while suppressing warnings) and looks for the presence of one and only one NA that results from this attempt. If so, it then sees what the max of the "numeric" classes are (3 in this case), and assigns the non-number ("Unrelated" here) to be the max+1 (i.e., 4 in this case here).

The precise code that implements this procedure is here:

https://github.com/signaturescience/skater/blob/d9cf1c574d2b948949608033e25a361fafafaae0/R/confusion_matrix.R#L410-L440

This PR also updates the docs for that function and adds a few examples to demonstrate this feature.

vpnagraj commented 3 years ago

@stephenturner this looks great! thanks for carrying the idea through into the confusion_matrix() function. i agree that reciprocal is the way to go here.

a comment on the implementation. it looks like the code requires that at least one degree is "unrelated" ... otherwise it returns a character vector: "Reciprocal RMSE not calculated: zero non-numeric classes."

why? wouldnt the reciprocal RMSE also be of interest even if the degrees are 1,2,3,4,5, etc and there are no unrelated?

another note on that code (https://github.com/signaturescience/skater/blob/d9cf1c574d2b948949608033e25a361fafafaae0/R/confusion_matrix.R#L419) ... why return a character vector instead of NA (maybe with a message()) ? asking because i can expect this might be a pain if we stack results on top of each other and the RMSE values get coerced to character

to "prove" the RMSE concept i took the example code in the original issue (https://github.com/signaturescience/skater/issues/38) , which looks at the IBIS vs AKT performance for a simulated dataset. you should be able to run on the code. this one is interesting because the tool with higher accuracy also has a higher RMSE. so the performance comparison flips based on the metric you choose!

library(tidyverse)
library(skater)

truth <-
  read_fam("/data/projects/skate/data/sims_2021-03-30/sims/ASW-gsa37-5GHS-er_0-ms_0-hom_0-everyone.fam") %>%
  fam2ped() %>%
  mutate(x=map(ped, ped2kinpair)) %>%
  select(x) %>%
  unnest(x) %>%
  mutate(d.truth=kin2degree(k, max_degree=7)) %>%
  rename(k.truth=k) %>%
  mutate(d3.truth=ifelse(d.truth>3L, NA, d.truth)) %>% 
  mutate(d4.truth=ifelse(d.truth>4L, NA, d.truth)) %>% 
  mutate(d5.truth=ifelse(d.truth>5L, NA, d.truth)) %>%
  print()

akt <-
  read_akt("/data/projects/skate/data/sims_2021-03-30/akt/ASW-gsa37-5GHS-er_0-ms_0-hom_0.akt") %>%
  arrange_ids(., id1, id2) %>% 
  rename(k.akt=k) %>% 
  mutate(d.akt=kin2degree(k.akt, max_degree=7L)) %>% 
  mutate(d3.akt=ifelse(d.akt>3L, NA, d.akt)) %>% 
  mutate(d4.akt=ifelse(d.akt>4L, NA, d.akt)) %>% 
  mutate(d5.akt=ifelse(d.akt>5L, NA, d.akt))

ibis <-
  read_ibis("/data/projects/skate/data/sims_2021-03-30/ibis/ASW-gsa37-5GHS-er_0-ms_0-hom_0.ibd.coef") %>%
  arrange_ids(., id1, id2) %>% 
  rename(k.ibis=k) %>% 
  rename(d.ibis=degree) %>% 
  mutate(d.ibis =ifelse(d.ibis<0, NA_integer_, d.ibis)) %>% 
  mutate(d3.ibis=ifelse(d.ibis>3L, NA, d.ibis)) %>% 
  mutate(d4.ibis=ifelse(d.ibis>4L, NA, d.ibis)) %>% 
  mutate(d5.ibis=ifelse(d.ibis>5L, NA, d.ibis))

joined <-
  truth %>%
  inner_join(akt, by = c("id1", "id2")) %>% 
  assertr::verify(nrow(.)==nrow(akt)) %>% 
  inner_join(ibis, by = c("id1", "id2")) %>%
  assertr::verify(nrow(.)==nrow(akt)) %>%
  assertr::verify(nrow(.)==nrow(ibis)) %>%
  mutate(D.truth=d3.truth) %>% 
  mutate(D.akt=d3.akt) %>% 
  mutate(D.ibis=d3.ibis) %>% 
  mutate(across(starts_with("D."), ~factor(replace_na(., "Unrelated")))) %>% 
  select(id1, id2, starts_with("k."), starts_with("D.", ignore.case=FALSE))

joined

## first look AKT results
confusion_matrix(prediction = joined$D.akt, target = joined$D.truth) %>%
  pluck("Accuracy")

confusion_matrix(prediction = joined$D.akt, target = joined$D.truth) %>%
  pluck("recip_rmse")

confusion_matrix(prediction = joined$D.akt, target = joined$D.truth) %>%
  pluck("Table")

## now look at IBIS
confusion_matrix(prediction = joined$D.ibis, target = joined$D.truth) %>%
  pluck("Accuracy")

confusion_matrix(prediction = joined$D.ibis, target = joined$D.truth) %>%
  pluck("recip_rmse")

confusion_matrix(prediction = joined$D.ibis, target = joined$D.truth) %>%
  pluck("Table")
stephenturner commented 3 years ago

Thanks for looking at that. Both comments are good. I'll fix both issues. There's no reason that the ultimate degree has to be a char/unrelated, I was just assuming that it would be. The way the degree inference works now is to keep it numeric it tells you 1, 2, 3, NA if you set max degree=3. I later turn that NA into something meaningful ("Unrelated" sorts last as a factor, so it's a convenient side effect for nice-looking contingency tables). Also good idea on NA+message.

stephenturner commented 3 years ago

Thanks @vpnagraj. I think I've fixed the issue in ac2a736a6178141412a30db989bb99b65762f017

If you have >1 non-numeric class it'll message and return an NA.

If you have 1 non-numeric class or 0 non-numeric classes, it works the same.

I changed very little, and it still works, but this has a "smell" to it. If your table looks like this:

> conf_mat2
         Target
Predicted  1  2  3
        1 50  0  0
        2  0 40  0
        3  0 10 50

When you try to figure out which of these colnames isn't a number by seeing which returns an NA when you coerce to as numeric, when you have no non-numeric classes, it returns a zero-length int

>   which_unrelated <- which(is.na(suppressWarnings(as.numeric(conf_mat_names))))
> which_unrelated
integer(0)

Later, you try to pull out what the column names were that corresponded to the non-numeric class by indexing over that zero length int returns a zero length char:

> conf_mat_names[which_unrelated]
character(0)

Here's the smelly part. When you try to modify in place that vector of names ("1", "2", "3") by replacing values indexed by this zero length int, it returns the same vector (which is actually what we want).

> conf_mat_names
[1] "1" "2" "3"

> conf_mat_names[which_unrelated]
character(0)

> conf_mat_names[which_unrelated] <- max_numeric_class+1

> conf_mat_names
[1] "1" "2" "3"

TL;DR, it works, and this is such a fundamental feature of the R language that I don't think it's dangerous, but weird.