more robust accuracy metrics for confusion matrix counts

we've circled around this on calls, but i thought i'd kick off a thread to describe the problem (and propose a solution) here ...

right now our strategy for assessing categorical accuracy of degree assignments treats the outcome as nominal. but in practice, the 1st, 2nd, 3rd ... unrelated could (should?) be interpreted as an ordinal structure. the typical accuracy measurements that we're getting from confusion_matrix() are not taking this into account.

a simple example ... let's assume we have two tools classifying targets as "1", "2", or "3". tool1 correctly classifies 9 of 10, but misclassifies a "1" as a "2". tool2 also correctly classifies 9 of 10, but misclassifies a "1" as a "3". both are 90% accurate, but if these targets have an ordinal meaning then this could be misleading.

the code below demonstrates:

library(tidyverse)
library(skater)

tool1 <- 
  tibble(predicted = c(rep("1",7), "2", "2", "3"),
         truth = c(rep("1",8), "2","3"))

tool2 <-
  tibble(predicted = c(rep("1",7), "3", "2", "3"),
         truth = c(rep("1",8), "2","3"))

cm_tool1 <- confusion_matrix(prediction = tool1$predicted, target = tool1$truth)

## traditional measure of accuracy
cm_tool1 %>%
  pluck("Accuracy") %>%
  pull(Accuracy)

## average balanced accurarcy
cm_tool1 %>%
  pluck("Other") %>%
  filter(Class == "Average") %>%
  pull(`Balanced Accuracy`)

## the tabular output
cm_tool1 %>%
  pluck("Table")

## tool2 confusion matrix
cm_tool2 <- confusion_matrix(prediction = tool2$predicted, target = tool2$truth)

## traditional measure of accuracy
cm_tool2 %>%
  pluck("Accuracy") %>%
  pull(Accuracy)

## average balanced accurarcy
cm_tool2 %>%
  pluck("Other") %>%
  filter(Class == "Average") %>%
  pull(`Balanced Accuracy`)

## the tabular output
cm_tool2 %>%
  pluck("Table")

what to do?

there are definitely methods for assessing ordinal outcomes in a confusion matrix (for example https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5217743/)

in reading about this, one approach that came up is to use the RMSE. given how we have our classification data organized, this shouldnt be a huge lift ... so long as we can apply some numeric value to the "Unrelated" class.

i've written a simple helper function to calculate the RMSE from the confusion matrix "Table" output. this could live in skate and maybe even be called in confusion_matrix() to tack on another metric in the named list

continuing on the example above:

rmse_table <- function(.Table, .unrelated = 10) {
  .Table %>%
    dplyr::as_tibble(.) %>%
    dplyr::mutate(Target = as.numeric(ifelse(Target == "Unrelated", .unrelated, Target))) %>%
    dplyr::mutate(Predicted = as.numeric(ifelse(Predicted == "Unrelated", .unrelated, Predicted))) %>%
    tidyr::uncount(n) %>%
    dplyr::summarise(rmse = sqrt(mean(Predicted-Target)^2)) %>%
    dplyr::pull(rmse)
}

cm_tool1 %>%
  pluck("Table") %>%
  rmse_table()

cm_tool2 %>%
  pluck("Table") %>%
  rmse_table()

and with one of the simulations (including "Unrelated" class):

order_ids <- function(.data) {
  .data %>% 
    mutate(.x1=pmin(id1, id2), .x2=pmax(id1, id2)) %>% 
    mutate(id1=.x1, id2=.x2) %>% 
    select(-.x1, -.x2)
}

truth <-
  read_fam("/data/projects/skate/data/sims_2021-03-30/sims/ASW-gsa37-5GHS-er_0-ms_0-hom_0-everyone.fam") %>%
  fam2ped() %>%
  mutate(x=map(ped, ped2kinpair)) %>%
  select(x) %>%
  unnest(x) %>%
  mutate(d.truth=kin2degree(k, max_degree=7)) %>%
  rename(k.truth=k) %>%
  mutate(d3.truth=ifelse(d.truth>3L, NA, d.truth)) %>% 
  mutate(d4.truth=ifelse(d.truth>4L, NA, d.truth)) %>% 
  mutate(d5.truth=ifelse(d.truth>5L, NA, d.truth)) %>%
  print()

akt <-
  read_akt("/data/projects/skate/data/sims_2021-03-30/akt/ASW-gsa37-5GHS-er_0-ms_0-hom_0.akt") %>%
  order_ids() %>% 
  rename(k.akt=k) %>% 
  mutate(d.akt=kin2degree(k.akt, max_degree=7L)) %>% 
  mutate(d3.akt=ifelse(d.akt>3L, NA, d.akt)) %>% 
  mutate(d4.akt=ifelse(d.akt>4L, NA, d.akt)) %>% 
  mutate(d5.akt=ifelse(d.akt>5L, NA, d.akt))

ibis <-
  read_ibis("/data/projects/skate/data/sims_2021-03-30/ibis/ASW-gsa37-5GHS-er_0-ms_0-hom_0.ibd.coef") %>%
  order_ids() %>% 
  rename(k.ibis=k) %>% 
  rename(d.ibis=degree) %>% 
  mutate(d.ibis =ifelse(d.ibis<0, NA_integer_, d.ibis)) %>% 
  mutate(d3.ibis=ifelse(d.ibis>3L, NA, d.ibis)) %>% 
  mutate(d4.ibis=ifelse(d.ibis>4L, NA, d.ibis)) %>% 
  mutate(d5.ibis=ifelse(d.ibis>5L, NA, d.ibis))

joined <-
  truth %>%
  inner_join(akt, by = c("id1", "id2")) %>% 
  assertr::verify(nrow(.)==nrow(akt)) %>% 
  inner_join(ibis, by = c("id1", "id2")) %>%
  assertr::verify(nrow(.)==nrow(akt)) %>%
  assertr::verify(nrow(.)==nrow(ibis)) %>%
  mutate(D.truth=d3.truth) %>% 
  mutate(D.akt=d3.akt) %>% 
  mutate(D.ibis=d3.ibis) %>% 
  mutate(across(starts_with("D."), ~factor(replace_na(., "Unrelated")))) %>% 
  select(id1, id2, starts_with("k."), starts_with("D.", ignore.case=FALSE))

joined

confusion_matrix(prediction = joined$D.akt, target = joined$D.truth) %>%
  pluck("Accuracy")

confusion_matrix(prediction = joined$D.akt, target = joined$D.truth) %>%
  pluck("Table")

confusion_matrix(prediction = joined$D.akt, target = joined$D.truth) %>%
  pluck("Table") %>%
  rmse_table(.)

confusion_matrix(prediction = joined$D.ibis, target = joined$D.truth) %>%
  pluck("Table") 

confusion_matrix(prediction = joined$D.ibis, target = joined$D.truth) %>%
  pluck("Table") %>%
  rmse_table(.)

confusion_matrix(prediction = joined$D.ibis, target = joined$D.truth) %>%
  pluck("Accuracy")

i think this approach makes sense? i've assigned myself in case we want to pursue further.

@stephenturner and @genignored what do you think ??

I like the idea. I ran through your examples, and created a few of my own, and convincing myself that this works as envisioned. It's "penalizing" a 1 to 3 jump twice as much as a 1 to 2 jump. Simplifying:

rmse <- function(Target, Predicted) sqrt(mean((Target-Predicted)^2))

# one jumps
rmse(Target=rep(1, 10), Predicted=rep(2,10))
#> [1] 1
rmse(Target=rep(2, 10), Predicted=rep(3,10))
#> [1] 1
rmse(Target=rep(3, 10), Predicted=rep(4,10))
#> [1] 1

# two jumps
rmse(Target=rep(1, 10), Predicted=rep(3,10))
#> [1] 2
rmse(Target=rep(2, 10), Predicted=rep(4,10))
#> [1] 2

# Three jumps
rmse(Target=rep(1, 10), Predicted=rep(4,10))
#> [1] 3

IMHO, this is a definite improvement over pure "classification accuracy". But, IIUC, this does not penalize more harshly "more egregious" errors. I would argue (and please steel man a rebuttal here), that calling a true 1D relationship a 2D is much worse than calling a 3D relationship a 4D. Or, for 2d jumps, calling a true 1D a 3d is worse than calling a true 2d a 4d (or unrelated if 3d is your cutoff). As you get more distant it becomes increasingly more difficult to distinguish inference ranges for more distant degrees. Tools that screw up the "easy ones" should get penalized more harshly, right?

If we accept this argument (again, please rebut!), how can we try to encode this into some criterion? Here's a first shot: RMSE of the reciprocals!. There's absolutely zero mathematical or theoretical underpinning to this, other than the intuition that (1/1 - 1/2) is worse than (1/2 - 1/3) which is worse than (1/3 - 1/4), and so on.

It sort of works, I think? The 1D to 2D gives a higher RMSE than the 2d to 3d, than the 3d to 4d. The two jump 1d to 3d gives a worse RMSE than the one jump 1D to 2D, and also worse than the two jump 2D to 4D.

I feel like this is too simple, and I'm missing something elementary.

rmse_recip <- function(Target, Predicted) sqrt(mean((1/Target-1/Predicted)^2))

# one jumps
rmse_recip(Target=rep(1, 10), Predicted=rep(2,10))
#> [1] 0.5
rmse_recip(Target=rep(2, 10), Predicted=rep(3,10))
#> [1] 0.1666667
rmse_recip(Target=rep(3, 10), Predicted=rep(4,10))
#> [1] 0.08333333

# two jumps
rmse_recip(Target=rep(1, 10), Predicted=rep(3,10))
#> [1] 0.6666667
rmse_recip(Target=rep(2, 10), Predicted=rep(4,10))
#> [1] 0.25

# Three jumps
rmse_recip(Target=rep(1, 10), Predicted=rep(4,10))
#> [1] 0.75

This is all with toy data. We'd need to see how this performs in real life. And, the function works fine here, but I'd want to modify it to be (a) more automated/strict about choosing a numeric value for unrelated (D+1, where D=max inferred numeric degree), and (b) more flexible about how unrelateds are represented. Prior to that factoring line, unrelateds are NA, then get turned into Unrelated for convenience later on.

A thing that would throw a wrench into this would be the presence of identical twins / duplicates / identical samples (degree=0). The reciprocal would throw an Inf.

Oh, and one other small observation. In your original code above, looks like you're taking the square of the mean differences instead of the mean of the squared differences. I don't think it matters (?) but may want to fix that.

Oh, and one last thing. With respect to identities (degree==0), an easy hack is to add 0.5 to the pred/target vectors of integers. The same trend described above still works, and doesn't bomb when you have a 0-degree identity.

rmse_recip <- function(Target, Predicted) sqrt(mean((1/(Target+.5)-1/(Predicted+.5))^2))

# one jumps
rmse_recip(Target=rep(0, 10), Predicted=rep(1,10))
#> [1] 1.333333
rmse_recip(Target=rep(1, 10), Predicted=rep(2,10))
#> [1] 0.2666667
rmse_recip(Target=rep(2, 10), Predicted=rep(3,10))
#> [1] 0.1142857
rmse_recip(Target=rep(3, 10), Predicted=rep(4,10))
#> [1] 0.06349206

# two jumps
rmse_recip(Target=rep(1, 10), Predicted=rep(3,10))
#> [1] 0.3809524
rmse_recip(Target=rep(2, 10), Predicted=rep(4,10))
#> [1] 0.1777778

# Three jumps
rmse_recip(Target=rep(1, 10), Predicted=rep(4,10))
#> [1] 0.4444444

RSME seems like excellent option. If I could suggest a potential tweak:

What if instead we measure error as a function of the respective kinship coefficient, rather than the estimated degree of relatedness. I suggest this for a few reasons:

1) We are already seeing from Stephen's work last week that higher error rate data are leading to increased error in estimated degree for all of the tools, but (at least for AKT) the distribution of scores within a particular degree are still apparently normal, and have separation. In other words, the mean kinship coefficient for each given degree is lower than expected, but each group is still distinct.

2) If we use the mean square error (MSE) of measured versus predicted k, we can compare any of the tools without concern for the calculations for the degree of relatedness estimation algorithm. Given the discussions this week about the map files, and what we suspect about changes between populations, this may be of more value in the long run.

3) I am kind of hopeful that this may lead us to finding functions that would allow us to find a method to 'correct' k for comparisons of low quality data.

☝️ @genignored I completely agree - RMSE on the simulated versus detected kinship coefficients is much I think. THis is what August was suggesting last week. That is, not RMSE of measured versus predicted k, but measured versus actual k, which is the subject of #37. Yes, we get the predicted k from ped2kinpair(), but that's what the k should be for that type of relationship, not what was actually simulated. But I still think it's of value to do what Pete's suggesting here, and add that to the confusion matrix function itself (suggestion - add it to the $Accuracy object? Or as another top-level object in the list it returns?). This will give us a better measure on categorical data where categories are ordinal.

signaturescience / skater

more robust accuracy metrics for confusion matrix counts #38