socialresearchcentre / testdat

Data Unit Testing for R
https://socialresearchcentre.github.io/testdat/
Other
7 stars 1 forks source link

Deprecate expect_similar() #18

Closed kinto-b closed 3 years ago

kinto-b commented 3 years ago

Suppose we compare the following,

df1 <- data.frame(key = 1:100, binom = rbinom(100, 1, 0.5))
df2 <- data.frame(key = 1:100, binom = rbinom(100, 1, 0.9))
testdat::expect_similar(binom, df2, binom, data = df1, min = 0)

How it currently works

What does expect_similar() do internally? First it does this

https://github.com/socialresearchcentre/testdat/blob/9f3ef727756836853e3d8e2993428ff5848c3b9e/R/expect-datacomp.R#L33-L34

which yields

Browse[2]> data_tb
# A tibble: 2 x 2
  binom  freq
  <int> <int>
1     0    56
2     1    44
Browse[2]> data2_tb
# A tibble: 2 x 2
  binom  freq
  <int> <int>
1     0     8
2     1    92

Then it does this

https://github.com/socialresearchcentre/testdat/blob/9f3ef727756836853e3d8e2993428ff5848c3b9e/R/expect-datacomp.R#L36-L40

which yields

Browse[2]> act$result
# A tibble: 2 x 5
  binom freq.x freq.y prop_diff pass 
  <int>  <int>  <int>     <dbl> <lgl>
1     0     56      8     0.857 FALSE 
2     1     44     92     1.09  FALSE 

How I thought it would work

But I was expecting it to run the comparison in this way:

 data_tb  <- data  %>% group_by(!!var)  %>% summarise(freq = n(), .groups = "drop") %>% mutate(prop = freq/sum(freq))
 data2_tb <- data2 %>% group_by(!!var2) %>% summarise(freq = n(), .groups = "drop") %>% mutate(prop = freq/sum(freq))

  by_var <- structure(as_name(var), names = as_name(var2))
  act$result <-
    left_join(data_tb, data2_tb, by = by_var) %>%
    mutate(prop_diff = abs(.data$prop.x - .data$prop.y),
           pass = .data$prop_diff < threshold | .data$freq.x < min | .data$freq.y < min)

which yields

Browse[2]> data_tb
# A tibble: 2 x 3
  binom  freq  prop
  <int> <int> <dbl>
1     0    56  0.56
2     1    44  0.44

Browse[2]> data2_tb
# A tibble: 2 x 3
  binom  freq  prop
  <int> <int> <dbl>
1     0     8  0.08
2     1    92  0.92

Browse[2]> act$result
# A tibble: 2 x 7
  binom freq.x prop.x freq.y prop.y prop_diff pass 
  <int>  <int>  <dbl>  <int>  <dbl>     <dbl> <lgl>
1     0     56   0.56      8   0.08      0.48 FALSE
2     1     44   0.44     92   0.92      0.48 FALSE
gorcha commented 3 years ago

To answer the issue title - no :P

This was a hacky rough attempt to incorporate the original frequency (to allow more leeway for high frequency responses, since for e.g. a jump from 2 to 12 percent is usually very different in checking than a jump from 72 to 82).

Should probably use an actual statistical measure instead.

kinto-b commented 3 years ago

Yeah, good point.

What about chisq.test()?

kinto-b commented 3 years ago

Coming back to this, chisq.test() is very sensitive to differences. For instance:

df1 <- data.frame(
  a = sample(1:5, 100000, TRUE),
  b = sample(c(rep(1:5, 5), 1:3), 100000, TRUE),
  c = sample(c(rep(1:5, 25), 1:3), 100000, TRUE),
  d = sample(c(rep(1:5, 125), 1:3), 100000, TRUE)  
 )

df1 %>% group_by(level = a) %>% summarise(n_a = n()) %>% 
  left_join(
    df1 %>% group_by(level = b) %>% summarise(n_b = n()), "level"
  ) %>% 
  left_join(
    df1 %>% group_by(level = c) %>% summarise(n_c = n()), "level"
  ) %>% 
  left_join(
    df1 %>% group_by(level = d) %>% summarise(n_d = n()), "level"
  )

#> # A tibble: 5 x 5
#>   level   n_a   n_b   n_c   n_d
#>   <int> <int> <int> <int> <int>
#> 1     1 20180 21234 20241 20026
#> 2     2 19932 21382 20398 20159
#> 3     3 19820 21494 20255 20050
#> 4     4 20031 17956 19654 19905
#> 5     5 20037 17934 19452 19860

chisq.test(table(df1$a), p = table(df1$b), rescale.p = TRUE)

#>  Chi-squared test for given probabilities
#> 
#> data:  table(df1$a)
#> X-squared = 767.42, df = 4, p-value < 2.2e-16

chisq.test(table(df1$a), p = table(df1$c), rescale.p = TRUE)

#>  Chi-squared test for given probabilities
#> 
#> data:  table(df1$a)
#> X-squared = 44.997, df = 4, p-value = 3.982e-09

chisq.test(table(df1$a), p = table(df1$d), rescale.p = TRUE)

#>  Chi-squared test for given probabilities
#> 
#> data:  table(df1$a)
#> X-squared = 8.7539, df = 4, p-value = 0.06755

If we consider the p-values, we would never say that a is similar to b or c. But to my eyes c looks very similar to a.

We could potentially use the chi-squared statistic directly. But a small chi-square value doesn't necessarily indicate that the distributions are similar, only that we can't confidently tell them apart: that could be because they are similar or because there aren't many data points.

kinto-b commented 3 years ago

Spoke to Andrew about this and he recommended using chi-square.

I'm now questioning this function. It might be best to leave similarity testing to users given that it's fairly hairy

kinto-b commented 3 years ago

@wilcoxa @tonoplast RE: similarity testing discussion this morning

kinto-b commented 3 years ago

@wilcoxa I'm leaning towards deprecating this one. I've slapped an experimental badge on it anyway.

Once this is sorted out, 0.2.0 will be ready to review/merge/release. I think the other issues that are currently outstanding can wait

wilcoxa commented 3 years ago

@kinto-b Yeah experimental is good for the moment - worth keeping track of the progress made at the very least. There's definitely a need for this type of function, but needs some work to make this useful in the wild.

kinto-b commented 3 years ago

@gorcha I think this one should be deprecated before CRAN-ing

gorcha commented 3 years ago

Yep, agreed