reconhub / linelist

An R package to import, clean, and store case data
https://www.repidemicsconsortium.org/linelist
Other
25 stars 5 forks source link

add ties.method param to top_values() #90

Closed zkamvar closed 5 years ago

zkamvar commented 5 years ago

This will fix #88

zkamvar commented 5 years ago

This is not ready yet for review because we still have a fencepost issue:


x <- sample(rep(c("a", "b", "c", "d"), c(3, 3, 2, 2))) 
# factor with reverse levels 
xf <- factor(x, c("d", "c", "b", "a"))

# defaults to reporting first value in a tie 
linelist::top_values(x, n = 1) 
#>  [1] "a"     "other" "other" "other" "other" "other" "other" "a"    
#>  [9] "a"     "other"
linelist::top_values(xf, n = 1) 
#>  [1] other other other b     other other b     other other b    
#> Levels: b other

user can change the ties method

linelist::top_values(x, n = 1, ties.method = "last") 
#>  [1] "other" "other" "other" "b"     "other" "other" "b"     "other"
#>  [9] "other" "b"
linelist::top_values(xf, n = 1, ties.method = "last") 
#>  [1] a     other other other other other other a     a     other
#> Levels: a other

however, there still appears to be a fencepost issue :(

linelist::top_values(x, n = 3) 
#>  [1] "a" "c" "d" "b" "d" "c" "b" "a" "a" "b"
linelist::top_values(xf, n = 3)
#>  [1] a c d b d c b a a b
#> Levels: d c b a

Created on 2019-09-23 by the reprex package (v0.3.0)

zkamvar commented 5 years ago

It is now ready for review. Here is a reprex showing how it works with a character string and factors with reverse levels.


x <- sample(rep(c("a", "b", "c", "d"), c(3, 3, 2, 2))) 

factor with reverse levels

xf <- factor(x, c("d", "c", "b", "a"))

defaults to reporting first value in a tie

linelist::top_values(x, n = 1) 
#>  [1] "a"     "other" "other" "other" "other" "other" "a"     "a"    
#>  [9] "other" "other"
linelist::top_values(xf, n = 1) 
#>  [1] other b     b     other other other other other b     other
#> Levels: b other

user can change the ties method

linelist::top_values(x, n = 1, ties.method = "last") 
#>  [1] "other" "b"     "b"     "other" "other" "other" "other" "other"
#>  [9] "b"     "other"
linelist::top_values(xf, n = 1, ties.method = "last") 
#>  [1] a     other other other other other a     a     other other
#> Levels: a other

selecting n - 1 with ties removes the last value by default

linelist::top_values(x, n = 3)
#>  [1] "a"     "b"     "b"     "other" "c"     "c"     "a"     "a"    
#>  [9] "b"     "other"
linelist::top_values(xf, n = 3) 
#>  [1] a     b     b     d     other other a     a     b     d    
#> Levels: d b a other

changing the ties method changes which value is retained

linelist::top_values(x, n = 3, ties.method = "last") 
#>  [1] "a"     "b"     "b"     "d"     "other" "other" "a"     "a"    
#>  [9] "b"     "d"
linelist::top_values(xf, n = 3, ties.method = "last")
#>  [1] a     b     b     other c     c     a     a     b     other
#> Levels: c b a other

Created on 2019-09-23 by the reprex package (v0.3.0)

cwhittaker1000 commented 5 years ago

Thanks a bunch for this Zhian! I think it would be useful for top_values to throw a warning if there's a tie and some (but not all) of the tied values get thrown into the "others" category. Perhaps something along the lines of:

"Warning: there were a number of tied levels, first value chosen by default"

(or something less vague and better written!)

Also can we reconsider re-opening https://github.com/reconhub/linelist/issues/91? I'm not 100% sure the problem (which isn't related to ties) is described here (unless I'm missing something).

@zkamvar what do you think?

Note that in the example @thibautjombart gave:

top_values(factor(c(1,1,2)), n = 1)
# [1] 1 1 2
# Levels: 1 2

was the output from the using the version of linelist on github currently, whilst:

# [1] 1 1 other
# Levels: 1 other

was the output I got when running the same top_values(factor(c(1,1,2)), n = 1) using the version on CRAN currently.

zkamvar commented 5 years ago

Also can we reconsider re-opening #91? I'm not 100% sure the problem (which isn't related to ties) is described here (unless I'm missing something).

You are correct that #91 isn't described in #88. I was mistaken and closed it because it appeared to be similar to #89, which was closed in favor of #88.

That being said, #91 is actually caused by https://github.com/tidyverse/forcats/issues/184 where the developers of forcats decided to not replace a single factor level with the replacement since it's not really lumping anything. I fixed the issue and documented this behavior in the help page, so this PR will fix #91. Here's a reprex to prove it:

linelist::top_values(factor(c(1, 1, 2)), n = 1)
#> [1] 1     1     other
#> Levels: 1 other

Created on 2019-09-24 by the reprex package (v0.3.0)

Session info ``` r devtools::session_info() #> ─ Session info ────────────────────────────────────────────────────────── #> setting value #> version R version 3.6.1 (2019-07-05) #> os Ubuntu 18.04.2 LTS #> system x86_64, linux-gnu #> ui X11 #> language en_US:en #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz Europe/London #> date 2019-09-24 #> #> ─ Packages ────────────────────────────────────────────────────────────── #> package * version date lib #> assertthat 0.2.1 2019-03-21 [1] #> backports 1.1.4 2019-04-10 [1] #> callr 3.3.1 2019-07-18 [1] #> cli 1.1.0 2019-03-19 [1] #> crayon 1.3.4 2017-09-16 [1] #> desc 1.2.0 2018-05-01 [1] #> devtools 2.2.0 2019-09-07 [1] #> digest 0.6.21 2019-09-20 [1] #> DT 0.9 2019-09-17 [1] #> ellipsis 0.3.0 2019-09-20 [1] #> evaluate 0.14 2019-05-28 [1] #> forcats 0.4.0 2019-02-17 [1] #> fs 1.3.1 2019-05-06 [1] #> glue 1.3.1 2019-03-12 [1] #> highr 0.8 2019-03-20 [1] #> htmltools 0.3.6 2017-04-28 [1] #> htmlwidgets 1.3 2018-09-30 [1] #> knitr 1.25 2019-09-18 [1] #> linelist 0.0.36.9000 2019-09-24 [1] #> magrittr 1.5 2014-11-22 [1] #> memoise 1.1.0 2017-04-21 [1] #> pkgbuild 1.0.5 2019-08-26 [1] #> pkgload 1.0.2 2018-10-29 [1] #> prettyunits 1.0.2 2015-07-13 [1] #> processx 3.4.1 2019-07-18 [1] #> ps 1.3.0 2018-12-21 [1] #> R6 2.4.0 2019-02-14 [1] #> Rcpp 1.0.2 2019-07-25 [1] #> remotes 2.1.0 2019-06-24 [1] #> rlang 0.4.0 2019-06-25 [1] #> rmarkdown 1.15 2019-08-21 [1] #> rprojroot 1.3-2 2018-01-03 [1] #> sessioninfo 1.1.1 2018-11-05 [1] #> stringi 1.4.3 2019-03-12 [1] #> stringr 1.4.0 2019-02-10 [1] #> testthat 2.2.1 2019-07-25 [1] #> usethis 1.5.1.9000 2019-07-08 [1] #> withr 2.1.2.9000 2019-07-08 [1] #> xfun 0.9 2019-08-21 [1] #> yaml 2.2.0 2018-07-25 [1] #> source #> CRAN (R 3.5.3) #> CRAN (R 3.6.0) #> CRAN (R 3.6.1) #> CRAN (R 3.5.3) #> CRAN (R 3.5.1) #> CRAN (R 3.6.1) #> CRAN (R 3.6.1) #> CRAN (R 3.6.1) #> CRAN (R 3.6.1) #> CRAN (R 3.6.1) #> CRAN (R 3.6.0) #> CRAN (R 3.5.2) #> CRAN (R 3.6.0) #> CRAN (R 3.6.1) #> CRAN (R 3.5.3) #> CRAN (R 3.5.1) #> CRAN (R 3.5.1) #> CRAN (R 3.6.1) #> Github (reconhub/linelist@178e8bf) #> CRAN (R 3.5.1) #> CRAN (R 3.5.1) #> CRAN (R 3.6.1) #> CRAN (R 3.5.1) #> CRAN (R 3.5.1) #> CRAN (R 3.6.1) #> CRAN (R 3.5.2) #> CRAN (R 3.5.2) #> CRAN (R 3.6.1) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.1) #> CRAN (R 3.5.1) #> CRAN (R 3.5.1) #> CRAN (R 3.5.3) #> CRAN (R 3.5.2) #> CRAN (R 3.6.1) #> Github (r-lib/usethis@a31b812) #> local #> CRAN (R 3.6.1) #> CRAN (R 3.5.1) #> #> [1] /home/zkamvar/Documents/RLibrary #> [2] /usr/local/lib/R/site-library #> [3] /usr/lib/R/site-library #> [4] /usr/lib/R/library ```
zkamvar commented 5 years ago

Note that in the example @thibautjombart gave:

top_values(factor(c(1,1,2)), n = 1)
# [1] 1 1 2
# Levels: 1 2

was the output from the using the version of linelist on github currently, whilst:

# [1] 1 1 other
# Levels: 1 other

was the output I got when running the same top_values(factor(c(1,1,2)), n = 1) using the version on CRAN currently.

A point on terminology:

  1. linelist is currently not on CRAN
  2. When referring to the github version, it's best to refer to the specific commit hash at which you are testing the code. For example, the master branch is at 58ed4c745ef6ee4799e0912a5704a1176c084db2, but this PR is currently at 178e8bf94b92cd5efcb6817927a134f5f7f6d658

You can get the commit hash you are currently using with packageDescription("linelist")$GithubSHA1, but the reprex package will auto-include this if you use reprex::reprex(si = TRUE). @thibautjombart gives a good example of including the commit number in https://github.com/reconhub/linelist/issues/91#issue-497043464.

Referencing by commit has really helps identify what specific part of the codebase you are using.

thibautjombart commented 5 years ago

Thanks @zkamvar and @cwhittaker1000 (fresh blood! :) ) for looking into this. I like the version handling ties, but I would second a warning in the presence of ties, only when some are relabelled as "other". To expand on @cwhittaker1000 idea:

top_values(c("a", "a", "b", "b", "c"), 2)

should return, without warning:

"a"  "a"  "b"  "b"  "other"

and

top_values(c("a", "a", "b", "b", "c"), 1)

should return:

"a"  "a"  "other"  "other"  "other"

with a warning.

zkamvar commented 5 years ago

@cwhittaker1000, it is ready for your review. Here's a snippet of what the behavior is:

library("linelist")
x <- sample(rep(c("a", "b", "c", "d"), c(3, 3, 2, 2))) 

# will give a warning if there is a tie
top_values(x, n = 1)
#> Warning: a tie among values (a, b) was broken by choosing the first value
#>  [1] "other" "other" "other" "other" "a"     "other" "other" "a"    
#>  [9] "a"     "other"

top_values(x, n = 3)
#> Warning: a tie among values (c, d) was broken by choosing the first value
#>  [1] "c"     "other" "b"     "other" "a"     "b"     "c"     "a"    
#>  [9] "a"     "b"

top_values(x, n = 3, ties.method = "last")
#> Warning: a tie among values (c, d) was broken by choosing the last value
#>  [1] "other" "d"     "b"     "d"     "a"     "b"     "other" "a"    
#>  [9] "a"     "b"

# no ties gives no warning
top_values(c("a", "b", "a", "b", "c"), n = 2)
#> [1] "a"     "b"     "a"     "b"     "other"

Created on 2019-09-26 by the reprex package (v0.3.0)

Session info ``` r devtools::session_info() #> ─ Session info ────────────────────────────────────────────────────────── #> setting value #> version R version 3.6.1 (2019-07-05) #> os macOS Mojave 10.14.6 #> system x86_64, darwin15.6.0 #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz Europe/London #> date 2019-09-26 #> #> ─ Packages ────────────────────────────────────────────────────────────── #> package * version date lib #> assertthat 0.2.1 2019-03-21 [1] #> backports 1.1.4 2019-04-10 [1] #> callr 3.3.2 2019-09-22 [1] #> cli 1.1.0 2019-03-19 [1] #> crayon 1.3.4 2017-09-16 [1] #> desc 1.2.0 2018-05-01 [1] #> devtools 2.2.1 2019-09-24 [1] #> digest 0.6.21 2019-09-20 [1] #> ellipsis 0.3.0 2019-09-20 [1] #> evaluate 0.14 2019-05-28 [1] #> forcats 0.4.0 2019-02-17 [1] #> fs 1.3.1 2019-05-06 [1] #> glue 1.3.1 2019-03-12 [1] #> highr 0.8 2019-03-20 [1] #> htmltools 0.3.6 2017-04-28 [1] #> knitr 1.25 2019-09-18 [1] #> linelist * 0.0.36.9000 2019-09-26 [1] #> magrittr 1.5 2014-11-22 [1] #> memoise 1.1.0.9000 2019-07-21 [1] #> pkgbuild 1.0.5 2019-08-26 [1] #> pkgload 1.0.2 2018-10-29 [1] #> prettyunits 1.0.2 2015-07-13 [1] #> processx 3.4.1 2019-07-18 [1] #> ps 1.3.0 2018-12-21 [1] #> R6 2.4.0 2019-02-14 [1] #> Rcpp 1.0.2 2019-07-25 [1] #> remotes 2.1.0 2019-06-24 [1] #> rlang 0.4.0 2019-06-25 [1] #> rmarkdown 1.15 2019-08-21 [1] #> rprojroot 1.3-2 2018-01-03 [1] #> sessioninfo 1.1.1 2018-11-05 [1] #> stringi 1.4.3 2019-03-12 [1] #> stringr 1.4.0 2019-02-10 [1] #> testthat 2.2.1 2019-07-25 [1] #> usethis 1.5.1 2019-07-04 [1] #> withr 2.1.2 2018-03-15 [1] #> xfun 0.9 2019-08-21 [1] #> yaml 2.2.0 2018-07-25 [1] #> source #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.1) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> Github (reconhub/linelist@260b516) #> CRAN (R 3.6.0) #> Github (hadley/memoise@1650ad7) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> #> [1] /Users/zhian/R #> [2] /Library/Frameworks/R.framework/Versions/3.6/Resources/library ```
zkamvar commented 5 years ago

@cwhittaker1000, @thibautjombart, please let me know now if you want this merged or if you want any other changes. My holiday starts tomorrow, so I won't be doing anything on this until 7 October.

cwhittaker1000 commented 5 years ago

Hey Zhian, firstly thanks so much for sorting this and sorry for the late reply! Looks great - I've just got one small comment. The use of the phrase "first value" in the Warning message works when there's only a two way tie, but isn't technically correct when there's a three way tie. Illustrated below:

x <- sample(rep(c("a", "b", "c", "d", "e", "f"), c(3, 3, 2, 2, 2, 1)))
top_values(x, n = 3)
 [1] "other" "c"     "b"     "other" "a"     "c"     "other" "a"     "a"     "other" "other" "b"     "b"    
Warning message:
a tie among values (c, d, e) was broken by choosing the first value 

 top_values(x, n = 4)
[1] "other" "c"     "b"     "other" "a"     "c"     "d"     "a"     "a"     "d"     "other" "b"     "b"    
Warning message:
a tie among values (c, d, e) was broken by choosing the first value 

I think for the latter, the warning message should read something like:

Warning message:
a tie among values (c, d, e) was broken by choosing the first two values 

Although important to emphasise the behaviour's still correct (the code's outputting the result we're after) so absolutely no problem at all if you can't get to this before you head off on holiday! Sorry again for the late reply and hope you have a wonderful holiday :)