Closed zkamvar closed 5 years ago
This is not ready yet for review because we still have a fencepost issue:
x <- sample(rep(c("a", "b", "c", "d"), c(3, 3, 2, 2)))
# factor with reverse levels
xf <- factor(x, c("d", "c", "b", "a"))
# defaults to reporting first value in a tie
linelist::top_values(x, n = 1)
#> [1] "a" "other" "other" "other" "other" "other" "other" "a"
#> [9] "a" "other"
linelist::top_values(xf, n = 1)
#> [1] other other other b other other b other other b
#> Levels: b other
user can change the ties method
linelist::top_values(x, n = 1, ties.method = "last")
#> [1] "other" "other" "other" "b" "other" "other" "b" "other"
#> [9] "other" "b"
linelist::top_values(xf, n = 1, ties.method = "last")
#> [1] a other other other other other other a a other
#> Levels: a other
however, there still appears to be a fencepost issue :(
linelist::top_values(x, n = 3)
#> [1] "a" "c" "d" "b" "d" "c" "b" "a" "a" "b"
linelist::top_values(xf, n = 3)
#> [1] a c d b d c b a a b
#> Levels: d c b a
Created on 2019-09-23 by the reprex package (v0.3.0)
It is now ready for review. Here is a reprex showing how it works with a character string and factors with reverse levels.
x <- sample(rep(c("a", "b", "c", "d"), c(3, 3, 2, 2)))
factor with reverse levels
xf <- factor(x, c("d", "c", "b", "a"))
defaults to reporting first value in a tie
linelist::top_values(x, n = 1)
#> [1] "a" "other" "other" "other" "other" "other" "a" "a"
#> [9] "other" "other"
linelist::top_values(xf, n = 1)
#> [1] other b b other other other other other b other
#> Levels: b other
user can change the ties method
linelist::top_values(x, n = 1, ties.method = "last")
#> [1] "other" "b" "b" "other" "other" "other" "other" "other"
#> [9] "b" "other"
linelist::top_values(xf, n = 1, ties.method = "last")
#> [1] a other other other other other a a other other
#> Levels: a other
selecting n - 1 with ties removes the last value by default
linelist::top_values(x, n = 3)
#> [1] "a" "b" "b" "other" "c" "c" "a" "a"
#> [9] "b" "other"
linelist::top_values(xf, n = 3)
#> [1] a b b d other other a a b d
#> Levels: d b a other
changing the ties method changes which value is retained
linelist::top_values(x, n = 3, ties.method = "last")
#> [1] "a" "b" "b" "d" "other" "other" "a" "a"
#> [9] "b" "d"
linelist::top_values(xf, n = 3, ties.method = "last")
#> [1] a b b other c c a a b other
#> Levels: c b a other
Created on 2019-09-23 by the reprex package (v0.3.0)
Thanks a bunch for this Zhian! I think it would be useful for top_values to throw a warning if there's a tie and some (but not all) of the tied values get thrown into the "others" category. Perhaps something along the lines of:
"Warning: there were a number of tied levels, first value chosen by default"
(or something less vague and better written!)
Also can we reconsider re-opening https://github.com/reconhub/linelist/issues/91? I'm not 100% sure the problem (which isn't related to ties) is described here (unless I'm missing something).
@zkamvar what do you think?
Note that in the example @thibautjombart gave:
top_values(factor(c(1,1,2)), n = 1)
# [1] 1 1 2
# Levels: 1 2
was the output from the using the version of linelist on github currently, whilst:
# [1] 1 1 other
# Levels: 1 other
was the output I got when running the same top_values(factor(c(1,1,2)), n = 1)
using the version on CRAN currently.
Also can we reconsider re-opening #91? I'm not 100% sure the problem (which isn't related to ties) is described here (unless I'm missing something).
You are correct that #91 isn't described in #88. I was mistaken and closed it because it appeared to be similar to #89, which was closed in favor of #88.
That being said, #91 is actually caused by https://github.com/tidyverse/forcats/issues/184 where the developers of forcats decided to not replace a single factor level with the replacement since it's not really lumping anything. I fixed the issue and documented this behavior in the help page, so this PR will fix #91. Here's a reprex to prove it:
linelist::top_values(factor(c(1, 1, 2)), n = 1)
#> [1] 1 1 other
#> Levels: 1 other
Created on 2019-09-24 by the reprex package (v0.3.0)
Note that in the example @thibautjombart gave:
top_values(factor(c(1,1,2)), n = 1) # [1] 1 1 2 # Levels: 1 2
was the output from the using the version of linelist on github currently, whilst:
# [1] 1 1 other # Levels: 1 other
was the output I got when running the same
top_values(factor(c(1,1,2)), n = 1)
using the version on CRAN currently.
A point on terminology:
You can get the commit hash you are currently using with packageDescription("linelist")$GithubSHA1
, but the reprex package will auto-include this if you use reprex::reprex(si = TRUE)
. @thibautjombart gives a good example of including the commit number in https://github.com/reconhub/linelist/issues/91#issue-497043464.
Referencing by commit has really helps identify what specific part of the codebase you are using.
Thanks @zkamvar and @cwhittaker1000 (fresh blood! :) ) for looking into this. I like the version handling ties, but I would second a warning in the presence of ties, only when some are relabelled as "other"
. To expand on @cwhittaker1000 idea:
top_values(c("a", "a", "b", "b", "c"), 2)
should return, without warning:
"a" "a" "b" "b" "other"
and
top_values(c("a", "a", "b", "b", "c"), 1)
should return:
"a" "a" "other" "other" "other"
with a warning.
@cwhittaker1000, it is ready for your review. Here's a snippet of what the behavior is:
library("linelist")
x <- sample(rep(c("a", "b", "c", "d"), c(3, 3, 2, 2)))
# will give a warning if there is a tie
top_values(x, n = 1)
#> Warning: a tie among values (a, b) was broken by choosing the first value
#> [1] "other" "other" "other" "other" "a" "other" "other" "a"
#> [9] "a" "other"
top_values(x, n = 3)
#> Warning: a tie among values (c, d) was broken by choosing the first value
#> [1] "c" "other" "b" "other" "a" "b" "c" "a"
#> [9] "a" "b"
top_values(x, n = 3, ties.method = "last")
#> Warning: a tie among values (c, d) was broken by choosing the last value
#> [1] "other" "d" "b" "d" "a" "b" "other" "a"
#> [9] "a" "b"
# no ties gives no warning
top_values(c("a", "b", "a", "b", "c"), n = 2)
#> [1] "a" "b" "a" "b" "other"
Created on 2019-09-26 by the reprex package (v0.3.0)
@cwhittaker1000, @thibautjombart, please let me know now if you want this merged or if you want any other changes. My holiday starts tomorrow, so I won't be doing anything on this until 7 October.
Hey Zhian, firstly thanks so much for sorting this and sorry for the late reply! Looks great - I've just got one small comment. The use of the phrase "first value" in the Warning message works when there's only a two way tie, but isn't technically correct when there's a three way tie. Illustrated below:
x <- sample(rep(c("a", "b", "c", "d", "e", "f"), c(3, 3, 2, 2, 2, 1)))
top_values(x, n = 3)
[1] "other" "c" "b" "other" "a" "c" "other" "a" "a" "other" "other" "b" "b"
Warning message:
a tie among values (c, d, e) was broken by choosing the first value
top_values(x, n = 4)
[1] "other" "c" "b" "other" "a" "c" "d" "a" "a" "d" "other" "b" "b"
Warning message:
a tie among values (c, d, e) was broken by choosing the first value
I think for the latter, the warning message should read something like:
Warning message:
a tie among values (c, d, e) was broken by choosing the first two values
Although important to emphasise the behaviour's still correct (the code's outputting the result we're after) so absolutely no problem at all if you can't get to this before you head off on holiday! Sorry again for the late reply and hope you have a wonderful holiday :)
This will fix #88