Closed cwhittaker1000 closed 5 years ago
reprex from @thibautjombart:
library("linelist")
x <- c('a', 'b', 'a', 'b', 'c')
top_values(x, 1)
#> [1] "a" "b" "a" "b" "c"
Created on 2019-09-23 by the reprex package (v0.3.0)
There are a couple issues at play here:
This function uses forcats::fct_lump()
under the hood, which in turn uses rank()
to order the levels. The default is min rank, which effectively just converts ties into their factor numbers, but doesn't really do anything to alleviate them and there appears to be a bug in forcats where it just returns the entire vector if it can't match what the user wants.
The solution is to add the ties.method
parameter and set it to "first". Here's an example:
x <- sample(rep(c("a", "b", "c"), c(3, 3, 2)))
forcats::fct_lump(x, n = 1)
#> [1] a a a b c b c b
#> Levels: a b c
forcats::fct_lump(x, n = 1, ties.method = "first")
#> [1] a a a Other Other Other Other Other
#> Levels: a Other
Created on 2019-09-23 by the reprex package (v0.3.0)
I'll make a PR within the hour
Is it possible to add this as an argument for top_values
itself (rather than into fct_lump)? I'm just conscious that what users want might be different depending on the circumstances (e.g. in some cases you might want to return all the ties, in others you might not?). Or do you think that's inviting more ambiguity/difficulty?
Also see an instance where top_values can't match what the user wants but instead of returning the entire vector as it, it still converts some of the values (1 in this instance) to other. Not sure if this is relevant, but wanted to flag it in light of what you said about "there appears to be a bug in forcats where it just returns the entire vector if it can't match what the user wants."
x <- c(1, 2, 2, 3, 3, 4, 4, 4)
> top_values(as.factor(x), n = 2)
[1] other 2 2 3 3 4 4 4
Levels: 2 3 4 other
Hi @cwhittaker1000,
Sorry for the confusion, I'm working on a PR to include an explicit ties.method
argument into top_values()
, I wasn't trying to suggest that people should switch to using forcats directly.
For what it's worth, it's currently possible to include ties.method
in the arguments before I update.
Ambiguity is definitely something we want to avoid. We can definitely attempt to tool against different scenarios within reason, but I would ask you to be specific about what you expect.
I still am tooling around with writing tests at the moment, but this is what the function returns based on your example:
library(linelist)
set.seed(10001)
# Create some data 100 in length, consisting of the numbers 1-10, with the numbers 1, 2 and 3 occurring with the same frequency.
fixed <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4)
random <- sample(4:10, 90, replace = TRUE, prob = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7))
category <- c(fixed, random)
# Create a dataframe of random normal draws, each associated with a number from 1-10
values <- rnorm(100, mean = 10, sd = 5)
test_data <- data.frame(cat = as.factor(category), val = values)
top_numbers <- top_values(test_data$cat, n = 8)
unique(top_numbers)
#> [1] 1 2 other 8 7 9 5 6 10
#> Levels: 1 2 5 6 7 8 9 10 other
Created on 2019-09-23 by the reprex package (v0.3.0)
Hiya @zkamvar !
Apologies, I should have been (way) more specific! What you've said above clarifies and resolves my queries entirely, thanks! Looking forward to the output!
Hiya @zkamvar !
Apologies, I should have been (way) more specific! What you've said above clarifies and resolves my queries entirely, thanks! Looking forward to the output!
No apologies necessary! You are doing great and were quite clear in your expectations from the get-go.
note to all, I am currently in progress on this issue with #90. I will ping @thibautjombart and @cwhittaker1000 when it is ready.
Hi @cwhittaker1000 and @thibautjombart, the PR is now ready for review. You can install it via
remotes::install_github("reconhub/linelist#90")
In instances where the lowest ranked values are all tied for the number of occurrences, top_values returns them all, even if this means returning more than the
n
specified number of unique values.This might ultimately be the desired behaviour (I guess one alternative might be to bundle all of them into Other but that also seems unsatisfactory), but I was wondering whether it might be worth throwing a warning highlighting this?
Here's an example of the behaviour I'm talking about:
(Note this also happens when the tied factors aren't the lowest ranked - the behaviour appears to be the product of the fact that the tied factors span and then surpass the specified
n
)