Issue with top_values mishandling ties

cwhittaker1000 commented 5 years ago

In instances where the lowest ranked values are all tied for the number of occurrences, top_values returns them all, even if this means returning more than the n specified number of unique values.

This might ultimately be the desired behaviour (I guess one alternative might be to bundle all of them into Other but that also seems unsatisfactory), but I was wondering whether it might be worth throwing a warning highlighting this?

Here's an example of the behaviour I'm talking about:

library(linelist)
set.seed(10001)

# Create some data 100 in length, consisting of the numbers 1-10, with the numbers 1, 2 and 3 occurring with the same frequency.
fixed <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4)
random <- sample(4:10, 90, replace = TRUE, prob = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7))
category <- c(fixed, random)

# Create a dataframe of random normal draws, each associated with a number from 1-10
values <- rnorm(100, mean = 10, sd = 5)
test_data <- data.frame(cat = as.factor(category), val = values)
top_numbers <- top_values(test_data$cat, n = 8)
unique(top_numbers)

# Output of unique(top_numbers) has more than 8 levels! 
[1] 1  2  3  4  8  7  9  5  6  10
Levels: 1 2 3 4 5 6 7 8 9 10

(Note this also happens when the tied factors aren't the lowest ranked - the behaviour appears to be the product of the fact that the tied factors span and then surpass the specified n )

zkamvar commented 5 years ago

reprex from @thibautjombart:

  library("linelist")
  x <- c('a', 'b', 'a', 'b', 'c')
  top_values(x, 1)
  #> [1] "a" "b" "a" "b" "c"

^{Created on 2019-09-23 by the reprex package (v0.3.0)}

zkamvar commented 5 years ago

There are a couple issues at play here:

This function uses forcats::fct_lump() under the hood, which in turn uses rank() to order the levels. The default is min rank, which effectively just converts ties into their factor numbers, but doesn't really do anything to alleviate them and there appears to be a bug in forcats where it just returns the entire vector if it can't match what the user wants.

The solution is to add the ties.method parameter and set it to "first". Here's an example:

x <- sample(rep(c("a", "b", "c"), c(3, 3, 2)))
forcats::fct_lump(x, n = 1)
#> [1] a a a b c b c b
#> Levels: a b c
forcats::fct_lump(x, n = 1, ties.method = "first")
#> [1] a     a     a     Other Other Other Other Other
#> Levels: a Other

^{Created on 2019-09-23 by the reprex package (v0.3.0)}

I'll make a PR within the hour

cwhittaker1000 commented 5 years ago

Is it possible to add this as an argument for top_values itself (rather than into fct_lump)? I'm just conscious that what users want might be different depending on the circumstances (e.g. in some cases you might want to return all the ties, in others you might not?). Or do you think that's inviting more ambiguity/difficulty?

cwhittaker1000 commented 5 years ago

Also see an instance where top_values can't match what the user wants but instead of returning the entire vector as it, it still converts some of the values (1 in this instance) to other. Not sure if this is relevant, but wanted to flag it in light of what you said about "there appears to be a bug in forcats where it just returns the entire vector if it can't match what the user wants."

x <- c(1, 2, 2, 3, 3, 4, 4, 4)
> top_values(as.factor(x), n = 2)
[1] other 2     2     3     3     4     4     4    
Levels: 2 3 4 other

zkamvar commented 5 years ago

Hi @cwhittaker1000,

Sorry for the confusion, I'm working on a PR to include an explicit ties.method argument into top_values(), I wasn't trying to suggest that people should switch to using forcats directly.

For what it's worth, it's currently possible to include ties.method in the arguments before I update.

Ambiguity is definitely something we want to avoid. We can definitely attempt to tool against different scenarios within reason, but I would ask you to be specific about what you expect.

I still am tooling around with writing tests at the moment, but this is what the function returns based on your example:

  library(linelist)
set.seed(10001)

# Create some data 100 in length, consisting of the numbers 1-10, with the numbers 1, 2 and 3 occurring with the same frequency.
fixed <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4)
random <- sample(4:10, 90, replace = TRUE, prob = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7))
category <- c(fixed, random)

# Create a dataframe of random normal draws, each associated with a number from 1-10
values <- rnorm(100, mean = 10, sd = 5)
test_data <- data.frame(cat = as.factor(category), val = values)
top_numbers <- top_values(test_data$cat, n = 8)
unique(top_numbers)
#> [1] 1     2     other 8     7     9     5     6     10   
#> Levels: 1 2 5 6 7 8 9 10 other

^{Created on 2019-09-23 by the reprex package (v0.3.0)}

cwhittaker1000 commented 5 years ago

Hiya @zkamvar !

Apologies, I should have been (way) more specific! What you've said above clarifies and resolves my queries entirely, thanks! Looking forward to the output!

zkamvar commented 5 years ago

Hiya @zkamvar !

Apologies, I should have been (way) more specific! What you've said above clarifies and resolves my queries entirely, thanks! Looking forward to the output!

No apologies necessary! You are doing great and were quite clear in your expectations from the get-go.

zkamvar commented 5 years ago

note to all, I am currently in progress on this issue with #90. I will ping @thibautjombart and @cwhittaker1000 when it is ready.

zkamvar commented 5 years ago

Hi @cwhittaker1000 and @thibautjombart, the PR is now ready for review. You can install it via

remotes::install_github("reconhub/linelist#90")

reconhub / linelist

Issue with top_values mishandling ties #88