reconhub / linelist

An R package to import, clean, and store case data
https://www.repidemicsconsortium.org/linelist

incorporating capacity for top_values to define levels using only a subset of data #92

Closed cwhittaker1000 closed 5 years ago

cwhittaker1000 commented 5 years ago

Would it be possible to add in functionality to top_values that enables the ranking of occurrences to be determined based only on a subset of the data? The idea would be that top_values would establish the rankings based only on this subset, but then use these rankings for the entirety of the inputted data.

Something along the lines of:

x <- c("a", "a", "a", "b", "b", "c")
top_values(x, n = 1, subset = 4:6)
#> [1] other other other b     b     other
#> Levels: b other

or similarly, if the data you're using contains dates, something like:

a_date <- "2019-09-20"
top_values(y, n, subset = date_report > a_date)

(where date_report would be a vector and a_date would be a single date!)

zkamvar commented 5 years ago

Can you provide a specific use-case for subsetting? Also, if you have something like:

x <- c("a", "a", "a", "b", "b", "c", "b")
top_values(x, n = 1, subset = 4:6)

What would that last "b" be? Would it stay "b" or would it automatically become "other" because it was outside the subset?

Regarding the dates: top_values() only works on characters and factors, so dates won't work in there (though do let me know if I'm missing something).

cwhittaker1000 commented 5 years ago

Sorry Zhian, I haven't explained that particularly well! The important thing to emphasise is that the only function of the subset is to specify the elements over which the rankings should be calculated. These rankings would then be applied to the entirety of the dataset.

So in terms of:

x <- c("a", "a", "a", "b", "b", "c", "b")
top_values(x, n = 1, subset = 4:6)

we'd expect that last "b" to be a "b", not "other". Elements 4:6 (the subset) would only be used to determine the rankings of each unique character/factor level, with any levels absent from the subset but present in the full vector (i.e. "a") implicitly occurring with frequency 0. So in the case above, only c("b", "b", "c") would be used to work out the rankings. "b" would be the most frequent, and because we've asked for only the most frequent value to be retained, all non-"b" values (including "a", despite its absence from the subset) would be regarded as "other" in the full dataset.
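
The behaviour described above can be sketched in base R as follows. This is only an illustration of the requested semantics, not linelist's implementation: top_values_subset_sketch is a hypothetical helper name, and ties are broken in whatever order sort() returns them.

```r
# Hypothetical helper (not part of linelist): rank values using only a
# subset of the data, then apply that ranking to the whole vector.
top_values_subset_sketch <- function(x, n, subset, other = "other") {
  x <- as.character(x)

  # Count occurrences within the subset only; values that are absent
  # from the subset do not appear in the table (frequency 0).
  counts <- sort(table(x[subset]), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(n, length(counts)))]

  # Relabel every non-missing value of the full vector that is not in
  # the top n of the subset.
  x[!is.na(x) & !x %in% keep] <- other
  factor(x, levels = c(keep, if (other %in% x) other))
}

x <- c("a", "a", "a", "b", "b", "c", "b")
res <- top_values_subset_sketch(x, n = 1, subset = 4:6)

# The trailing "b" lies outside the subset but is still kept as "b".
print(res)
```

Because the ranking and the relabelling are separate steps, a value outside the subset keeps its label whenever that label ranks in the top n within the subset.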

Here's a use case: we have a dataset containing information on each case occurring during an outbreak. One piece of information is the health zone that case was reported in. The other is the date on which that case occurred. We'd like to identify the health zones with the most cases, but over a recent time period, say the past 21 days. However, we'd also like to be able to plot the number of cases in these recently active health zones since the beginning of the outbreak and so don't want to filter our dataset to retain only the recent cases.

For example, where x is a dataframe containing a column of dates called date_report and a column of health zone names called health_zone, I envisaged doing something along the following lines:

a_date <- "2019-09-03" # 21 days prior to today
x_with_top_zones <- x %>%
       mutate(recent_active_zones = top_values(health_zone, n = 1, subset = date_report > a_date))

In this case, date_report > a_date would give a logical vector marking the subset of the data (x) over which we want to calculate how often each health zone occurs!
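
As a rough illustration of this use case without the proposed subset argument, the same result can be worked out by hand in base R. The data below are made up ("zone_c" is a placeholder name), and the code is a sketch of the idea rather than anything from linelist.

```r
# Toy line list; the dates and case counts are invented for illustration.
cases <- data.frame(
  health_zone = c("beni", "beni", "beni", "beni", "vuhovi",
                  "vuhovi", "vuhovi", "zone_c", "beni"),
  date_report = as.Date(c("2019-07-01", "2019-07-15", "2019-08-02",
                          "2019-08-20", "2019-08-25", "2019-09-10",
                          "2019-09-15", "2019-09-18", "2019-09-21")),
  stringsAsFactors = FALSE
)

a_date <- as.Date("2019-09-03")        # start of the recent window
recent <- cases$date_report > a_date   # logical vector defining the subset

# Rank health zones using recent cases only.
recent_counts <- sort(table(cases$health_zone[recent]), decreasing = TRUE)
top_zone <- names(recent_counts)[1]

# Apply that ranking to every row, including cases before a_date.
cases$recent_active_zones <- ifelse(cases$health_zone == top_zone,
                                    top_zone, "other")

# Cases per label over the whole outbreak, with no rows filtered out.
print(table(cases$recent_active_zones))
```

In this toy data "beni" has the most cases overall, but "vuhovi" has the most since a_date, so the earlier "vuhovi" case keeps its label and stays available for plotting the full epidemic curve.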

Hopefully that's a little clearer but let me know if anything's still unclear!

zkamvar commented 5 years ago

I believe the solution to this would be to insert a catch into the top_values() code that checks whether subset is not NULL, re-calls top_values() on the subset, and then uses the levels from that to relevel the original vector and return. Somewhat like this:

top_values.factor <- function(x, n, other = "other", ..., subset = NULL) {

  # cromulence checks...

  if (!is.null(subset)) {
    # subset and call the function on the subset
    y <- x[subset]
    y <- top_values(y, n, ..., subset = NULL)

    # find the levels that were dropped in the subset and replace them with other
    other_levels <- setdiff(levels(x), levels(y))
    out <- forcats::fct_other(x, drop = other_levels, other_level = other)
    return(out)
  }

  # check for top values, give warnings, errors, etc. ...

  return(out)

}

thibautjombart commented 5 years ago

I agree with the proposed solution and will implement it.

cwhittaker1000 commented 5 years ago

I've picked up another issue that I'm struggling to diagnose. top_values appears to fail when presented with a subset vector not containing all TRUEs, and with a vector of characters not in alphabetical order:

library(linelist)

# Simple Example - Both run although weird warning
x_test_bad <- c("a", "b", "b")
x_test_good <- c("b", "b", "b")
x_subset <- c(FALSE, TRUE, TRUE)
top_values(x = x_test_bad, n = 1, subset = x_subset)
#> [1] "other" "b"     "b"
top_values(x = x_test_good, n = 1, subset = x_subset)
#> Warning: Unknown levels in `f`: other
#> [1] "b" "b" "b"

# Simple Example (characters reversed) - Both run, one with a warning
x_test_bad <- c("b", "a", "a")
x_test_good <- c("b", "b", "b")
x_subset <- c(TRUE, TRUE, TRUE)
top_values(x = x_test_bad, n = 1, subset = x_subset)
#> [1] "other" "a"     "a"
top_values(x = x_test_good, n = 1, subset = x_subset)
#> Warning: Unknown levels in `f`: other
#> [1] "b" "b" "b"

# Simple Example (except characters reversed) - One Runs, One Fails
x_test_bad <- c("b", "a", "a")
x_test_good <- c("b", "b", "b")
x_subset <- c(FALSE, TRUE, TRUE)
top_values(x = x_test_bad, n = 1, subset = x_subset)
#> Error in names(object) <- nm: 'names' attribute [2] must be the same length as the vector [1]
top_values(x = x_test_good, n = 1, subset = x_subset)
#> Warning: Unknown levels in `f`: other
#> [1] "b" "b" "b"

# Another example with non-alphabetical ordering - One Runs, One Fails
x_test_bad <- c("vuhovi", "beni", "beni")
x_test_good <- c("beni", "beni", "beni")
x_subset <- c(FALSE, TRUE, TRUE)
top_values(x = x_test_bad, n = 1, subset = x_subset)
#> Error in names(object) <- nm: 'names' attribute [2] must be the same length as the vector [1]
top_values(x = x_test_good, n = 1, subset = x_subset)
#> Warning: Unknown levels in `f`: other
#> [1] "beni" "beni" "beni"

Created on 2019-10-01 by the reprex package (v0.2.1)

thibautjombart commented 5 years ago

I think I can boil it down to the following example, without subset:

## case 1
top_values(factor(c('a', 'b'))[-1], n = 1)
#> [1] b
#> Levels: b other

## case 2
top_values(factor(c('b', 'a'))[-1], n = 1)
#> Error in names(object) <- nm : 
#>   'names' attribute [2] must be the same length as the vector [1]

Two issues here, I think partly related to ghost levels:
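
For readers unfamiliar with the term, "ghost levels" here presumably means factor levels that remain after all their observations have been removed. The base R sketch below shows how the two cases above differ in that respect; it only illustrates the concept and does not claim to pin down the cause of the error.

```r
# Subsetting a factor keeps all of its levels by default.
f1 <- factor(c("a", "b"))[-1]   # case 1: value "b" remains
f2 <- factor(c("b", "a"))[-1]   # case 2: value "a" remains

# Both factors still carry both levels, sorted alphabetically.
print(levels(f1))
print(levels(f2))

# table() reports the unused level with a count of zero:
# in case 1 the ghost level is "a", in case 2 it is "b".
print(table(f1))
print(table(f2))

# droplevels() removes levels that no longer have any observations.
print(levels(droplevels(f1)))
print(levels(droplevels(f2)))
```

The only difference between the two cases is whether the zero-count level sorts before or after the level that is actually present.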

thibautjombart commented 5 years ago

Closing this as all functionalities are implemented and tested. The spurious level issue is moved to https://github.com/reconhub/linelist/issues/96