reconverse / incidence2

Compute and visualise incidence (reworking of the original incidence package)
https://www.reconverse.org/incidence2
Other
17 stars 2 forks source link

NA values in observations cause a value of NA to be returned #113

Closed avallecam closed 5 months ago

avallecam commented 7 months ago

Please place an "x" in all the boxes that apply


Please include a brief description of the problem with a code example:

When NA values are present in observations, counting cases at any time interval results in NA observations by default. This feature is not flexible to change compared with alternative code where na.rm = TRUE is possible to define for sum().

Feature request: possibly, make na.rm an option for the user?

library(incidence2)
#> Loading required package: grates
library(tidyverse)

# minimal example
minimal <- tibble(
  date = c(
    ymd(20200304),ymd(20200304),
    ymd(20200305),ymd(20200305)),
  region = c(
    "flatland_A", "flatland_B",
    "flatland_A", "flatland_B"
  ),
  cases_new = c(
    3,1,
    6,NA)
)

minimal
#> # A tibble: 4 × 3
#>   date       region     cases_new
#>   <date>     <chr>          <dbl>
#> 1 2020-03-04 flatland_A         3
#> 2 2020-03-04 flatland_B         1
#> 3 2020-03-05 flatland_A         6
#> 4 2020-03-05 flatland_B        NA

# count all by day (NA returned, expected: 6)
minimal %>% 
  incidence2::incidence(
    date_index = "date",
    counts = "cases_new"
  )
#> # incidence:  2 x 3
#> # count vars: cases_new
#>   date_index count_variable count
#> * <date>     <fct>          <dbl>
#> 1 2020-03-04 cases_new          4
#> 2 2020-03-05 cases_new         NA

# [alternative code to get the expected] 
# count all by day (NA removed in sum)
minimal %>% 
  select(date, cases_new) %>% 
  group_by(date) %>% 
  summarise(confirm = sum(cases_new, na.rm = TRUE)) %>% 
  ungroup()
#> # A tibble: 2 × 2
#>   date       confirm
#>   <date>       <dbl>
#> 1 2020-03-04       4
#> 2 2020-03-05       6

# count by region and month (NA returned, expected: 1)
minimal %>% 
  incidence2::incidence(
    date_index = "date",
    counts = "cases_new",
    groups = "region",
    interval = "month"
  )
#> # incidence:  2 x 4
#> # count vars: cases_new
#> # groups:     region
#>   date_index region     count_variable count
#> * <yrmon>    <chr>      <fct>          <dbl>
#> 1 2020-Mar   flatland_A cases_new          9
#> 2 2020-Mar   flatland_B cases_new         NA

Created on 2024-03-30 with reprex v2.1.0


The na.rm = FALSE in this line could be a target of the feature request.

https://github.com/reconverse/incidence2/blob/fa764c65452b4ebfc28245299900717367599254/R/incidence.R#L369

TimTaylor commented 7 months ago

Cheers @avallecam. I'm going to ponder this for a while as I do understand this may be desirable. The current design is due to me wanting to encourage people to think about how they impute missing values and make this a very explicit choice on their end. They may not want to treat them as zero values so wanted to force the issue, e.g.

# preprocessing minimal to handle NA values
minimal |> 
    na.omit() |>  
    incidence2::incidence(
        date_index = "date",
        counts = "cases_new",
        groups = "region",
        interval = "month"
    )
#> # incidence:  2 x 4
#> # count vars: cases_new
#> # groups:     region
#>   date_index region     count_variable count
#> * <yrmon>    <chr>      <fct>          <dbl>
#> 1 2020-Mar   flatland_A cases_new          9
#> 2 2020-Mar   flatland_B cases_new          1

minimal |> 
    filter(!is.na(cases_new)) |> 
    incidence2::incidence(
        date_index = "date",
        counts = "cases_new",
        groups = "region",
        interval = "month"
    )
#> # incidence:  2 x 4
#> # count vars: cases_new
#> # groups:     region
#>   date_index region     count_variable count
#> * <yrmon>    <chr>      <fct>          <dbl>
#> 1 2020-Mar   flatland_A cases_new          9
#> 2 2020-Mar   flatland_B cases_new          1

minimal |> 
    replace_na(list(cases_new = 0)) |> 
    incidence2::incidence(
        date_index = "date",
        counts = "cases_new",
        groups = "region",
        interval = "month"
    )
#> # incidence:  2 x 4
#> # count vars: cases_new
#> # groups:     region
#>   date_index region     count_variable count
#> * <yrmon>    <chr>      <fct>          <dbl>
#> 1 2020-Mar   flatland_A cases_new          9
#> 2 2020-Mar   flatland_B cases_new          1

I'll leave this open whilst I think about it some more.

avallecam commented 7 months ago

The current design is due to me wanting to encourage people to think about how they impute missing values and make this a very explicit choice on their end. They may not want to treat them as zero values (...)

Actually, I agree with the approach you propose. Do you think it would be valuable to add that statement explicitly somewhere in the package documentation?

For example, in the monthly_covid example in the get started vignette, we get different outputs if we add the suggested preprocessing step to handle NA values, specifically in the output for East Midlands. It is most visible in the country-level incidence calculation but this one may be a good enough space for this. Or in the incidence() function reference guide.

TimTaylor commented 7 months ago

Actually, I agree with the approach you propose. Do you think it would be valuable to add that statement explicitly somewhere in the package documentation?

Yes that's a good idea.

For example, in the monthly_covid example in the get started vignette, we get different outputs if we add the suggested preprocessing step to handle NA values, specifically in the output for East Midlands. It is most visible in the country-level incidence calculation but this one may be a good enough space for this. Or in the incidence() function reference guide.

This does look like a good motivating example. We could highlight the existing NA and explain why these have occurred. We could then highlight possible approaches. Do you want to propose a PR for this?

I'd also be interested in referencing other approaches to imputing missing values (not just replacing with 0). Is this something that has come up in any of the EpiVerse tutorials / package vignettes?

avallecam commented 7 months ago

Is this something that has come up in any of the EpiVerse tutorials / package vignettes?

yes, from reviewing tutorials for a coming in-person trial. The quantify transmission episode uses {incidence2} data as input for {EpiNow2} (permalink), so I thought of using incidence2::incidence() as a way to briefly showcase the function (not only the data stored on it which could also came from {covidregionaldata}) but noticed that using it was getting NA's on the initial dates, differing from the initial input written by the episode contributor, so opted to keep a classical {dplyr} approach:

library(dplyr)

cases <- incidence2::covidregionaldataUK %>%
  select(date, cases_new) %>%
  group_by(date) %>%
  summarise(confirm = sum(cases_new, na.rm = TRUE)) %>%
  ungroup()

now, from your suggestion, we could edit this step to:

library(tidyr)

cases <- incidence2::covidregionaldataUK %>% 
  # preprocess missing values
  tidyr::replace_na(list(cases_new = 0)) %>% 
  # compute the daily incidence
  incidence2::incidence(
    date_index = "date",
    counts = "cases_new",
    interval = "day",
    # rename column outputs to fit {EpiNow2}
    date_names_to = "date",
    count_values_to = "confirm"
  )
avallecam commented 7 months ago

Do you want to propose a PR for this?

Glad to start this PR, but I will not be in the capacity to do this in April, since I'll be focused on prep for a month of weekly tutorial trials, or May due to annual leave. So I'm also happy if you start this and I can contribute with my review.

TimTaylor commented 7 months ago

No worries - next time I look at the package I'll try and take a poke.

TimTaylor commented 5 months ago

@avallecam - I'm working on this at the moment. Plan is two-fold:

  1. Document this clearly in the vignette. The included Covid dataset works well for this.
  2. Add a warning when there are NA values present that could cause issues. I'll probably add an option to turn warning off as well but will have it on by default.
avallecam commented 5 months ago

@avallecam - I'm working on this at the moment. Plan is two-fold:

  1. Document this clearly in the vignette. The included Covid dataset works well for this.

  2. Add a warning when there are NA values present that could cause issues. I'll probably add an option to turn warning off as well but will have it on by default.

sounds good to me @TimTaylor . These actions make explicit the presence of NA outputs from the incidence() function. Point 1 could briefly add examples using na.omit(), filter(!is.na(cases_new)), replace_na(list(cases_new = 0)), as you wrote in the comment above https://github.com/reconverse/incidence2/issues/113#issuecomment-2031357996

TimTaylor commented 5 months ago

Point 1 could briefly add examples using na.omit(), filter(!is.na(cases_new)), replace_na(list(cases_new = 0)), as you wrote in the comment above #113 (comment)

I'm going with the tidyr::replace_na(list(cases_new = 0)) version in the documentation to emphasise that what we are really doing is imputing missing values. The other two are discarding rows from the data which could also remove other counts we want to use (as incidence() can take multiple count columns as input).

I'll poke when complete so you can have a gander :smiley:

TimTaylor commented 5 months ago

@avallecam - This is now changed in the development version and associated docs. Relevant section rendered in dev docs at https://www.reconverse.org/incidence2/dev/articles/incidence2.html#computing-incidence-from-pre-aggregated-data if you want a gander.

I've implemented the warning but have not yet implemented options to turn it off. Will raise another issue for this.