Closed avallecam closed 5 months ago
Cheers @avallecam. I'm going to ponder this for a while as I do understand this may be desirable. The current design is due to me wanting to encourage people to think about how they impute missing values and make this a very explicit choice on their end. They may not want to treat them as zero values so wanted to force the issue, e.g.
# preprocessing minimal to handle NA values
minimal |>
na.omit() |>
incidence2::incidence(
date_index = "date",
counts = "cases_new",
groups = "region",
interval = "month"
)
#> # incidence: 2 x 4
#> # count vars: cases_new
#> # groups: region
#> date_index region count_variable count
#> * <yrmon> <chr> <fct> <dbl>
#> 1 2020-Mar flatland_A cases_new 9
#> 2 2020-Mar flatland_B cases_new 1
minimal |>
filter(!is.na(cases_new)) |>
incidence2::incidence(
date_index = "date",
counts = "cases_new",
groups = "region",
interval = "month"
)
#> # incidence: 2 x 4
#> # count vars: cases_new
#> # groups: region
#> date_index region count_variable count
#> * <yrmon> <chr> <fct> <dbl>
#> 1 2020-Mar flatland_A cases_new 9
#> 2 2020-Mar flatland_B cases_new 1
minimal |>
replace_na(list(cases_new = 0)) |>
incidence2::incidence(
date_index = "date",
counts = "cases_new",
groups = "region",
interval = "month"
)
#> # incidence: 2 x 4
#> # count vars: cases_new
#> # groups: region
#> date_index region count_variable count
#> * <yrmon> <chr> <fct> <dbl>
#> 1 2020-Mar flatland_A cases_new 9
#> 2 2020-Mar flatland_B cases_new 1
I'll leave this open whilst I think about it some more.
The current design is due to me wanting to encourage people to think about how they impute missing values and make this a very explicit choice on their end. They may not want to treat them as zero values (...)
Actually, I agree with the approach you propose. Do you think it would be valuable to add that statement explicitly somewhere in the package documentation?
For example, in the monthly_covid
example in the get started vignette, we get different outputs if we add the suggested preprocessing step to handle NA values, specifically in the output for East Midlands
. It is most visible in the country-level incidence calculation but this one may be a good enough space for this. Or in the incidence() function reference guide.
Actually, I agree with the approach you propose. Do you think it would be valuable to add that statement explicitly somewhere in the package documentation?
Yes that's a good idea.
For example, in the
monthly_covid
example in the get started vignette, we get different outputs if we add the suggested preprocessing step to handle NA values, specifically in the output forEast Midlands
. It is most visible in the country-level incidence calculation but this one may be a good enough space for this. Or in the incidence() function reference guide.
This does look like a good motivating example. We could highlight the existing NA and explain why these have occurred. We could then highlight possible approaches. Do you want to propose a PR for this?
I'd also be interested in referencing other approaches to imputing missing values (not just replacing with 0). Is this something that has come up in any of the EpiVerse tutorials / package vignettes?
Is this something that has come up in any of the EpiVerse tutorials / package vignettes?
yes, from reviewing tutorials for a coming in-person trial. The quantify transmission episode uses {incidence2} data as input for {EpiNow2} (permalink), so I thought of using incidence2::incidence()
as a way to briefly showcase the function (not only the data stored on it which could also came from {covidregionaldata}) but noticed that using it was getting NA's on the initial dates, differing from the initial input written by the episode contributor, so opted to keep a classical {dplyr} approach:
library(dplyr)
cases <- incidence2::covidregionaldataUK %>%
select(date, cases_new) %>%
group_by(date) %>%
summarise(confirm = sum(cases_new, na.rm = TRUE)) %>%
ungroup()
now, from your suggestion, we could edit this step to:
library(tidyr)
cases <- incidence2::covidregionaldataUK %>%
# preprocess missing values
tidyr::replace_na(list(cases_new = 0)) %>%
# compute the daily incidence
incidence2::incidence(
date_index = "date",
counts = "cases_new",
interval = "day",
# rename column outputs to fit {EpiNow2}
date_names_to = "date",
count_values_to = "confirm"
)
Do you want to propose a PR for this?
Glad to start this PR, but I will not be in the capacity to do this in April, since I'll be focused on prep for a month of weekly tutorial trials, or May due to annual leave. So I'm also happy if you start this and I can contribute with my review.
No worries - next time I look at the package I'll try and take a poke.
@avallecam - I'm working on this at the moment. Plan is two-fold:
@avallecam - I'm working on this at the moment. Plan is two-fold:
Document this clearly in the vignette. The included Covid dataset works well for this.
Add a warning when there are NA values present that could cause issues. I'll probably add an option to turn warning off as well but will have it on by default.
sounds good to me @TimTaylor . These actions make explicit the presence of NA outputs from the incidence() function. Point 1 could briefly add examples using na.omit()
, filter(!is.na(cases_new))
, replace_na(list(cases_new = 0))
, as you wrote in the comment above https://github.com/reconverse/incidence2/issues/113#issuecomment-2031357996
Point 1 could briefly add examples using
na.omit()
,filter(!is.na(cases_new))
,replace_na(list(cases_new = 0))
, as you wrote in the comment above #113 (comment)
I'm going with the tidyr::replace_na(list(cases_new = 0))
version in the documentation to emphasise that what we are really doing is imputing missing values. The other two are discarding rows from the data which could also remove other counts we want to use (as incidence()
can take multiple count columns as input).
I'll poke when complete so you can have a gander :smiley:
@avallecam - This is now changed in the development version and associated docs. Relevant section rendered in dev docs at https://www.reconverse.org/incidence2/dev/articles/incidence2.html#computing-incidence-from-pre-aggregated-data if you want a gander.
I've implemented the warning but have not yet implemented options to turn it off. Will raise another issue for this.
Please place an "x" in all the boxes that apply
Please include a brief description of the problem with a code example:
When
NA
values are present in observations, counting cases at any time interval results inNA
observations by default. This feature is not flexible to change compared with alternative code wherena.rm = TRUE
is possible to define forsum()
.Feature request: possibly, make
na.rm
an option for the user?Created on 2024-03-30 with reprex v2.1.0
The
na.rm = FALSE
in this line could be a target of the feature request.https://github.com/reconverse/incidence2/blob/fa764c65452b4ebfc28245299900717367599254/R/incidence.R#L369