reconverse / incidence2

Compute and visualise incidence (reworking of the original incidence package)
https://www.reconverse.org/incidence2

Document how to handle grouped line list columns #114

Open joshwlambert opened 6 months ago

joshwlambert commented 6 months ago

It is unclear from the current {incidence2} documentation whether the package, specifically incidence2::incidence(), can handle grouped columns.

An example of such a line list with a grouped column is the Ebola simulated line list in {outbreaks}.

head(outbreaks::ebola_sim_clean$linelist)
#>   case_id generation date_of_infection date_of_onset date_of_hospitalisation
#> 1  d1fafd          0              <NA>    2014-04-07              2014-04-17
#> 2  53371b          1        2014-04-09    2014-04-15              2014-04-20
#> 3  f5c3d8          1        2014-04-18    2014-04-21              2014-04-25
#> 4  6c286a          2              <NA>    2014-04-27              2014-04-27
#> 5  0f58c4          2        2014-04-22    2014-04-26              2014-04-29
#> 6  49731d          0        2014-03-19    2014-04-25              2014-05-02
#>   date_of_outcome outcome gender           hospital       lon      lat
#> 1      2014-04-19    <NA>      f  Military Hospital -13.21799 8.473514
#> 2            <NA>    <NA>      m Connaught Hospital -13.21491 8.464927
#> 3      2014-04-30 Recover      f              other -13.22804 8.483356
#> 4      2014-05-07   Death      f               <NA> -13.23112 8.464776
#> 5      2014-05-17 Recover      f              other -13.21016 8.452143
#> 6      2014-05-07    <NA>      f               <NA> -13.23443 8.468572

Created on 2024-04-17 with reprex v2.1.0

So far I've been passing line list data like this to incidence() by first reshaping it with tidyr::pivot_wider().

library(magrittr)
outbreaks::ebola_sim_clean$linelist %>%
  tidyr::pivot_wider(
    names_from = outcome,
    values_from = date_of_outcome
  )
#> # A tibble: 5,829 × 12
#>    case_id generation date_of_infection date_of_onset date_of_hospitalisation
#>    <chr>        <int> <date>            <date>        <date>                 
#>  1 d1fafd           0 NA                2014-04-07    2014-04-17             
#>  2 53371b           1 2014-04-09        2014-04-15    2014-04-20             
#>  3 f5c3d8           1 2014-04-18        2014-04-21    2014-04-25             
#>  4 6c286a           2 NA                2014-04-27    2014-04-27             
#>  5 0f58c4           2 2014-04-22        2014-04-26    2014-04-29             
#>  6 49731d           0 2014-03-19        2014-04-25    2014-05-02             
#>  7 f9149b           3 NA                2014-05-03    2014-05-04             
#>  8 881bd4           3 2014-04-26        2014-05-01    2014-05-05             
#>  9 e66fa4           2 NA                2014-04-21    2014-05-06             
#> 10 20b688           3 NA                2014-05-05    2014-05-06             
#> # ℹ 5,819 more rows
#> # ℹ 7 more variables: gender <fct>, hospital <fct>, lon <dbl>, lat <dbl>,
#> #   `NA` <date>, Recover <date>, Death <date>

Created on 2024-04-17 with reprex v2.1.0

These columns can then be selected using the date_index argument in incidence().

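library(incidence2)
# `linelist` is the pivoted tibble from the previous step
linelist <- tidyr::pivot_wider(
  outbreaks::ebola_sim_clean$linelist,
  names_from = outcome, values_from = date_of_outcome
)
daily <- incidence(
  linelist,
  date_index = c(
    onset = "date_of_onset",
    death = "Death"
  ),
  interval = "daily"
)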

It would be great to have the best way of working with this kind of data documented somewhere in the {incidence2} package, or to add functionality to handle it directly.
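To make the reshaping step concrete, here is a minimal sketch on a made-up four-row line list. The `toy` data frame and its values are invented for illustration; only tidyr::pivot_wider() and its names_from/values_from arguments come from the example above.

```r
library(tidyr)

# Invented toy line list: one "grouped" pair of columns (outcome + date_of_outcome)
toy <- data.frame(
  case_id = c("a", "b", "c", "d"),
  date_of_onset = as.Date(c("2014-04-07", "2014-04-15", "2014-04-21", "2014-04-27")),
  date_of_outcome = as.Date(c("2014-04-19", NA, "2014-04-30", "2014-05-07")),
  outcome = c(NA, NA, "Recover", "Death")
)

# Spread the outcome dates into one date column per outcome value.
# Rows with a missing outcome label end up in a column literally named "NA".
wide <- pivot_wider(toy, names_from = outcome, values_from = date_of_outcome)

names(wide)
```

After this step each outcome has its own date column, so a name such as "Death" can be passed to date_index like any other date column. The `NA` column holds outcome dates whose outcome label is missing.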

joshwlambert commented 6 months ago

This issue might also relate to the {linelist} package: does that package define a line list standard, and are any of its columns grouped? If so, an as_incidence.linelist() S3 method might be beneficial. @Bisaloo, is this the case, or do all {linelist} tags refer to ungrouped columns?

TimTaylor commented 6 months ago

Cheers @joshwlambert. Yes this is exactly how I'd handle it (outside of incidence). Will add an example along the lines of

outbreaks::ebola_sim_clean$linelist |> 
    tidyr::pivot_wider(names_from = outcome, values_from = date_of_outcome) |>
    incidence2::incidence(
        date_index = c(
            onset = "date_of_onset",
            hospitalisation = "date_of_hospitalisation",
            death = "Death"
        ),
        interval = "daily"
    )

The issue we have is that incidence2 expects wide data (albeit potentially aggregated), whereas this is a mixture of wide and long. Whilst it may be possible to adapt incidence() to allow for long-style "outcome" (and associated date) columns, I think the tidyr approach is so elegant that my preference is just to ensure it is documented.
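For a sense of what such an adaptation would have to do internally, here is a rough base R sketch. `widen_dates()` is a hypothetical helper written for this comment, not a function from incidence2 or tidyr, and the `toy` data are invented. Unlike pivot_wider(), it simply drops dates whose outcome label is missing rather than creating an `NA` column.

```r
# Hypothetical helper: turn a label column + date column into one date column per label
widen_dates <- function(x, names_from, values_from) {
  keys <- as.character(x[[names_from]])
  for (lvl in unique(keys[!is.na(keys)])) {
    col <- x[[values_from]]
    col[!(keys %in% lvl)] <- NA   # keep only the dates belonging to this label
    x[[lvl]] <- col
  }
  # Remove the original long-style pair of columns
  x[[names_from]] <- NULL
  x[[values_from]] <- NULL
  x
}

# Invented toy line list
toy <- data.frame(
  case_id = c("a", "b", "c", "d"),
  date_of_onset = as.Date(c("2014-04-07", "2014-04-15", "2014-04-21", "2014-04-27")),
  date_of_outcome = as.Date(c("2014-04-19", NA, "2014-04-30", "2014-05-07")),
  outcome = c(NA, NA, "Recover", "Death")
)

wide <- widen_dates(toy, names_from = "outcome", values_from = "date_of_outcome")
```

This is essentially a hand-rolled special case of pivot_wider(), which is part of why documenting the tidyr approach seems preferable to maintaining something similar inside the package.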

As you allude to, we could do a lot with an as_incidence.linelist(), assuming that handles this wide/long mixture in its specification - @Bisaloo?

As an aside, I may also add a grates version to the examples to illustrate the different approaches.
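A grates-based version might look roughly like the sketch below. It assumes the dplyr and grates packages are installed, uses invented toy dates, and groups by ISO week via grates::as_isoweek(); it is an illustration of the general approach rather than the example that will end up in the documentation.

```r
library(dplyr)
library(grates)

# Invented onset dates: two in the ISO week starting Monday 2014-04-07, one in the following week
toy <- data.frame(
  case_id = c("a", "b", "c"),
  date_of_onset = as.Date(c("2014-04-07", "2014-04-08", "2014-04-15"))
)

# Convert dates to an ISO-week grouping, then count cases per week
weekly <- toy |>
  mutate(week = as_isoweek(date_of_onset)) |>
  count(week, name = "cases")
```

Here the aggregation is ordinary dplyr code and grates only supplies the grouped-date class, so other groupings (for example by month) would follow the same pattern with a different conversion function.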

Bisaloo commented 6 months ago

As you allude to, we could do a lot with an as_incidence.linelist(), assuming that handles this wide/long mixture in its specification - @Bisaloo?

Couple of thoughts on this:

TimTaylor commented 6 months ago
  • I've considered it on multiple occasions but I'm still not convinced this should be an as_incidence() method. My view of as_() methods is that they convert objects between two different but equivalent formats. But line list data and aggregated count data are two very different objects. A different way to say it is that, in most cases, it should be possible to do the class round-trip in a (quasi-)transparent way, which would not be the case here as there is loss of information when aggregating the data. But maybe this is just a terminology issue.

Interesting. I've always viewed as_ methods as (potentially lossy) casts (e.g. as.integer(1.5)). Alternatively you could make the incidence() function generic, but that feels less satisfying (although I cannot put my finger on why).
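As a small base R illustration of the lossy-cast view (the variable names and dates are made up for this comment):

```r
# A lossy cast: the fractional part is discarded, so the round trip fails
x <- 1.5
y <- as.integer(x)
back <- as.numeric(y)
round_trip_ok <- identical(back, x)

# Aggregating is lossy in a similar way: counts per date no longer identify individual cases
onset <- as.Date(c("2014-04-07", "2014-04-07", "2014-04-15"))
counts <- table(onset)
```

In both cases the conversion is well defined in one direction, but the original object cannot be recovered from the result.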

  • Would we actually be able to do much more than with the default incidence() function? E.g., in the case of a column tagged with date_outcome, are we sure that users will always prefer to pivot and convert it to date_death + date_recovery? Or can we imagine that they would be happy to pass date_outcome directly to incidence()? In other words, I see value in such a function only if we can provide more informative defaults than for standard data.frames. Is that really the case here?

These are the interesting questions. I think it very much depends on what a typical input linelist (in the non-package sense) looks like. incidence only really handles wide, potentially aggregated, data as this was, in essence, what I inherited spec-wise. If data generally has this form with additional "outcome" and "date of outcome" columns we could perhaps adapt, but I'm loath to do so without more of a formal spec of a linelist.

An aside: My gut feeling is that incidence2 is currently a package without a reason. I'm ok with this but think it's important to be open here. I tend to push people towards dplyr/data.table in combination with grates, as this is more aligned with how I approach things. It could be useful if a range of methods (e.g. models for trend fitting) were incorporated in incidence2 (calling functions from suggested packages) so people could easily go:

data -> incidence2 -> model fits by groups/counts.

but after 2 to 3 years I don't think there is a desire for this, and unless a specific need comes up in ${DAYJOB} it's not something I'll push for.