Open joshwlambert opened 6 months ago
This issue might also link with the {linelist} package, depending on whether that package defines a line list standard and whether any of the columns are grouped. If so, an as_incidence.linelist() S3 method might be beneficial. @Bisaloo is this the case or are all {linelist} tags for ungrouped columns?
Cheers @joshwlambert. Yes this is exactly how I'd handle it (outside of incidence). Will add an example along the lines of:

    library(tidyr)
    library(incidence2)

    # Spread the long-style outcome column into one date column per outcome
    outbreaks::ebola_sim_clean$linelist |>
        pivot_wider(names_from = outcome, values_from = date_of_outcome) |>
        incidence(
            date_index = c(
                onset = "date_of_onset",
                hospitalisation = "date_of_hospitalisation",
                death = "Death"
            ),
            interval = "daily"
        )

The issue we have is that incidence2 is expecting wide data (albeit potentially aggregated) whereas this is a mixture of wide and long. Whilst it may be possible to adapt to allow for long-style "outcome" (and associated date) columns, I think the tidyr approach is so elegant that my preference is just to ensure it is documented.
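To make the wide/long mixture concrete, here is a minimal sketch on a made-up four-row line list. The data and the object names (linelist, wide) are illustrative and not taken from {outbreaks}; the sketch assumes {tidyr} is installed and only shows the reshaping step, not the call to incidence().

```r
library(tidyr)

# Hypothetical toy line list: date_of_onset is already "wide" (one column per
# event), while outcome / date_of_outcome is a "long" name-value pair.
linelist <- data.frame(
    case_id = c("a", "b", "c", "d"),
    date_of_onset = as.Date(c("2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03")),
    outcome = c("Death", "Recover", NA, "Death"),
    date_of_outcome = as.Date(c("2024-01-05", "2024-01-09", NA, "2024-01-06"))
)

# One date column per distinct outcome value; still one row per case.
wide <- pivot_wider(
    linelist,
    names_from = outcome,
    values_from = date_of_outcome
)

names(wide)
```

After the pivot, each case still occupies one row, and a case only has a date in the column matching its own outcome, so the Death and Recover columns can be named in date_index like any other date column.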
As you allude to, we could do a lot with an as_incidence.linelist() assuming that handles this wide/long mixture in its specification - @Bisaloo?
As an aside, I may also add a grates version to the examples to illustrate the different approaches.
As you allude to, we could do a lot with an as_incidence.linelist() assuming that handles this wide/long mixture in its specification - @Bisaloo?
Couple of thoughts on this:
- I've been considering it on multiple occasions but I'm still not convinced this should be an as_incidence() method. My view of as_() methods is that they convert objects between two different but equivalent formats. But line list data and aggregated count data are two very different objects. A different way to say it is that in most cases, it should be possible to do the class round-trip in a (quasi-)transparent way, which would not be the case here as there is loss of information when aggregating the data. But maybe this is just a terminology issue.
- Would we actually be able to do much more than with the default incidence() function? E.g., in the case of a column tagged with date_outcome, are we sure that users will always prefer to pivot and convert it to date_death + date_recovery? Or can we imagine that they would be happy to pass date_outcome directly to incidence()? In other words, I see value for such a function only if we can provide more informative defaults than for standard data.frames. Is it really the case here?
- I've been considering it on multiple occasions but I'm still not convinced this should be an as_incidence() method. My view of as_() methods is that they convert objects between two different but equivalent formats. But line list data and aggregated count data are two very different objects. A different way to say it is that in most cases, it should be possible to do the class round-trip in a (quasi-)transparent way, which would not be the case here as there is loss of information when aggregating the data. But maybe this is just a terminology issue.
Interesting. I've always viewed as_ methods as (potentially lossy) casts (e.g. as.integer(1.5)). Alternatively you could make the incidence() function generic but that feels less satisfying (although I cannot put my finger on why).
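The difference between the two views of as_() methods can be seen in base R itself. The sketch below first shows the lossy cast mentioned above, then uses table() as a stand-in for aggregation (it is not how {incidence2} computes counts) to show why a round trip from line list to counts and back cannot recover the individual records. All data are made up.

```r
# A lossy cast: the fractional part is dropped and cannot be recovered.
x <- 1.5
y <- as.integer(x)
as.numeric(y) == x   # FALSE: the round trip does not restore 1.5

# Aggregation is lossy in the same sense. Two different (made-up) line lists ...
onset_a <- data.frame(
    id = c("a", "b", "c"),
    date = as.Date(c("2024-01-01", "2024-01-01", "2024-01-02"))
)
onset_b <- data.frame(
    id = c("x", "y", "z"),
    date = as.Date(c("2024-01-02", "2024-01-01", "2024-01-01"))
)

# ... give the same daily counts, so the counts cannot tell them apart.
counts_a <- table(onset_a$date)
counts_b <- table(onset_b$date)
all(counts_a == counts_b)
```

Whether that loss disqualifies the as_ prefix is the terminology question raised above; the sketch only illustrates that both the cast and the aggregation discard information.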
- Would we actually be able to do much more than with the default incidence() function? E.g., in the case of a column tagged with date_outcome, are we sure that users will always prefer to pivot and convert it to date_death + date_recovery? Or can we imagine that they would be happy to pass date_outcome directly to incidence()? In other words, I see value for such a function only if we can provide more informative defaults than for standard data.frames. Is it really the case here?
These are the interesting questions. I think it very much depends on what a typical input linelist (in the non-package sense) looks like. incidence only really handles wide, potentially aggregated, data as this was, in essence, what I inherited spec-wise. If data generally has this form with additional "outcome" and "date of outcome" columns we could perhaps adapt, but I'm loath to do so without more of a formal spec of a linelist.
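For comparison, the sketch below passes a long-style outcome column straight to incidence() through its groups argument instead of pivoting first. The toy data are made up, and this is only one possible reading of "passing date_outcome directly"; it assumes a version of {incidence2} in which the default count column is named count.

```r
library(incidence2)

# Hypothetical toy line list with a long-style outcome column.
linelist <- data.frame(
    outcome = c("Death", "Death", "Recover", NA, "Recover"),
    date_of_outcome = as.Date(
        c("2024-01-05", "2024-01-05", "2024-01-06", NA, "2024-01-05")
    )
)

# Count outcome dates per day, stratified by the outcome column rather than
# by separate Death / Recover date columns.
by_outcome <- incidence(
    linelist,
    date_index = "date_of_outcome",
    groups = "outcome",
    interval = "daily"
)

by_outcome
```

With this route the outcome stays in a single grouping column of the result, whereas the pivot_wider() route produces one count series per named date column.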
An aside: My gut feeling is that incidence2 is currently a package without a reason. I'm ok with this but think it's important to be open here. I tend to push people towards dplyr/data.table in combination with grates as this is more aligned with how I approach things. It could be useful if there were a range of methods (e.g. models for trend fitting) that were incorporated in incidence2 (calling functions from suggested packages) so people could easily go:
data -> incidence2 -> model fits by groups/counts.
but after 2 to 3 years I don't think there is a desire for this and unless a specific need comes up in ${DAYJOB} it's not something I'll push for.
It is unclear from the current {incidence2} documentation whether the package, specifically incidence2::incidence(), can handle grouped columns. An example of such a line list with a grouped column is the Ebola simulated line list in {outbreaks}.
Created on 2024-04-17 with reprex v2.1.0
The way I've been passing line list data like this to incidence() is using tidyr::pivot_wider() beforehand.
These columns can then be selected using the date_index argument in incidence(). Having the best way to work with this data documented somewhere in the {incidence2} package, or adding functionality to handle it, would be great.