Could the intervals be extended to month and/or month-year?

Lextuga007 commented 3 years ago

I want to give patientcounter a try with smoking prevalence data by team or ward. I have information spanning many years, so the best way to 'count' the people open to a team or ward is by referrals per month-year. Patientcounter only goes down to day level - is that right?

johnmackintosh commented 2 years ago

Hi @Lextuga007 - I've only just seen this, not sure why I wasn't notified before.

As far as I know, if it works with cut, it should work here too - see the guidance on the breaks argument in the documentation for cut.POSIXct.
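For reference, base R's cut() on date-times accepts interval names such as "month" or "year" as its breaks argument. A minimal sketch with made-up timestamps (the vector x is purely illustrative and has nothing to do with patientcounter's internals):

```r
# Hypothetical timestamps, declared in UTC to avoid local time zone surprises
x <- as.POSIXct(
  c("2021-01-15 10:30:00", "2021-01-31 23:00:00",
    "2021-02-01 00:00:00", "2022-03-05 08:00:00"),
  tz = "UTC"
)

# Bin each timestamp into the calendar month / calendar year it falls in
by_month <- cut(x, breaks = "month")
by_year  <- cut(x, breaks = "year")

# Tally observations per interval
table(by_month)
table(by_year)
```

Whether a given unit works end to end in the package is a separate question, but this shows the kind of binning that cut itself supports.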

I'd be happy to take a look if you have some trial data you could share (offline)?

will-ball commented 1 year ago

Hey @johnmackintosh did you guys end up finding out if this worked? I'm potentially going to be doing a count of folks added before but not removed from a register on a specific date over multiple years. It appears that specifying "year" would be fine - how would I go about setting the day & month to check at?

johnmackintosh commented 1 year ago

@will-ball I never got round to looking into this in detail. In reference to @Lextuga007's comment, the package doesn't necessarily only go to day level, but it does expect date-times rather than dates. It was created to meet the need for hourly or even finer-grained counts.

If you use the individual level, the function returns a row per individual per interval, including the original start and end datetimes, plus the interval's base date and hour - which you can use to filter results to a specific date and time.

Alternatively, maybe you could use data.table's rolling joins?

https://www.gormanalysis.com/blog/r-data-table-rolling-joins/

https://r-norberg.blogspot.com/2016/06/understanding-datatable-rolling-joins.html
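As a rough illustration of the join-based route, here is a sketch that uses a data.table non-equi join (a close relative of the rolling joins described in the links above, not a rolling join itself) to count who is registered on given census dates. The tables spells and census, and all of their values, are hypothetical.

```r
library(data.table)

# Hypothetical registration spells: one row per spell, ids may repeat
spells <- data.table(
  id      = c(1L, 2L, 3L, 1L),
  added   = as.Date(c("2019-05-01", "2019-08-15", "2020-01-10", "2020-06-01")),
  removed = as.Date(c("2019-09-01", "2020-08-01", "2020-03-01", "2020-09-01"))
)

# Hypothetical census dates to check
census <- data.table(census_date = as.Date(c("2019-07-31", "2020-07-31")))

# For each census date, count spells with added <= census_date < removed.
# Census dates with no matching spell yield an NA id, so they count as 0.
counts <- spells[census,
                 .(n = sum(!is.na(id))),
                 on = .(added <= census_date, removed > census_date),
                 by = .EACHI]

# After a non-equi join the join columns carry the census date values
counts <- counts[, .(census_date = added, n)]
counts
```

This counts spells rather than distinct people, which is the same thing as long as a person's spells never overlap.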

If you have some fake data to play around with, I'd be happy to take a look at all the options.

will-ball commented 1 year ago

Thanks for getting back to me @johnmackintosh

I've not encountered rolling joins before so will take a look, thanks for flagging. I've got a toy dataset to illustrate:

# Simple Example
library(tidyverse)
library(lubridate)
library(truncnorm)

n_people <- 1000

start_date <- as_date("2012-01-01")
end_date <- as_date("2021-12-31")

set.seed(20221214)

data <- as_tibble(
  list(
    id = sample(1:n_people, replace = TRUE),
    added = start_date + sample.int(end_date - start_date, n_people))) %>% 
  mutate(
    removed = added + rtruncnorm(n_people, mean = 30, sd = 15, a = 1, b = 1000),
    days = added %--% removed %/% days(1))

From data which essentially looks like this, I'd like to count how many people are 'registered' on the 31st July each year. I don't think it should complicate anything but the same person can be added/removed multiple times. I will have a play myself but if you get bored and want to take a look let me know.

johnmackintosh commented 1 year ago

See if this gives you what you need, @will-ball:

library(tidyverse)
library(lubridate)
library(truncnorm)

library(patientcounter)

n_people <- 1000

start_date <- as_date("2012-01-01")
end_date <- as_date("2021-12-31")

set.seed(20221214)

data <- as_tibble(
  list(
    id = sample(1:n_people, replace = TRUE),
    added = start_date + sample.int(end_date - start_date, n_people))) %>% 
  mutate(
    removed = added + rtruncnorm(n_people, mean = 30, sd = 15, a = 1, b = 1000),
    days = added %--% removed %/% days(1))

data2 <- data %>% 
  mutate(added  = as.POSIXct(added), 
         removed = as.POSIXct(removed))

results <- interval_census(data2, 
                           identifier = 'id', 
                           admit = "added", 
                           discharge = "removed", 
                           time_unit = '1 day', 
                           results = 'patient')

results[lubridate::month(base_date) == 7 & lubridate::day(base_date) == 31] %>% 
  arrange(id, added)

johnmackintosh commented 1 year ago

results[lubridate::month(base_date) == 7 & lubridate::day(base_date) == 31, .N, .(base_date)]

will give you tallies for each cutoff date

will-ball commented 1 year ago

That works perfectly thanks 😄

johnmackintosh commented 1 year ago

Nice one, @will-ball. Not sure I've been any use to @Lextuga007 yet, so I will leave this open for now.

Lextuga007 commented 1 year ago

Yes, it does look like "year" is supported, as the time_unit parameter feeds into {lubridate} functions. However, when I run a smaller example for years, something strange happens when an end date is already "floored":

library(dplyr)
library(patientcounter)

df <- tibble::tribble(
  ~id,  ~start_date,    ~end_date, ~smoking_status,
   5L, "2024-08-01", NA, "smoker",
   1L, "2019-01-01", "2020-01-01",        "smoker",
   2L, "2019-01-02", "2020-01-02",    "non-smoker",
   3L, "2019-01-03", "2022-01-01",        "smoker",
   4L, "2019-01-04", NA,    "non-smoker"
  ) |> 
  mutate(start_date = as.POSIXct(start_date),
         end_date = as.POSIXct(end_date))

results <- interval_census(df, 
                           identifier = 'id', 
                           admit = "start_date", 
                           discharge = "end_date", 
                           time_unit = 'year', 
                           results = 'patient') |> 
  arrange(id)

id 1 should get 2019 and 2020, but because its end date is 1 January 2020, the 2020 row doesn't show. I'm guessing, but is this related to the date-times, with the time component tipping it back to 2019-12-31? The same happens with id 3, which should get 2019, 2020, 2021 and 2022, but 2022 is dropped.

johnmackintosh commented 1 year ago

Hmm, I wonder if that is timezone related. I haven't tried your code yet, but I've encountered issues around the BST/GMT changeover when UTC has not been explicitly declared.
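To illustrate the kind of shift meant here, a small base R sketch (the dates are arbitrary, and this is not a diagnosis of the example above): midnight in London during BST falls on the previous calendar day when viewed in UTC, whereas a value parsed with tz = "UTC" keeps its date.

```r
# Midnight on 1 July in London is BST (UTC+1)...
london <- as.POSIXct("2024-07-01 00:00:00", tz = "Europe/London")

# ...so the same instant viewed in UTC lands on the previous day
london_in_utc <- format(london, "%Y-%m-%d %H:%M", tz = "UTC")
london_in_utc

# Declaring UTC explicitly when parsing keeps the calendar date stable
utc <- as.POSIXct("2020-01-01", tz = "UTC")
utc_in_utc <- format(utc, "%Y-%m-%d %H:%M", tz = "UTC")
utc_in_utc
```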

I don't have much bandwidth to look into this at present.

Another possible influencing factor is my use of "within" as the method used with foverlaps. I was thinking about making that a parameter in the main function so that folk can use whatever method suits them best.
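A sketch of how the overlap type can matter, using data.table::foverlaps directly on a made-up spell that ends exactly at the start of 2020. This is only an illustration of "within" versus "any"; the tables spells and years are hypothetical and do not reproduce what the package does internally.

```r
library(data.table)

# Hypothetical spell ending exactly at midnight on 1 January 2020
spells <- data.table(
  id    = 1L,
  start = as.POSIXct("2019-01-01", tz = "UTC"),
  end   = as.POSIXct("2020-01-01", tz = "UTC")
)
setkey(spells, start, end)  # foverlaps needs y keyed on its interval columns

# Hypothetical yearly intervals to test against the spell
years <- data.table(
  yr_start = as.POSIXct(c("2019-01-01", "2020-01-01"), tz = "UTC"),
  yr_end   = as.POSIXct(c("2019-12-31 23:59:59", "2020-12-31 23:59:59"), tz = "UTC")
)

# "within": the yearly interval must lie entirely inside the spell
within_hits <- foverlaps(years, spells, by.x = c("yr_start", "yr_end"),
                         type = "within", nomatch = 0L)

# "any": the yearly interval only needs to touch the spell
any_hits <- foverlaps(years, spells, by.x = c("yr_start", "yr_end"),
                      type = "any", nomatch = 0L)

nrow(within_hits)
nrow(any_hits)
```

Under these assumptions the 2020 interval should match only with type = "any", since the spell touches that year at a single instant.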

Will try and get that sorted soon.

Lextuga007 commented 1 year ago

Tom Jemmett (https://github.com/tomjemmett) wrote this code, which I've adapted for the data I used. It's made me realise that what I need to count is not really a census: for something like prevalence, I don't want to subtract people who leave.

df |> 
  tidyr::pivot_longer(-c(id, smoking_status), 
                      values_to = "date") |>
  dplyr::mutate(n = ifelse(name == "start_date", 1, -1)) |>
  tidyr::replace_na(list(date = lubridate::today())) |> 
  dplyr::mutate(date = lubridate::floor_date(date, "year")) |> 
  dplyr::arrange(date, smoking_status) |>
  dplyr::mutate(c = cumsum(n),
                .by = smoking_status) |> 
  dplyr::select(-name, -id, -n) |>  
  dplyr::slice_tail(n = 1, by = c(date, smoking_status)) |> 
  tidyr::complete(date = seq(min(date), max(date), by = "year")) |> 
  tidyr::fill(c(c, smoking_status)) |>
  tidyr::replace_na(list(c = 0))

I think for prevalence I'd need to drop the step that generates a -1 for each exit.
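One possible reading of that change, as a minimal sketch: count entries only and accumulate them by year, with no exits subtracted. The data reuses the start dates and statuses from the small example above; the column names year, entries and cumulative are made up for illustration, and whether this is the right definition of prevalence is an assumption.

```r
library(dplyr)

# Start dates and statuses from the small example above (end dates not needed)
df <- tibble::tribble(
  ~id, ~start_date,  ~smoking_status,
  5L,  "2024-08-01", "smoker",
  1L,  "2019-01-01", "smoker",
  2L,  "2019-01-02", "non-smoker",
  3L,  "2019-01-03", "smoker",
  4L,  "2019-01-04", "non-smoker"
) |>
  mutate(start_date = as.Date(start_date))

# Entries per status per year, then a running total with no exits subtracted
ever_started <- df |>
  mutate(year = as.integer(format(start_date, "%Y"))) |>
  count(smoking_status, year, name = "entries") |>
  arrange(smoking_status, year) |>
  mutate(cumulative = cumsum(entries), .by = smoking_status)

ever_started
```

Years with no new entries are not filled in here; tidyr::complete, as used in the code above, could add them.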

johnmackintosh / patientcounter

Could the intervals be extended to month and/or month-year? #14