nhs-r-community / demos-and-how-tos

A repo for community contributed demos and how-tos to get common stuff done in the R language
https://nhs-r-community.github.io/demos-and-how-tos/
MIT License
29 stars 13 forks source link

Add caseload size calculation #44

Open ChrisBeeley opened 1 year ago

ChrisBeeley commented 1 year ago

Using this script to produce example data (or submitting your own example data) please add a function that calculates the caseload size per team (or teams) by day (or range of days).

See the README for an explanation of the data.

We would love multiple solutions to this please- tidyverse, base, data.table, SQL, Python, you name it :slightly_smiling_face:

ChrisBeeley commented 1 year ago

PS I banged out the synthetic data, if anybody has improvements or comments I'd be glad to hear them

NHSmatthewcallus commented 1 year ago

This is my attempt using SQL:

caseload_size_MC.txt

LeslieHewitt commented 1 year ago

This works for me:

`-- Caseload

declare @StartDate date = '2022-08-31' declare @EndDate date = getdate()

;with cteCaseload as ( select IDPatient ,CONVERT(date,ref.[DateReferral]) as ReferralDate ,ref.[Discharge_Date] ,TeamName ,1 as Caseload FROM MyReferralsTable ref where ref.[DateReferral]<@EndDate and (ref.Discharge_Date>=@StartDate OR ref.Discharge_Date is null) )

select c.IDPatient ,d.Date as EventDate ,c.TeamName ,1 as Caseload from MyDateTable d
left join cteCaseload c on c.ReferralDate <= d.Date
and (c.[Discharge_Date] >= d.Date or c.[Discharge_Date] is null) where d.Date >= @StartDate and d.Date <= @EndDate

`

jontix commented 1 year ago

This one for a dplyr version...

It uses dates from the synthetic data generation script, but that can be changed to whatever dates are required.

team_caseload <- test_frame %>%
  full_join(data.frame(dates), by = character()) %>%
  group_by(client_id, team_desc, dates) %>%
  summarise(is_in = max(referral_date <= dates 
                        & (discharge_date > dates | is.na(discharge_date)))) %>%
  group_by(team_desc, dates) %>%
  summarise(total_caseload_on_date = sum(is_in))
Lextuga007 commented 1 year ago

From @johnmackintosh https://gist.github.com/johnmackintosh/1d7b1763ef3cf77d5a2f6357638606b7

tomjemmett commented 1 year ago

you could treat this as a continous function by pivotting the dates into a single column, then using referral_date as +1, discharge_date as -1. If you then arrange on the pivotted date column, sort by that date, we can then cumulative sum the count column to give us our caseload at any point in time. This would probably be a more interesting approach with a date time column.

There is one big caveat if you were using this, you would need to write a query that extractted date before your period of interest, e.g. .data$discharge_date > start_date | .data$referral_date < end_date | is.na(.data$discharge_date)

library(tidyverse)

add_end_row <- function(.data, end_date = NULL) {
  if (is.null(end_date)) {
    end_date <- max(.data$date)
  }
  bind_rows(.data, summarise(.data, date = end_date, across(count, last)))
}

test_frame |>
  pivot_longer(ends_with("date"), names_to = "date_type", values_to = "date") |>
  drop_na(date) |>
  mutate(count = ifelse(date_type == "referral_date", 1, -1)) |>
  arrange(date) |>
  select(team_desc, date, count) |>
  group_by(team_desc) |>
  mutate(across(count, cumsum)) |>
  add_end_row(Sys.Date()) |>
  ggplot(aes(date, count, colour = team_desc)) +
  geom_step()
Lextuga007 commented 1 year ago

The new package {ivs} https://github.com/DavisVaughan/ivs might also be useful for this YouTube 50:52