nhs-r-community / NHSRplotthedots

An SPC package to support NHSE/I 'Making Data Count' programme
https://nhs-r-community.github.io/NHSRplotthedots/
Other
47 stars 22 forks source link

NAs causing unexpected special cause flags #156

Open gdfiler opened 2 years ago

gdfiler commented 2 years ago

Hi All I'm new to github and NHSRplotthedots, fairly new to NHSR Community and not that experienced in R either so please be forgiving if this is not appropriate or has been dealt with elsewhere - I came across an apparent issue today which I thought I would share.

I've been playing around with the NHSRplotthedots package and accidently left an NA value in my dataset which caused some special cause flags that should have been common cause.

Here is an example of the issue:

`library(NHSRplotthedots) library(NHSRdatasets) library(dplyr) library(ggplot2) library(scales)

data("ae_attendances")

stable_set <- ae_attendances %>% filter(org_code == "RRK", type ==1, period < as.Date("2018-04-01"))

ptd_spc(stable_set, value_field = breaches, date_field = period, improvement_direction = "decrease")

Note last 6 data points show common cause variation`

image

`#Now set the last data point to NA and rerun SPC

stable_set$breaches[stable_set$period==as.Date("2018-03-01")] <- NA ptd_spc(stable_set, value_field = breaches, date_field = period, improvement_direction = "decrease")

Now last 5 points show special cause variation`

image

Thanks for all the great work you are doing and looking forward to future developments and more packages from the community! Gary

ThomUK commented 2 years ago

Thanks for raising this issue, and excellent catch. We'll need to decide what the behaviour should be here. My preference would be to tolerate the NA, dropping the point from the x-axis (the real world is messy and does occasionally contain unavoidable holes with missing data)

Reproducing the case with no NAs: image

Reproducing the case with an NA: image

The point_type column is calculated in the code below, which looks at both the special_cause_flag, and relative_to_mean columns. These are both NA, so line 29 is executed. https://github.com/nhs-r-community/NHSRplotthedots/blob/19424d38b59184863f01a54280270542e97bc491/R/ptd_calculate_point_type.R#L25-L30

The values for special_cause_flag and relative_to_mean columns are set here, so this is likely where the code to tolerate NAs will need to be added: https://github.com/nhs-r-community/NHSRplotthedots/blob/19424d38b59184863f01a54280270542e97bc491/R/ptd_spc_standard.R#L91-L97

The next stage is probably for someone to write a failing test. It would be worth also checking for other bugs that might be created by the other NAs that appear in that row.

ThomUK commented 2 years ago

Just a note, that as a workaround, you can pre-filter the incoming dataframe to contain only dates with data. This gives a result with gaps in the x axis, which makes the missing data more obvious while not affecting the SPC logic.

data("ae_attendances")

stable_set <- ae_attendances %>%
  filter(org_code == "RRK",
         type ==1,
         period < as.Date("2018-04-01"))

# remove some data
stable_set$breaches[stable_set$period==as.Date("2018-03-01")] <- NA

# removing more data in the middle of the plot, as an illustration
stable_set$breaches[stable_set$period==as.Date("2017-06-01")] <- NA

# filter to remove any dates with NAs
filtered_set <- stable_set %>% filter(!is.na(breaches))

ptd_spc(filtered_set, value_field = breaches, date_field = period, improvement_direction = "decrease")

image

ThomUK commented 2 years ago

I am leaning towards not implementing any changes to tolerate or work around NAs. Perhaps we should throw a warning to prompt to user to look more closely at their data? The user is in control of the data being passed in.

Open to thoughts from others...

gdfiler commented 2 years ago

If I recall correctly, the documentation clearly states to not include NAs. I had read the documentation but still accidently included them so I like the idea of a warning when NAs are present to prompt a check of the data.

When NAs are included there is a risk that the users without experience of SPC may interpret the false special cause flags as real. If an error message will mitigate that risk then all good.

tomjemmett commented 2 years ago

I think the best solution here would be to have a check along the lines of stopifnot(all(!is.na(x_values)) [pseudocode]. The error message should suggest to use tidyr::drop_na(x_values) explicitly.

I don't like the idea of doing implicit dropping of values... the logic would be a mess (if there are no NA's, do nothing. If there are, drop na's but give a warning). I think the cleanness of raising an error but telling what to do to fix the issue is best.

jaspercain commented 6 months ago

I've just had a use case where I have encountered a similar behaviour, but with a different cause. For context, I'm using the function to produce the special cause flag for all ward areas, which is then being exported for use within a heatmap on our BI tool. I have multiple cases where 0 values exist for the whole period (e.g, number of harm falls could be zero for the whole period). These should not be counted as a special cause for concern However, where the entire period is 0, the function cannot calculate the lower or upper control limits, producing an NA for "special_cause_flag", and a subsequent "special_cause_concern"

SPC object NA values

I have put in place a workaround for my workflow with a simple REPLACE exceptions$point_type <- replace(exceptions$point_type, is.na(exceptions$special_cause_flag),values=c("common_cause"))

I'm not sure if this needs a fix as such since this is probably a niche usage, but potentially needs a warning where the control limits are NaN.