jack-davison / ggopenair

{ggplot2} incarnation of {openair} ☁️
https://jack-davison.github.io/ggopenair/
GNU General Public License v3.0
6 stars 0 forks source link

Implement `theilSen()` #4

Open jack-davison opened 1 year ago

jack-davison commented 1 year ago

This is the one openair function that doesn't have a ggopenair equivalent, in part because it is the one that feels most like it should be a geom or stat.

One could imagine doing the below, for example:

ggplot(mydata, aes(x = date, y = nox)) + geom_theilsen()

Need to check if this already exists somewhere else or, if not, need to confirm how to implement it.

If it is the case that geom_smooth() can achieve this straightforwardly, then there may need to be no need to add this at all (in the same way that smoothTrend() doesn't need migrating.

mooibroekd commented 9 months ago

Hi @jack-davison and @davidcarslaw ,

I have done some comparison with two packages I have found that allow for theilSen calculations: mblm and RobustLinearReg. Both these functions can be inserted in geom_smooh() using the following approach:

library(mblm)
library(openair)
library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

mydata <- openair::mydata

# define the function
sen <- function(..., weights = NULL) {
  #RobustLinearReg::theil_sen_regression(...)
  mblm::mblm(..., repeated = FALSE)
}

p1 <- mydata |>
  openair::timeAverage(avg.time = "month") |> 
  ggplot2::ggplot(aes(date, nox)) +   
  ggplot2::geom_point() +   
  ggplot2::geom_line() +
  ggplot2::geom_smooth(method = sen) +
  ggplot2::theme_minimal()

p1
#> `geom_smooth()` using formula = 'y ~ x'


# Do the openair plot:
theilsen_data <- TheilSen(mydata, 
                          pollutant = "nox", 
                          avg.time = "month",
                          statistic = "mean",
                          deseason = FALSE)
#> Taking bootstrap samples. Please wait.


# Combine the data into one plot
p2 <- p1 + 
  geom_abline(slope = theilsen_data$data$res2$slope/(60*60*24*365.2422),
              intercept = theilsen_data$data$res2$intercept,
              color = "red",
              linewidth = 1) +
  geom_abline(slope = theilsen_data$data$res2$lower/(60*60*24*365.2422),
              intercept = theilsen_data$data$res2$intercept.lower,
              color = "red",
              linewidth = 0.5,
              linetype = "dashed") +
  geom_abline(slope = theilsen_data$data$res2$upper/(60*60*24*365.2422),
              intercept = theilsen_data$data$res2$intercept.upper,
              color = "red",
              linewidth = 0.5,
              linetype = "dashed")

p2
#> `geom_smooth()` using formula = 'y ~ x'


###############################################
# check the outcomes of slope and intercepts:
###############################################

# Calculate the model directly
mblm_data <- mydata |>
  dplyr::select(date, nox) |>
  openair::timeAverage(avg.time = "month") |>
  na.omit() |> # needed as mblm cannot handle NA
  dplyr::mutate(date = as.numeric(date)) # get the seconds

model <- mblm::mblm(formula = nox~date,
                    mblm_data,
                    repeated = FALSE)

# slope from theilSen openair:
theilsen_data$data$res2$slope
#> [1] -8.44159

# slope from mblm
model$coefficients[[2]] * (60*60*24*365.2422)
#> [1] -8.447192

# confidence intervals of the slope
c(theilsen_data$data$res2$lower, theilsen_data$data$res2$upper)
#> [1] -11.671282  -5.439937

mblm::confint.mblm(model)[2,]  * (60*60*24*365.2422)
#>     0.025     0.975 
#> -1325.990  1364.582

confint.default(model)[2,]  * (60*60*24*365.2422)
#>     2.5 %    97.5 % 
#> -11.95284  -4.94154

# The latter looks more like the ones from theilSen() in openair. The first 
# CI's are weird to say the least. 

# intercept from theilSen openair:
theilsen_data$data$res2$intercept
#> [1] 441.5602

# intercept from mblm
model$coefficients[[1]]
#> [1] 439.9073

Created on 2023-12-20 with reprex v2.0.2

In general, we can see that the slope is very similar, and there might be a difference as I might not have used the correct correction factor (number of seconds in a year). There is a slight offset in the intercept which I cannot explain fully.

Some other observations from my testing: