rkillick / changepoint

A place for the development version of the changepoint package on CRAN.
127 stars 33 forks source link

logLik() should return an object of class "logLik" #77

Open beanumber opened 5 months ago

beanumber commented 5 months ago

Thanks for the great work on this package!

The behavior of changepoint::logLik.cpt() is problematic for three reasons:

  1. It returns a numeric vector of length 2 (instead of the expected 1)
  2. It returns an object of class double instead of an object of class logLik, and thus the resulting object doesn't have the attributes that logLik objects should have.
  3. It doesn't return the actual log-likelihood (but rather -2 times the log-likelihood)!

This means that other generic functions already defined in stats like AIC() and BIC() don't work as expected.

library(changepoint)
#> Loading required package: zoo
#> 
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#> 
#>     as.Date, as.Date.numeric
#> Successfully loaded changepoint package version 2.2.4
#>  See NEWS for details of changes.
x <- cpt.meanvar(wave.c44137, penalty = "AIC")

# current behavior
logLik(x)
#>      -2*logLik -2*Loglike+pen 
#>       215528.5       215532.5
str(logLik(x))
#>  Named num [1:2] 215529 215533
#>  - attr(*, "names")= chr [1:2] "-2*logLik" "-2*Loglike+pen"
AIC(x)
#> Error in UseMethod("logLik"): no applicable method for 'logLik' applied to an object of class "cpt"
BIC(x)
#> Error in UseMethod("logLik"): no applicable method for 'logLik' applied to an object of class "cpt"

Created on 2024-04-03 with reprex v2.1.0

I think it would be better if changepoint::logLik.cpt() returned a logLik object with the appropriate attributes and values. Something like this should do the trick:

x <- changepoint::cpt.meanvar(changepoint::wave.c44137, penalty = "AIC")

logLik.cpt <- function(object, ...) {
  y <- changepoint::likelihood(object) |>
    suppressWarnings()
  ll <- -y[1] / 2
  attr(ll, "df") <- length(object@cpts)
  attr(ll, "nobs") <- length(object@data.set)
  class(ll) <- "logLik"
  return(ll)
}

# preferred behavior
logLik(x)
#> 'log Lik.' -107764.3 (df=2)
str(logLik(x))
#> Class 'logLik' : -107764 (df=2)
attributes(logLik(x))
#> $names
#> [1] "-2*logLik"
#> 
#> $df
#> [1] 2
#> 
#> $nobs
#> [1] 63651
#> 
#> $class
#> [1] "logLik"
AIC(x)
#> [1] 215532.5
BIC(x)
#> [1] 215550.7

Created on 2024-04-03 with reprex v2.1.0

rkillick commented 5 months ago

Thanks for starting a discussion on this. There are a couple of reasons I avoided returning the logLik class:

  1. I originally didn't implement the logLik function at all, this was suggested by a reviewer of the JSS paper so was added later.
  2. The changepoints listed are not valid for ANY penalty function, only the penalty that was used to generate the changepoint set. Thus just returning the logLik and allowing users to easily put this into AIC() and BIC() might encourage them to think that this is a valid thing to do.
  3. I return a two-length vector as users often want the raw cost function at the optimal value (for comparison with fits from other optimization procedures) alongside the optimised value from the algorithm used (the penalised log-likelihood+penalty).
  4. I return -2*log-like as this is the cost that is optimised - the documentation and output is very clear on this. I agree that this isn't standard and, whilst I'm open to changing this, it would not be backwards compatible so I'm conscious it may break (without error) a lot of existing code for users.

I'm open to hearing rebuttal to these points.

beanumber commented 5 months ago

Thanks for the quick response!

FWIW, I still think it would be better to:

With respect to your reasoning:

  1. LOL. Was it Reviewer 2??
  2. I'm new to this, so bear with me, but I think you mean "optimal" instead of "valid". That is, I think what you are saying is that since the changepoint set returned by the algorithm (let's call it $\tau$) is only optimal under the use of the penalty function specified in the function call, it would be wrong to then use the log-likelihood reported by the algorithm, compute the AIC (or BIC or whatever different penalty function) and then claim that that value (AIC($\tau$)) is the lowest possible AIC. That would be an incorrect interpretation of the AIC value, because in order for that claim to be justified, you would have to run the algorithm again with this penalty function specified. So that's what you're worried about. We're in agreement on that. However, I don't understand what would be invalid about simply computing AIC($\tau$). It's just a value, and as long as you understand that it's not necessarily the optimal value, I don't see what the danger is. In fact, I suspect it might be useful in a comparative analysis of penalty functions, etc.
  3. Fair enough, but to me this adds grist to my mill, because the two values that you're reporting are more clearly reported by the behavior I've outlined through the functions logLik() and AIC(). If these functions worked as expected as above, it would be more straightforward to do the kind of comparative analysis you're talking about. If you want the penalty value, and you're using AIC, then just run AIC(x) and you get that value. In the current framework, it's logLik(x)[2], which is not so obvious. Also, if you want the actual log-likelihood, you have to compute -logLik(x)/2, which is also weird.
  4. I would think about a different function, called objective_fun() (or whatever) that returns this value. I would also change the name of the vector to be the name of the penalty function. So in the above case, instead of -2*Loglike+pen the name would be AIC (or mBIC or whatever was appropriate).

In general, I think function names should reflect what the function does as much as possible, and methods should work within the guidelines of their corresponding generic. So logLik() should return a log-likelihood value (and corresponding class as defined in stats), AIC() should return an AIC value, and objective_fun() should return the value of the objective function. [Generally, each function should probably return only one value.]

Thanks for considering this. If you're curious, this is the compatibility layer I've written for changepoint in tidychangepoint. So I have already have my workaround. But as per Hadley's advice, I'm trying to push the logLik.cpt() method into changepoint rather than overwriting it (bad manners!) in tidychangepoint! :)

rkillick commented 5 months ago

I'm open to changing this functionality, especially for logLik() as it is bad practice.

  1. This is a problem as people don't understand that when you run BIC() it isn't the optimal set. I have found, many times, that the majority of people that use the package are not stats people. They often find code to do things on the internet and replicate that code for their own data. This is great that people have access to use more complex stats methods because of this but the nuance that when they run lm() and then AIC() that is appropriate, but running cpt.*() and then BIC() isn't appropriate is lost. I enjoy trying to make code that these people can use and understand how to use easily.
  2. As I said above, happy to change the logLik and add an objective_fun() or similarly name function to deal with the penalized likelihood calculation.
  3. The name is hard because either it is a pre-set like BIC, MBIC, or it is "manual" and I don't think that makes sense as a name. Hence I've described it by the mathematical abstraction to words instead - this works for all penalty values.

As a side note, no one should ever be using AIC() with changepoint models - it asymptotically recovers too many changepoints.

beanumber commented 4 months ago

Thanks for indulging me. :)

FYI, I'm now using fitness() as a generic method to return a named vector with the value of the objective function. So it works like this:

library(tidychangepoint)
x <- segment(CET, method = "pelt")
#> method: pelt
fitness(x)
#>     MBIC 
#> 23.56658
x$segmenter
#> Class 'cpt' : Changepoint Object
#>        ~~   : S4 class containing 12 slots with names
#>               cpttype date version data.set method test.stat pen.type pen.value minseglen cpts ncpts.max param.est 
#> 
#> Created on  : Thu Jun  8 14:09:02 2023 
#> 
#> summary(.)  :
#> ----------
#> Created Using changepoint version 2.2.4 
#> Changepoint type      : Change in mean and variance 
#> Method of analysis    : PELT 
#> Test Statistic  : Normal 
#> Type of penalty       : MBIC with value, 23.56658 
#> Minimum Segment Length : 2 
#> Maximum no. of cpts   : Inf 
#> Changepoint Locations : 55 57 309 311 330

Created on 2024-04-23 with reprex v2.1.0

beanumber commented 4 months ago

Oh, but also, I just realized my computation is wrong, because fitness() should return the value of the objective function, not just the penalty!