jbytecode / LinRegOutliers

Direct and robust methods for outlier detection in linear regression
MIT License

feature request: methods for `(y,X)` #14

Closed · PaulSoderlind closed this issue 3 years ago

PaulSoderlind commented 3 years ago

Hi,

following up on the discussion on Discourse, I would kindly ask for methods for `(y,X)`.

Motivation: while GLM and friends are often useful, it is sometimes easier to just do `b = X\y` etc.

Feasibility: looking at your code, it sometimes (like in lad.jl) starts with

X = designMatrix(setting)
Y = responseVector(setting)

In these cases, it should be straightforward to add methods. (I am busy with teaching right now, but might be able to submit PRs later this autumn.)

jbytecode commented 3 years ago

Hi @PaulSoderlind ,

I am not against defining models in this way, because the terms in a linear regression are always summed up in the same form. My primary motivation was to use a familiar interface for those coming from R and for Julia users who already use GLM.

Thanks to multiple dispatch, new methods with the same name can be implemented, for example

bch(setting::RegressionSetting; alpha=0.05, maxiter=1000, epsilon=0.000001)

This method uses the standard @formula type definition. Another bch would be

bch(data::Tuple{Matrix, Array{Float64, 1}}; alpha=0.05, maxiter=1000, epsilon=0.000001)

or something similar.

It is important to have the first method because, despite being linear, some models have more complex design matrices that are difficult to construct by hand: including an intercept or not, dummy variables, shifting the intercept or the slope (or both) with dummies, etc. This separates the concept of 'summation of independent variables' from that of 'the design matrix'.
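As a small illustration of what the formula interface handles automatically, here is a sketch using StatsModels directly (the toy data and variable names are invented for this example):

using DataFrames, StatsModels

# Toy data with a categorical regressor g.
df = DataFrame(y = [1.0, 2.0, 3.0, 4.0],
               x = [0.5, 1.5, 2.5, 3.5],
               g = ["a", "a", "b", "b"])

# Intercept, slope, a dummy for g, and a slope shift by g, in one line.
f = @formula(y ~ 1 + x + g + x & g)
mf = ModelFrame(f, df)
X = ModelMatrix(mf).m   # design matrix with dummy columns expanded
y = response(mf)        # response vector

Building this X by hand would mean writing out the dummy and interaction columns manually.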

Do you agree with this? If yes, we can discuss implementation details later.

PaulSoderlind commented 3 years ago

Yes, this sounds good. Thanks

jbytecode commented 3 years ago

Okay @PaulSoderlind, it would be good to see you as a contributor. It does not have to be complete; you can make pull requests that include partial changes when you have time. Thank you in advance. Welcome :)

PaulSoderlind commented 3 years ago

Hi, I would be happy to contribute, but it will take some time due to my teaching.

Still, that does not have to stop us from thinking a bit about how to do it. To my mind, the best approach would be to use dispatch in such a way that there is no need to duplicate code. To illustrate what I mean, consider this refactoring of lad.jl:

function lad(data::Tuple{Vector, Matrix}; starting_betas=nothing)
    (y, X) = data
    # ...all the current lad code
    return result
end

function lad(setting::RegressionSetting; starting_betas=nothing)
    X = designMatrix(setting)
    y = responseVector(setting)
    result = lad((y, X), starting_betas=starting_betas)
    return result
end

This would be convenient since the second version (with setting...) can easily call the first version (with (y, X)). The other way around looks more complicated, but maybe you know how to do it.

jbytecode commented 3 years ago

Yes, but some other algorithms use the RegressionSetting object in more than one place, and there it will be more complicated. Converters from data::Tuple{Vector, Matrix} to RegressionSetting and vice versa should help.

jbytecode commented 3 years ago

what do you think about

convert(RegressionSetting, (y, X)) ?
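One possible shape for such a converter (only a sketch; the generated column names x1, ..., xp, the programmatic formula construction, and the call to createRegressionSetting are assumptions for illustration, not code from the package):

using DataFrames, StatsModels

function Base.convert(::Type{RegressionSetting}, data::Tuple{Vector, Matrix})
    (y, X) = data
    p = size(X, 2)
    # Wrap the raw arrays in a DataFrame with generated column names.
    df = DataFrame(X, ["x$i" for i in 1:p])
    df[!, :y] = y
    # Build the formula y ~ 1 + x1 + ... + xp programmatically.
    rhs = ConstantTerm(1) + sum(Term(Symbol("x", i)) for i in 1:p)
    return createRegressionSetting(FormulaTerm(Term(:y), rhs), df)
end

The other direction is just designMatrix and responseVector, as above.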

PaulSoderlind commented 3 years ago

> Yes, but some other algorithms use the RegressionSetting object in more than one place, and there it will be more complicated. Converters from data::Tuple{Vector, Matrix} to RegressionSetting and vice versa should help.

So, does RegressionSetting contain any information that cannot be extracted from (y,X)?

jbytecode commented 3 years ago

No. RegressionSetting includes a formula and a dataset, and theoretically a design matrix X and a response vector y perfectly define a linear model.

jbytecode commented 3 years ago

(X, y)-style multiple dispatch is implemented for all algorithms except ransac(). Dear @tantei3, please read the implementations of the other algorithms carefully and implement the method ransac(X::Array{Float64, 2}, y::Array{Float64, 1}, ...) as in hs93, py95, or ks89. A new data structure OLS is introduced in /src/ols.jl with helper methods residuals(), predict(), coef(), etc. The ols() and wls() methods are for linear regression and weighted linear regression, respectively; with these implementations we no longer need lm() from the GLM package. After adapting ransac I will rearrange the requirements in LinRegOutliers.jl. Fyi.
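For reference, a small usage sketch of the helpers described above (illustrative only; the toy data and, in particular, the positional wls argument order are assumptions):

using LinRegOutliers

X = [1.0 1.0; 1.0 2.0; 1.0 3.0; 1.0 4.0]   # design matrix with an intercept column
y = [2.1, 3.9, 6.2, 7.8]

olsreg = ols(X, y)        # ordinary least squares fit
coef(olsreg)              # estimated coefficients
residuals(olsreg)         # y .- X * coef(olsreg)
predict(olsreg)           # fitted values

w = [1.0, 1.0, 0.5, 0.5]
wlsreg = wls(X, y, w)     # weighted least squares (argument order assumed)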

jbytecode commented 3 years ago

All of the methods now have (X, y)-type dispatch, so I am closing this issue. @PaulSoderlind, your other contributions are always welcome; thank you for this feature request.

PaulSoderlind commented 3 years ago

thanks