jbytecode / LinRegOutliers

Direct and robust methods for outlier detection in linear regression
MIT License

instructions to package contributors #3

Closed jbytecode closed 12 months ago

jbytecode commented 3 years ago

@fmyilmaz seems to be a newcomer, welcome. We can discuss the package development details here.

jbytecode commented 3 years ago

Hi friends @angeris @sbarratt @SaranjeetKaur @fmyilmaz

The main structure, used in almost every method, is RegressionSetting. It is defined as

struct RegressionSetting
    formula::FormulaTerm
    data::DataFrame
end

and here is a sample use with the Phones data (from Rousseeuw's book).

julia> setting = RegressionSetting(@formula(calls ~ year), phones)
RegressionSetting(calls ~ year, 24×2 DataFrame
│ Row │ year  │ calls   │
│     │ Int64 │ Float64 │
├─────┼───────┼─────────┤
│ 1   │ 50    │ 4.4     │
│ 2   │ 51    │ 4.7     │
│ 3   │ 52    │ 4.7     │
│ 4   │ 53    │ 5.9     │
│ 5   │ 54    │ 6.6     │
│ 6   │ 55    │ 7.3     │
│ 7   │ 56    │ 8.1     │
│ 8   │ 57    │ 8.8     │
│ 9   │ 58    │ 10.6    │
│ 10  │ 59    │ 12.0    │
⋮
│ 14  │ 63    │ 21.2    │
│ 15  │ 64    │ 119.0   │
│ 16  │ 65    │ 124.0   │
│ 17  │ 66    │ 142.0   │
│ 18  │ 67    │ 159.0   │
│ 19  │ 68    │ 182.0   │
│ 20  │ 69    │ 212.0   │
│ 21  │ 70    │ 43.0    │
│ 22  │ 71    │ 24.0    │
│ 23  │ 72    │ 27.0    │
│ 24  │ 73    │ 29.0    │)

The formula object is imported from GLM and the data is a DataFrame. Since the model is

Calls = intercept + slope * Year + epsilon

the formula is defined as

@formula(calls ~ year)

where Calls is the dependent variable, Year is the independent variable, intercept and slope are the regression parameters, and epsilon is the error term with zero mean and constant variance.

The design matrix is the matrix of independent variables including the constant term corresponding to the Intercept parameter:

julia> setting = createRegressionSetting(@formula(calls ~ year), phones);
julia> designMatrix(setting)
24×2 Array{Float64,2}:
 1.0  50.0
 1.0  51.0
 1.0  52.0
 1.0  53.0
 1.0  54.0
 1.0  55.0
 1.0  56.0
 1.0  57.0
 1.0  58.0
 1.0  59.0
 1.0  60.0
 1.0  61.0
 1.0  62.0
 1.0  63.0
 1.0  64.0
 1.0  65.0
 1.0  66.0
 1.0  67.0
 1.0  68.0
 1.0  69.0
 1.0  70.0
 1.0  71.0
 1.0  72.0
 1.0  73.0

and responseVector() extracts the dependent variable from a regression setting:

julia> setting = createRegressionSetting(@formula(calls ~ year), phones);
julia> responseVector(setting)
24-element Array{Float64,1}:
   4.4
   4.7
   4.7
   5.9
   6.6
   7.3
   8.1
   8.8
  10.6
  12.0
  13.5
  14.9
  16.1
  21.2
 119.0
 124.0
 142.0
 159.0
 182.0
 212.0
  43.0
  24.0
  27.0
  29.0

In the package, n and p denote the number of observations and the number of regression parameters, respectively.
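
As a rough illustration of these quantities, here is a minimal sketch that uses plain arrays instead of RegressionSetting, with the phones values copied from the output above. The variable names (X, y, betas, residuals) are only illustrative, and the ordinary least squares fit with the backslash operator is shown purely as a baseline:

```julia
using LinearAlgebra

# Phones data, copied from the output above.
year  = collect(50.0:73.0)
calls = [4.4, 4.7, 4.7, 5.9, 6.6, 7.3, 8.1, 8.8, 10.6, 12.0, 13.5, 14.9,
         16.1, 21.2, 119.0, 124.0, 142.0, 159.0, 182.0, 212.0, 43.0, 24.0,
         27.0, 29.0]

# Design matrix: a column of ones for the intercept, then the regressor.
X = hcat(ones(length(year)), year)
y = calls

# n observations, p regression parameters.
n, p = size(X)

# Ordinary least squares via the backslash operator (baseline only).
betas = X \ y
residuals = y .- X * betas
```

Here n is 24 and p is 2, matching the 24×2 design matrix shown earlier.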

The package uses three types of optimizers. One is NelderMead() from the Optim package, which optimizes the L1-type objective of the LAD (Least Absolute Deviations) regression estimator. The others are Compact Genetic Algorithms (/src/cga.jl) and Floating-point Genetic Algorithms (/src/ga.jl). These evolutionary optimizers are used in Satman's (2012) algorithms for LTS (Least Trimmed Squares) estimation.
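
To make the two objectives concrete, here is a minimal sketch, not the package's implementation. The helper names lad_objective and lts_objective are hypothetical, and the trimming size h = floor((n + p + 1) / 2) is a commonly used default that I am assuming here for illustration:

```julia
using LinearAlgebra

# Phones data, copied from the output above.
year  = collect(50.0:73.0)
calls = [4.4, 4.7, 4.7, 5.9, 6.6, 7.3, 8.1, 8.8, 10.6, 12.0, 13.5, 14.9,
         16.1, 21.2, 119.0, 124.0, 142.0, 159.0, 182.0, 212.0, 43.0, 24.0,
         27.0, 29.0]
X = hcat(ones(length(year)), year)
y = calls
n, p = size(X)

# LAD: sum of absolute residuals (the L1-type objective).
lad_objective(betas, X, y) = sum(abs.(y .- X * betas))

# LTS: sum of the h smallest squared residuals.
function lts_objective(betas, X, y, h)
    r2 = sort((y .- X * betas) .^ 2)
    return sum(r2[1:h])
end

# Assumed trimming size (hypothetical default for this sketch).
h = div(n + p + 1, 2)

# Evaluate both objectives at the least squares solution as a reference point.
betas = X \ y
lad_value = lad_objective(betas, X, y)
lts_value = lts_objective(betas, X, y, h)
```

A closure such as b -> lad_objective(b, X, y) is the kind of function that can be passed to Optim's optimize together with NelderMead(). The LTS objective is not smooth in the same way, which is one reason evolutionary optimizers are an option for it.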

The Hadi (1992) and MVE (Minimum Volume Ellipsoid) algorithms differ in that they are used for detecting outliers in multivariate data rather than in regression models. However, bad leverage points can be regarded as multivariate outliers in the design space, and outlier detection procedures for multivariate data are sometimes used as tools for detecting outliers in regression. The MVE & LTS plot is a specific example of this. Billor & Chatterjee & Hadi (2006) (/src/bch.jl) is another example, and the dataimage() method is a third. So, to the point, it is important to implement outlier detection methods for multivariate data as well as methods developed directly for linear regression.
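
To show what a multivariate distance looks like in practice, here is a minimal sketch of classical squared Mahalanobis distances on the phones data treated as bivariate points (year, calls). This is the plain, non-robust version based on the sample mean and covariance; it is not the Hadi or MVE algorithm, which rely on robust estimates of location and scatter instead. All variable names are illustrative:

```julia
using Statistics, LinearAlgebra

# Phones data, copied from the output above, as a 24×2 data matrix.
year  = collect(50.0:73.0)
calls = [4.4, 4.7, 4.7, 5.9, 6.6, 7.3, 8.1, 8.8, 10.6, 12.0, 13.5, 14.9,
         16.1, 21.2, 119.0, 124.0, 142.0, 159.0, 182.0, 212.0, 43.0, 24.0,
         27.0, 29.0]
Z = hcat(year, calls)

# Classical (non-robust) location and scatter estimates.
mu   = vec(mean(Z, dims = 1))
S    = cov(Z)
Sinv = inv(S)

# Squared Mahalanobis distance of each row from the sample mean.
d2 = [(z .- mu)' * Sinv * (z .- mu) for z in eachrow(Z)]

# Observation indices ordered from most to least distant.
order = sortperm(d2, rev = true)
```

Because the mean and covariance are themselves affected by outliers, these classical distances can mask the very observations they are meant to reveal, which is the motivation for robust alternatives.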

Please do not hesitate to ask about any other implementation details.

See you soon.

RoyiAvital commented 3 years ago

What about the RANSAC method?

jbytecode commented 3 years ago

@RoyiAvital RANSAC and its derivatives are also welcome. If you want to handle this contribution, we can open a new issue for this.

jbytecode commented 3 years ago

In the project repo, there is a .dev folder, which I use for testing new methods during development, because the testing pattern

julia> using Pkg
julia> Pkg.activate(".")
julia> Pkg.test()

takes too much time. However, it is very important to update the package's test folder after implementing or refactoring code.

fyi.

jbytecode commented 3 years ago

Versioning: https://github.com/jbytecode/LinRegOutliers/issues/16

jbytecode commented 3 years ago

What about the RANSAC method?

RANSAC is implemented and ready to use.

jbytecode commented 3 years ago

Hi dear friends,

We also have a #linregoutliers channel on the Julia Slack. Any details can be discussed there.

@tantei3 @angeris @fmyilmaz @akadal