Closed jbytecode closed 12 months ago
Hi friends @angeris @sbarratt @SaranjeetKaur @fmyilmaz
The main structure used in almost all of the methods is RegressionSetting, which is defined as
struct RegressionSetting
    formula::FormulaTerm
    data::DataFrame
end
Here is a sample use with the Phones data (from Rousseeuw's book):
julia> setting = RegressionSetting(@formula(calls ~ year), phones)
RegressionSetting(calls ~ year, 24×2 DataFrame
│ Row │ year │ calls │
│ │ Int64 │ Float64 │
├─────┼───────┼─────────┤
│ 1 │ 50 │ 4.4 │
│ 2 │ 51 │ 4.7 │
│ 3 │ 52 │ 4.7 │
│ 4 │ 53 │ 5.9 │
│ 5 │ 54 │ 6.6 │
│ 6 │ 55 │ 7.3 │
│ 7 │ 56 │ 8.1 │
│ 8 │ 57 │ 8.8 │
│ 9 │ 58 │ 10.6 │
│ 10 │ 59 │ 12.0 │
⋮
│ 14 │ 63 │ 21.2 │
│ 15 │ 64 │ 119.0 │
│ 16 │ 65 │ 124.0 │
│ 17 │ 66 │ 142.0 │
│ 18 │ 67 │ 159.0 │
│ 19 │ 68 │ 182.0 │
│ 20 │ 69 │ 212.0 │
│ 21 │ 70 │ 43.0 │
│ 22 │ 71 │ 24.0 │
│ 23 │ 72 │ 27.0 │
│ 24 │ 73 │ 29.0 │)
The formula object is imported from GLM and the data is a DataFrame. Since the model is
Calls = intercept + slope * Year + epsilon
the formula is defined as
@formula(calls ~ year)
where Calls is the dependent variable, Year is the independent variable, intercept and slope are regression parameters and epsilon is the error term with zero mean and constant variance.
The design matrix is the matrix of independent variables including the constant term corresponding to the Intercept parameter:
julia> setting = createRegressionSetting(@formula(calls ~ year), phones);
julia> designMatrix(setting)
24×2 Array{Float64,2}:
1.0 50.0
1.0 51.0
1.0 52.0
1.0 53.0
1.0 54.0
1.0 55.0
1.0 56.0
1.0 57.0
1.0 58.0
1.0 59.0
1.0 60.0
1.0 61.0
1.0 62.0
1.0 63.0
1.0 64.0
1.0 65.0
1.0 66.0
1.0 67.0
1.0 68.0
1.0 69.0
1.0 70.0
1.0 71.0
1.0 72.0
1.0 73.0
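For readers who want to see what this matrix contains without going through the package, here is a minimal sketch in plain Julia. It assumes only the 24 year values shown in the output above; the variable names `year` and `X` are chosen for illustration and are not part of the package.

```julia
# Hand-built equivalent of the design matrix for calls ~ year.
# `year` reproduces the values 50..73 shown in the output above.
year = collect(50.0:73.0)

# Column 1 is the constant term (intercept), column 2 is the regressor.
X = hcat(ones(length(year)), year)

println(size(X))   # (24, 2)
```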
and responseVector() extracts the dependent variable from a regression setting:
julia> setting = createRegressionSetting(@formula(calls ~ year), phones);
julia> responseVector(setting)
24-element Array{Float64,1}:
4.4
4.7
4.7
5.9
6.6
7.3
8.1
8.8
10.6
12.0
13.5
14.9
16.1
21.2
119.0
124.0
142.0
159.0
182.0
212.0
43.0
24.0
27.0
29.0
In the package, n and p denote the number of observations and the number of regression parameters, respectively.
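As a rough illustration of this notation, the sketch below rebuilds the design matrix and response vector by hand from the values printed above, reads n and p off the matrix dimensions, and computes an ordinary least squares fit with the backslash operator. This is plain Julia and does not call the package; the names `X`, `y`, `betas` and `residuals` are illustrative only.

```julia
# Phones data as printed above (year 50..73 and the 24 call counts).
year = collect(50.0:73.0)
calls = [4.4, 4.7, 4.7, 5.9, 6.6, 7.3, 8.1, 8.8, 10.6, 12.0, 13.5, 14.9,
         16.1, 21.2, 119.0, 124.0, 142.0, 159.0, 182.0, 212.0, 43.0, 24.0,
         27.0, 29.0]

X = hcat(ones(length(year)), year)   # design matrix with intercept column
y = calls                            # response vector

n, p = size(X)                       # n = 24 observations, p = 2 parameters

# Ordinary least squares fit; for a tall matrix, `\` solves the
# least squares problem.
betas = X \ y
residuals = y - X * betas
```

The OLS fit is shown only as a non-robust baseline; the robust estimators discussed in this thread exist because a fit like this one is sensitive to outlying observations.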
In the package, three types of optimizers are used. The first is NelderMead() from the Optim package, which optimizes the L1 objective of the LAD (Least Absolute Deviations) regression estimator. The other two are Compact Genetic Algorithms (/src/cga.jl) and Floating-point Genetic Algorithms (/src/ga.jl). These evolutionary optimizers are used in Satman's (2012) algorithms for LTS (Least Trimmed Squares) estimation.
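To make the two objectives concrete, here is a small sketch of the functions such optimizers would minimize. The function names `lad_objective` and `lts_objective` and the explicit argument `h` (the number of observations kept in the trimmed sum) are assumptions made for this example; they are not the package's internal API, and no optimizer is called here.

```julia
# L1 objective of LAD: sum of absolute residuals.
lad_objective(betas, X, y) = sum(abs.(y .- X * betas))

# LTS objective: sum of the h smallest squared residuals.
function lts_objective(betas, X, y, h)
    squared = sort((y .- X * betas) .^ 2)
    return sum(squared[1:h])
end

# Tiny example: three points on y = 2x and one gross outlier.
X = [1.0 1.0; 1.0 2.0; 1.0 3.0; 1.0 4.0]
y = [2.0, 4.0, 6.0, 100.0]
betas = [0.0, 2.0]                 # candidate line y = 2x

lad_objective(betas, X, y)         # 92.0, the outlier contributes fully
lts_objective(betas, X, y, 3)      # 0.0, the outlier is trimmed away
```

The example shows why trimming matters: the same candidate line is penalized heavily by the L1 objective but not at all by the trimmed objective once the largest residual is excluded.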
The Hadi (1992) and MVE (Minimum Volume Ellipsoid) algorithms differ in that they are used for detecting outliers in multivariate data rather than in regression models. However, bad leverage points can be regarded as multivariate outliers in the design space, and outlier detection procedures for multivariate data are sometimes used as tools for detecting outliers in regression. The MVE & LTS plot is a particular example of this. Billor & Chatterjee & Hadi (2006) (/src/bch.jl) is another example, and the dataimage() method is a third. So, to the point, it is important to implement outlier detection methods for multivariate data as well as methods developed directly for linear regression.
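As background for the multivariate case, the sketch below computes classical squared Mahalanobis distances from the sample mean and sample covariance. This is not the Hadi (1992) or MVE algorithm; those replace the mean and covariance with robust estimates. The function name `mahalanobis_squared` and the toy data are made up for illustration.

```julia
using Statistics, LinearAlgebra

# Rows are observations, columns are variables.
function mahalanobis_squared(data::AbstractMatrix)
    center = mean(data, dims=1)        # 1×p row of column means
    S = cov(data)                      # p×p sample covariance
    centered = data .- center
    # d_i^2 = (x_i - mean)' * inv(S) * (x_i - mean), row by row
    return [begin
                v = collect(row)
                dot(v, S \ v)
            end for row in eachrow(centered)]
end

# Five points on the line y = x and one point far away from it.
data = [1.0 1.0; 2.0 2.0; 3.0 3.0; 4.0 4.0; 5.0 5.0; 6.0 -6.0]
d2 = mahalanobis_squared(data)
suspect = argmax(d2)                   # index of the most distant observation
```

Because the mean and covariance are themselves affected by outliers, these classical distances can mask them, which is the motivation for robust alternatives such as those mentioned above.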
Please do not hesitate to ask about any other implementation details.
See you soon.
What about the RANSAC method?
@RoyiAvital RANSAC and its derivatives are also welcome. If you want to handle this contribution, we can open a new issue for this.
In the project repo, there is a .dev folder, which I use for testing new methods during development, because the testing pattern
> using Pkg
> Pkg.activate(".")
> Pkg.test()
takes too much time. However, it is very important to update the package's test folder after implementing or refactoring code.
fyi.
What about the RANSAC method?
Ransac is implemented and ready to use
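For anyone new to the idea, here is a minimal sketch of RANSAC for a straight-line model in plain Julia. It is not the package's implementation; the function names `fit_line` and `ransac_line`, the keyword arguments, and their default values are assumptions made for this example.

```julia
using Random

# Least squares fit of y = a + b*x on the points selected by `idx`.
fit_line(x, y, idx) = hcat(ones(length(idx)), x[idx]) \ y[idx]

function ransac_line(x, y; iterations=200, threshold=1.0,
                     rng=MersenneTwister(1))
    n = length(x)
    best_inliers = falses(n)
    for _ in 1:iterations
        sample = randperm(rng, n)[1:2]       # minimal sample: two points
        a, b = fit_line(x, y, sample)
        # Points whose absolute residual is within the threshold.
        inliers = abs.(y .- (a .+ b .* x)) .<= threshold
        if count(inliers) > count(best_inliers)
            best_inliers = inliers
        end
    end
    # Refit on the largest consensus set found.
    return fit_line(x, y, findall(best_inliers)), best_inliers
end

# Synthetic data: y = 2 + 3x with five gross outliers.
x = collect(1.0:20.0)
y = 2.0 .+ 3.0 .* x
y[[3, 7, 11, 15, 19]] .+= 100.0

betas, inliers = ransac_line(x, y)
```

The threshold and the number of iterations are problem dependent; the values used here only suit this synthetic example.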
Hi dear friends,
We also have a #linregoutliers channel on the Julia Slack. Any details can be discussed there.
@tantei3 @angeris @fmyilmaz @akadal
@fmyilmaz seems to be a newcomer. Welcome! We can discuss the package development details here.