mateuszbaran / CovarianceEstimation.jl

Lightweight robust covariance estimation in Julia
MIT License

Dealing with missing values #52

Open tlienart opened 5 years ago

tlienart commented 5 years ago

Probably one for a future point release:

```julia
julia> X = AbstractArray{Union{Float64, Missing}, 2}(randn(5, 7))
julia> X[1, 2] = missing
julia> X[3, 5] = missing
julia> cov(X)
7×7 Array{Union{Missing, Float64},2}:
  0.323781   missing  -0.235777   0.0266937  missing   0.460899   0.345166
   missing   missing    missing    missing   missing    missing    missing
 -0.235777   missing   1.44032   -1.2644     missing   0.39682   -0.442537
  0.0266937  missing  -1.2644     1.69334    missing  -0.367602  -0.374397
   missing   missing    missing    missing   missing    missing    missing
  0.460899   missing   0.39682   -0.367602   missing   1.74075    0.614322
  0.345166   missing  -0.442537  -0.374397   missing   0.614322   2.00857
```

I don't think that's ideal (the behaviour is the same with both Statistics and StatsBase). See also the covrob R package, where a function to filter out missing values can be supplied.

It would seem pretty easy to at least implement something simple, such as dropping the observations that contain missing values (see the sketch below).

And then we could maybe suggest imputing missing values, e.g. via Impute.jl.
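For instance, a minimal complete-case sketch (the name cov_complete_cases is made up; nothing like it exists in the package yet) that simply drops every observation containing a missing value before calling cov:

```julia
using Statistics

# Complete-case (listwise deletion) covariance: drop every row of X that
# contains at least one missing value, then narrow the element type so that
# cov returns a plain Float64 matrix.
function cov_complete_cases(X::AbstractMatrix)
    keep = [!any(ismissing, view(X, i, :)) for i in axes(X, 1)]
    Xc = identity.(X[keep, :])   # drops Missing from the element type
    return cov(Xc)
end
```

On the X above this keeps only the three fully observed rows, which is wasteful but at least returns a Float64 matrix rather than one full of missing.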


mateuszbaran commented 5 years ago

There are also algorithms designed specifically to deal with missing data, for example: https://arxiv.org/pdf/1201.2577.pdf .

tlienart commented 5 years ago

OK, so that's a Lasso-type problem on a slightly modified observed covariance (eq. (1.5) of that paper). I guess it can be added once we've added a (Graphical) Lasso estimator for the covariance.
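For reference, if I read the paper's setting correctly (each entry observed independently with probability δ and unobserved entries replaced by zero), the bias-corrected covariance fed into that Lasso problem could be formed roughly as below. This is only a hedged sketch from my reading: the name debiased_cov is made up and the exact rescaling should be checked against eq. (1.5) before implementing anything.

```julia
using LinearAlgebra

# Hedged sketch: bias-corrected second-moment matrix under uniform Bernoulli
# missingness with observation probability δ, assuming the columns of X are
# already centred. Off-diagonal entries of the zero-filled covariance are
# rescaled by 1/δ^2 and diagonal entries by 1/δ.
function debiased_cov(X::AbstractMatrix, δ::Real)
    Y = coalesce.(X, 0.0)         # zero-fill the missing entries
    S = (Y' * Y) / size(Y, 1)     # naive second-moment matrix of the zero-filled data
    return S / δ^2 + (1/δ - 1/δ^2) * Diagonal(diag(S))
end
```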

rumela commented 4 years ago

Consider exporting a shrinkage method that relies on the matrix S but not on the underlying matrix of samples X (I note that analytical_nonlinear_shrinkage appears to use only S, and not X). The motivation is that stock data typically has missing samples, so a full matrix X cannot be constructed. Instead, pairwise covariances can be calculated to form the elements of a matrix T (though T is not guaranteed to be positive semidefinite, since its elements are computed from different subsets of the rows).
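To make the pairwise idea concrete, a minimal sketch (the name pairwise_cov is made up; nothing like it exists in the package): entry (j, k) is computed from the rows where both column j and column k are observed.

```julia
using Statistics

# Pairwise-complete covariance: each entry uses only the rows where both of
# its columns are observed. The result is symmetric but, as noted above, not
# guaranteed to be positive semidefinite.
function pairwise_cov(X::AbstractMatrix)
    p = size(X, 2)
    C = zeros(p, p)
    for j in 1:p, k in j:p
        rows = [i for i in axes(X, 1) if !ismissing(X[i, j]) && !ismissing(X[i, k])]
        # fewer than two complete pairs for (j, k) yields NaN for this entry
        C[j, k] = C[k, j] = cov(identity.(X[rows, j]), identity.(X[rows, k]))
    end
    return C
end
```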

Then consider adding the nearest correlation matrix method described here: https://nhigham.com/2013/02/13/the-nearest-correlation-matrix/ (sample code in MATLAB/R/Python already exists). With it, T can be "converted" into a positive semidefinite matrix S that can then be fed into analytical_nonlinear_shrinkage.
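For completeness, a compact sketch of the alternating-projections algorithm from that post (Higham also describes a faster Newton-based method; nothing below exists in the package). To apply it to the covariance T one would rescale T to a correlation matrix, project, and rescale back.

```julia
using LinearAlgebra

# Alternating projections with Dykstra's correction for the nearest
# correlation matrix in the Frobenius norm: alternate between projecting onto
# the cone of positive semidefinite matrices and onto the set of symmetric
# matrices with unit diagonal.
function nearest_correlation(A::AbstractMatrix; tol=1e-8, maxiter=200)
    Y = Matrix(Symmetric(A))
    ΔS = zeros(size(Y))
    X = copy(Y)
    for _ in 1:maxiter
        R = Y - ΔS
        λ, Q = eigen(Symmetric(R))
        X = Q * Diagonal(max.(λ, 0)) * Q'   # project onto the PSD cone
        ΔS = X - R
        Y = copy(X)
        Y[diagind(Y)] .= 1                  # project onto unit-diagonal matrices
        norm(Y - X) <= tol * norm(Y) && break
    end
    return Symmetric(Y)
end
```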

mateuszbaran commented 4 years ago

This looks like a good approach; I could review and merge a pull request that adds it. I don't personally need this functionality at the moment, so I'm not going to work on it myself.