mateuszbaran / CovarianceEstimation.jl

Lightweight robust covariance estimation in Julia
MIT License

Dealing with missing values #52

Open tlienart opened 5 years ago

tlienart commented 5 years ago

Probably one for a future point release:

```julia
julia> X = AbstractArray{Union{Float64, Missing}, 2}(randn(5, 7))
julia> X[1, 2] = missing
julia> X[3, 5] = missing
julia> cov(X)
7×7 Array{Union{Missing, Float64},2}:
  0.323781   missing  -0.235777   0.0266937  missing   0.460899   0.345166
   missing   missing    missing    missing   missing    missing    missing
 -0.235777   missing   1.44032   -1.2644     missing   0.39682   -0.442537
  0.0266937  missing  -1.2644     1.69334    missing  -0.367602  -0.374397
   missing   missing    missing    missing   missing    missing    missing
  0.460899   missing   0.39682   -0.367602   missing   1.74075    0.614322
  0.345166   missing  -0.442537  -0.374397   missing   0.614322   2.00857
```

I don't think that's ideal (the behaviour is the same with both Statistics and StatsBase). See also the covrob R package, where a function to filter out missing values can be supplied.

It would seem pretty easy to at least implement something simple, such as dropping the observations that contain missing values (see the sketch below).

And then we could maybe suggest imputing missing values, e.g. via Impute.jl.
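For instance, a minimal complete-case sketch (the name cov_complete_cases is made up; nothing like it exists in the package yet) that simply drops every observation containing a missing value before calling cov:

```julia
using Statistics

# Complete-case (listwise deletion) covariance: drop every row of X that
# contains at least one missing value, then narrow the element type so that
# cov returns a plain Float64 matrix.
function cov_complete_cases(X::AbstractMatrix)
    keep = [!any(ismissing, view(X, i, :)) for i in axes(X, 1)]
    Xc = identity.(X[keep, :])   # drops Missing from the element type
    return cov(Xc)
end
```

On the X above this keeps only the three fully observed rows, which is wasteful but at least returns a Float64 matrix rather than one full of missing.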


mateuszbaran commented 5 years ago

There are also algorithms designed specifically to deal with missing data, for example: https://arxiv.org/pdf/1201.2577.pdf .

tlienart commented 5 years ago

OK, so that's a Lasso-type problem on a slightly modified observed covariance (eq. (1.5) of that paper). I guess it can be added once we've added a (Graphical) Lasso estimator for the covariance.
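For reference, if I read the paper's setting correctly (each entry observed independently with probability δ and unobserved entries replaced by zero), the bias-corrected covariance fed into that Lasso problem could be formed roughly as below. This is only a hedged sketch from my reading: the name debiased_cov is made up and the exact rescaling should be checked against eq. (1.5) before implementing anything.

```julia
using LinearAlgebra

# Hedged sketch: bias-corrected second-moment matrix under uniform Bernoulli
# missingness with observation probability δ, assuming the columns of X are
# already centred. Off-diagonal entries of the zero-filled covariance are
# rescaled by 1/δ^2 and diagonal entries by 1/δ.
function debiased_cov(X::AbstractMatrix, δ::Real)
    Y = coalesce.(X, 0.0)         # zero-fill the missing entries
    S = (Y' * Y) / size(Y, 1)     # naive second-moment matrix of the zero-filled data
    return S / δ^2 + (1/δ - 1/δ^2) * Diagonal(diag(S))
end
```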

rumela commented 4 years ago

Consider exporting a shrinkage method that relies on the matrix S but not on the underlying matrix of samples X (I note that analytical_nonlinear_shrinkage appears to use only S, and not X). The motivation is that stock data typically has missing samples, so a full matrix X cannot be constructed. Instead, pairwise covariances can be calculated to form the elements of a matrix T (though T is not guaranteed to be positive semidefinite, since its elements are computed from different subsets of the rows).
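To make the pairwise idea concrete, a minimal sketch (the name pairwise_cov is made up; nothing like it exists in the package): entry (j, k) is computed from the rows where both column j and column k are observed.

```julia
using Statistics

# Pairwise-complete covariance: each entry uses only the rows where both of
# its columns are observed. The result is symmetric but, as noted above, not
# guaranteed to be positive semidefinite.
function pairwise_cov(X::AbstractMatrix)
    p = size(X, 2)
    C = zeros(p, p)
    for j in 1:p, k in j:p
        rows = [i for i in axes(X, 1) if !ismissing(X[i, j]) && !ismissing(X[i, k])]
        # fewer than two complete pairs for (j, k) yields NaN for this entry
        C[j, k] = C[k, j] = cov(identity.(X[rows, j]), identity.(X[rows, k]))
    end
    return C
end
```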

Then consider adding the nearest correlation matrix method described here: https://nhigham.com/2013/02/13/the-nearest-correlation-matrix/ (sample code in MATLAB/R/Python already exists). With it, T can be "converted" into a positive semidefinite matrix S that can then be fed into analytical_nonlinear_shrinkage.
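For completeness, a compact sketch of the alternating-projections algorithm from that post (Higham also describes a faster Newton-based method; nothing below exists in the package). To apply it to the covariance T one would rescale T to a correlation matrix, project, and rescale back.

```julia
using LinearAlgebra

# Alternating projections with Dykstra's correction for the nearest
# correlation matrix in the Frobenius norm: alternate between projecting onto
# the cone of positive semidefinite matrices and onto the set of symmetric
# matrices with unit diagonal.
function nearest_correlation(A::AbstractMatrix; tol=1e-8, maxiter=200)
    Y = Matrix(Symmetric(A))
    ΔS = zeros(size(Y))
    X = copy(Y)
    for _ in 1:maxiter
        R = Y - ΔS
        λ, Q = eigen(Symmetric(R))
        X = Q * Diagonal(max.(λ, 0)) * Q'   # project onto the PSD cone
        ΔS = X - R
        Y = copy(X)
        Y[diagind(Y)] .= 1                  # project onto unit-diagonal matrices
        norm(Y - X) <= tol * norm(Y) && break
    end
    return Symmetric(Y)
end
```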

mateuszbaran commented 4 years ago

This looks like a good approach; I could review and merge a pull request that adds it. I don't personally need this functionality at the moment, so I'm not going to work on it myself.