invenia / Impute.jl

Imputation methods for missing data in Julia



Impute.jl provides various methods for handling missing data in Vectors, Matrices and Tables.


julia> using Pkg; Pkg.add("Impute")


Let's start by loading our dependencies:

julia> using DataFrames, Impute

We'll also want some test data containing missings to work with:

julia> df = Impute.dataset("test/table/neuro") |> DataFrame
469×6 DataFrame
 Row │ V1         V2         V3       V4        V5         V6
     │ Float64?   Float64?   Float64  Float64?  Float64?   Float64?
   1 │ missing       -203.7    -84.1      18.5  missing    missing
   2 │ missing       -203.0    -97.8      25.8      134.7  missing
   3 │ missing       -249.0    -92.1      27.8      177.1  missing
   4 │ missing       -231.5    -97.5      27.0      150.3  missing
   5 │ missing    missing     -130.1      25.8      160.0  missing
   6 │ missing       -223.1    -70.7      62.1      197.5  missing
   7 │ missing       -164.8    -12.2      76.8      202.8  missing
   8 │ missing       -221.6    -81.9      27.5      144.5  missing
  ⋮  │     ⋮          ⋮         ⋮        ⋮          ⋮          ⋮
 463 │    -242.6     -142.0    -21.8      69.8      148.7  missing
 464 │    -235.9     -128.8    -33.1      68.8      177.1  missing
 465 │ missing       -140.8    -38.7      58.1      186.3  missing
 466 │ missing       -149.5    -40.3      62.8      139.7      242.5
 467 │    -247.6     -157.8    -53.3      28.3      122.9      227.6
 468 │ missing       -154.9    -50.8      28.1      119.9      201.1
 469 │ missing       -180.7    -70.9      33.7      114.8      222.5
                                                     454 rows omitted

Our first instinct might be to drop every observation that contains a missing value, but this leaves us with only 4 of the original 469 rows:

julia> Impute.filter(df; dims=:rows)
4×6 DataFrame
 Row │ V1       V2       V3       V4       V5       V6
     │ Float64  Float64  Float64  Float64  Float64  Float64
   1 │  -247.0   -132.2    -18.8     28.2     81.4    237.9
   2 │  -234.0   -140.8    -56.5     28.0    114.3    222.9
   3 │  -215.8   -114.8    -18.4     65.3    171.6    249.7
   4 │  -247.6   -157.8    -53.3     28.3    122.9    227.6
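
To make the row-wise filtering concrete, here is a minimal plain-Julia sketch of the same idea on a small matrix. It does not use Impute.jl; the helper name `complete_rows` and the toy values are assumptions made up for illustration, not part of the package or the neuro dataset.

```julia
# Toy matrix with missings; values are illustrative only.
M = [1.0      2.0      3.0;
     missing  5.0      6.0;
     7.0      missing  9.0;
     10.0     11.0     12.0]

# Hypothetical helper: a row is "complete" when none of its entries is missing.
complete_rows(A) = [all(!ismissing, row) for row in eachrow(A)]

keep = complete_rows(M)   # Bool mask, one entry per row
filtered = M[keep, :]     # keep only the complete rows

println(count(keep), " of ", size(M, 1), " rows survive")
```

A single missing entry is enough to discard a whole row, which is why a table with several sparse columns can shrink so sharply.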

We could try imputing the values with linear interpolation, but that still leaves missing data at the head and tail of our dataset:

julia> Impute.interp(df)
469×6 DataFrame
 Row │ V1           V2         V3       V4        V5         V6
     │ Float64?     Float64?   Float64  Float64?  Float64?   Float64?
   1 │ missing        -203.7     -84.1      18.5  missing    missing
   2 │ missing        -203.0     -97.8      25.8      134.7  missing
   3 │ missing        -249.0     -92.1      27.8      177.1  missing
   4 │ missing        -231.5     -97.5      27.0      150.3  missing
   5 │ missing        -227.3    -130.1      25.8      160.0  missing
   6 │ missing        -223.1     -70.7      62.1      197.5  missing
   7 │ missing        -164.8     -12.2      76.8      202.8  missing
   8 │ missing        -221.6     -81.9      27.5      144.5  missing
  ⋮  │      ⋮           ⋮         ⋮        ⋮          ⋮           ⋮
 463 │    -242.6      -142.0     -21.8      69.8      148.7      224.125
 464 │    -235.9      -128.8     -33.1      68.8      177.1      230.25
 465 │    -239.8      -140.8     -38.7      58.1      186.3      236.375
 466 │    -243.7      -149.5     -40.3      62.8      139.7      242.5
 467 │    -247.6      -157.8     -53.3      28.3      122.9      227.6
 468 │ missing        -154.9     -50.8      28.1      119.9      201.1
 469 │ missing        -180.7     -70.9      33.7      114.8      222.5
                                                         454 rows omitted
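
The behaviour above can be illustrated with a plain-Julia sketch of linear interpolation on a single vector. This is an assumption about the general technique, not Impute.jl's actual implementation, and `interp_sketch` is a hypothetical name. A gap is filled only when it has an observed value on both sides, so leading and trailing missings are left alone.

```julia
# Linear interpolation over interior gaps only (illustrative sketch).
function interp_sketch(v::AbstractVector)
    out = collect(Union{Missing, Float64}, v)   # work on a copy
    observed = findall(!ismissing, out)         # positions that hold data
    # Walk over consecutive pairs of observed positions and fill between them.
    for (lo, hi) in zip(observed[1:end-1], observed[2:end])
        slope = (out[hi] - out[lo]) / (hi - lo)
        for i in lo+1:hi-1
            out[i] = out[lo] + slope * (i - lo)
        end
    end
    return out
end

v = [missing, 1.0, missing, missing, 4.0, missing]
filled = interp_sketch(v)
# The interior gap becomes 2.0 and 3.0; the first and last entries stay missing.
```

This is consistent with the table above: in V2, row 5 becomes -227.3, the midpoint of its neighbours -231.5 and -223.1, while the leading missings in V1 and V6 have no earlier observation to interpolate from.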

Finally, we can chain multiple simple methods together to give a complete dataset:

julia> Impute.interp(df) |> Impute.locf() |> Impute.nocb()
469×6 DataFrame
 Row │ V1        V2         V3       V4        V5        V6
     │ Float64?  Float64?   Float64  Float64?  Float64?  Float64?
   1 │ -233.6      -203.7     -84.1      18.5     134.7   222.7
   2 │ -233.6      -203.0     -97.8      25.8     134.7   222.7
   3 │ -233.6      -249.0     -92.1      27.8     177.1   222.7
   4 │ -233.6      -231.5     -97.5      27.0     150.3   222.7
   5 │ -233.6      -227.3    -130.1      25.8     160.0   222.7
   6 │ -233.6      -223.1     -70.7      62.1     197.5   222.7
   7 │ -233.6      -164.8     -12.2      76.8     202.8   222.7
   8 │ -233.6      -221.6     -81.9      27.5     144.5   222.7
  ⋮  │    ⋮          ⋮         ⋮        ⋮         ⋮         ⋮
 463 │ -242.6      -142.0     -21.8      69.8     148.7   224.125
 464 │ -235.9      -128.8     -33.1      68.8     177.1   230.25
 465 │ -239.8      -140.8     -38.7      58.1     186.3   236.375
 466 │ -243.7      -149.5     -40.3      62.8     139.7   242.5
 467 │ -247.6      -157.8     -53.3      28.3     122.9   227.6
 468 │ -247.6      -154.9     -50.8      28.1     119.9   201.1
 469 │ -247.6      -180.7     -70.9      33.7     114.8   222.5
                                                  454 rows omitted
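
For intuition about why the chain completes the dataset, the following plain-Julia sketch shows what last observation carried forward (LOCF) and next observation carried backward (NOCB) do to a single vector. The functions `locf_sketch` and `nocb_sketch` are hypothetical stand-ins written for this example and are not Impute.jl's implementation.

```julia
# Last observation carried forward: each missing takes the most recent value.
function locf_sketch(v::AbstractVector)
    out = collect(Union{Missing, Float64}, v)
    for i in 2:length(out)
        if ismissing(out[i])
            out[i] = out[i-1]   # stays missing if nothing was observed yet
        end
    end
    return out
end

# Next observation carried backward: the mirror image, scanning from the end.
function nocb_sketch(v::AbstractVector)
    out = collect(Union{Missing, Float64}, v)
    for i in length(out)-1:-1:1
        if ismissing(out[i])
            out[i] = out[i+1]   # stays missing if nothing is observed later
        end
    end
    return out
end

v = [missing, 1.0, 2.0, 3.0, 4.0, missing]
step1 = locf_sketch(v)       # trailing gap filled, leading gap remains
step2 = nocb_sketch(step1)   # leading gap filled, no missings left
```

LOCF cannot fill a gap at the head of a series and NOCB cannot fill one at the tail, so applying both after interpolation covers the two cases that interpolation leaves behind. In the output above, this is why rows 468 and 469 of V1 repeat -247.6 and rows 1 to 8 of V6 repeat 222.7.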
