invenia / Impute.jl

Imputation methods for missing data in julia
https://invenia.github.io/Impute.jl/latest/
Other
77 stars 11 forks source link

Switch to Tables.jl API #20

Closed rofinn closed 5 years ago

rofinn commented 5 years ago

Looks like we needed to do some cleanup of the Context type, but otherwise this transition was pretty smooth. It also looks like using the Tables interface has actually improved performance by minimizing data copies. Closes #21 #22 #24 #6

Dataset:

julia> data = dataset("boot", "neuro")
469×6 DataFrame
│ Row │ V1       │ V2       │ V3      │ V4       │ V5       │ V6       │
│     │ Float64⍰ │ Float64⍰ │ Float64 │ Float64⍰ │ Float64⍰ │ Float64⍰ │
├─────┼──────────┼──────────┼─────────┼──────────┼──────────┼──────────┤
│ 1   │ missing  │ -203.7   │ -84.1   │ 18.5     │ missing  │ missing  │
│ 2   │ missing  │ -203.0   │ -97.8   │ 25.8     │ 134.7    │ missing  │
│ 3   │ missing  │ -249.0   │ -92.1   │ 27.8     │ 177.1    │ missing  │
│ 4   │ missing  │ -231.5   │ -97.5   │ 27.0     │ 150.3    │ missing  │
│ 5   │ missing  │ missing  │ -130.1  │ 25.8     │ 160.0    │ missing  │
...
│ 465 │ missing  │ -140.8   │ -38.7   │ 58.1     │ 186.3    │ missing  │
│ 466 │ missing  │ -149.5   │ -40.3   │ 62.8     │ 139.7    │ 242.5    │
│ 467 │ -247.6   │ -157.8   │ -53.3   │ 28.3     │ 122.9    │ 227.6    │
│ 468 │ missing  │ -154.9   │ -50.8   │ 28.1     │ 119.9    │ 201.1    │
│ 469 │ missing  │ -180.7   │ -70.9   │ 33.7     │ 114.8    │ 222.5    │

Original:

julia> result = chain(
           data,
           Impute.Interpolate(),
           Impute.LOCF(),
           Impute.NOCB();
           limit=1.0
       )
┌ Warning: `colwise(f, d::AbstractDataFrame)` is deprecated, use `[f(col) for col = eachcol(d)]` instead.
│   caller = impute!(::Impute.Interpolate, ::Impute.Context, ::DataFrame) at imputors.jl:70
└ @ Impute ~/.playground/share/tmp-impute/depot/packages/Impute/UX99F/src/imputors.jl:70
┌ Warning: `colwise(f, d::AbstractDataFrame)` is deprecated, use `[f(col) for col = eachcol(d)]` instead.
│   caller = impute!(::Impute.LOCF, ::Impute.Context, ::DataFrame) at imputors.jl:70
└ @ Impute ~/.playground/share/tmp-impute/depot/packages/Impute/UX99F/src/imputors.jl:70
┌ Warning: `colwise(f, d::AbstractDataFrame)` is deprecated, use `[f(col) for col = eachcol(d)]` instead.
│   caller = impute!(::Impute.NOCB, ::Impute.Context, ::DataFrame) at imputors.jl:70
└ @ Impute ~/.playground/share/tmp-impute/depot/packages/Impute/UX99F/src/imputors.jl:70
469×6 DataFrame
│ Row │ V1       │ V2       │ V3      │ V4       │ V5       │ V6       │
│     │ Float64⍰ │ Float64⍰ │ Float64 │ Float64⍰ │ Float64⍰ │ Float64⍰ │
├─────┼──────────┼──────────┼─────────┼──────────┼──────────┼──────────┤
│ 1   │ -233.6   │ -203.7   │ -84.1   │ 18.5     │ 134.7    │ 222.7    │
│ 2   │ -233.6   │ -203.0   │ -97.8   │ 25.8     │ 134.7    │ 222.7    │
│ 3   │ -233.6   │ -249.0   │ -92.1   │ 27.8     │ 177.1    │ 222.7    │
│ 4   │ -233.6   │ -231.5   │ -97.5   │ 27.0     │ 150.3    │ 222.7    │
│ 5   │ -233.6   │ -227.3   │ -130.1  │ 25.8     │ 160.0    │ 222.7    │
...
│ 465 │ -239.8   │ -140.8   │ -38.7   │ 58.1     │ 186.3    │ 236.375  │
│ 466 │ -243.7   │ -149.5   │ -40.3   │ 62.8     │ 139.7    │ 242.5    │
│ 467 │ -247.6   │ -157.8   │ -53.3   │ 28.3     │ 122.9    │ 227.6    │
│ 468 │ -247.6   │ -154.9   │ -50.8   │ 28.1     │ 119.9    │ 201.1    │
│ 469 │ -247.6   │ -180.7   │ -70.9   │ 33.7     │ 114.8    │ 222.5    │

julia> @benchmark chain(
           $data,
           Impute.Interpolate(),
           Impute.LOCF(),
           Impute.NOCB();
           limit=1.0
       )
BenchmarkTools.Trial:
  memory estimate:  241.88 KiB
  allocs estimate:  11304
  --------------
  minimum time:     753.368 μs (0.00% GC)
  median time:      767.009 μs (0.00% GC)
  mean time:        815.656 μs (2.97% GC)
  maximum time:     46.291 ms (98.26% GC)
  --------------
  samples:          6114
  evals/sample:     1

New:

julia> result = chain(
           data,
           Impute.Interpolate(),
           Impute.LOCF(),
           Impute.NOCB();
           limit=1.0
       )
469×6 DataFrame
│ Row │ V1       │ V2       │ V3      │ V4       │ V5       │ V6       │
│     │ Float64⍰ │ Float64⍰ │ Float64 │ Float64⍰ │ Float64⍰ │ Float64⍰ │
├─────┼──────────┼──────────┼─────────┼──────────┼──────────┼──────────┤
│ 1   │ -233.6   │ -203.7   │ -84.1   │ 18.5     │ 134.7    │ 222.7    │
│ 2   │ -233.6   │ -203.0   │ -97.8   │ 25.8     │ 134.7    │ 222.7    │
│ 3   │ -233.6   │ -249.0   │ -92.1   │ 27.8     │ 177.1    │ 222.7    │
│ 4   │ -233.6   │ -231.5   │ -97.5   │ 27.0     │ 150.3    │ 222.7    │
│ 5   │ -233.6   │ -227.3   │ -130.1  │ 25.8     │ 160.0    │ 222.7    │
...
│ 465 │ -239.8   │ -140.8   │ -38.7   │ 58.1     │ 186.3    │ 236.375  │
│ 466 │ -243.7   │ -149.5   │ -40.3   │ 62.8     │ 139.7    │ 242.5    │
│ 467 │ -247.6   │ -157.8   │ -53.3   │ 28.3     │ 122.9    │ 227.6    │
│ 468 │ -247.6   │ -154.9   │ -50.8   │ 28.1     │ 119.9    │ 201.1    │
│ 469 │ -247.6   │ -180.7   │ -70.9   │ 33.7     │ 114.8    │ 222.5    │

julia> @benchmark chain(
           $data,
           Impute.Interpolate(),
           Impute.LOCF(),
           Impute.NOCB();
           limit=1.0
       )
BenchmarkTools.Trial:
  memory estimate:  196.75 KiB
  allocs estimate:  10853
  --------------
  minimum time:     307.765 μs (0.00% GC)
  median time:      318.367 μs (0.00% GC)
  mean time:        347.685 μs (4.29% GC)
  maximum time:     45.272 ms (98.96% GC)
  --------------
  samples:          10000
  evals/sample:     1

TODO: Tag new releases of

nalimilan commented 5 years ago

I think we probably should support taking the obsdim as a kwarg. It is what StatsBase is moving towards if i recall discussions with @nalimilan correctly

Unfortunately I don't think we have clearly agreed on the standard keyword argument for this. pairwise in Distances uses dims, like cov and co. in Statistics, but that's not super obvious. So we could use obsdim or vardim, but we need to discuss that somewhere. One issue is that for pairwise the question isn't really where are observations and where are variables, but rather what you want to compute (both could make sense). But maybe that's a special function and in most cases obsdim or vardim is OK.

oxinabox commented 5 years ago

but we have at least settled that where possible it should be a kwarg and not just baked into the function itself. (and mentioned in doc string).

nickrobinson251 commented 5 years ago

fwiw CovarianceEstimation.jl also takes a dims kwarg for specifying observations

rofinn commented 5 years ago

That's true, but the annoying thing is that dims only makes sense when the input is a matrix. I'd also rather not need to pass kwargs to the internal impute functions because our handy functions extract the kwargs into an imputor type internally. If there was some nice pattern for consuming and forwarding kwargs then that might work... even if it is a little inconsistent.

rofinn commented 5 years ago

Based on a recommendation from Curtis I've introduced a vardim kwarg to the Imputor constructors and imputation convenience functions. The vardim field in each Imputor is ignored unless we're operating on matrices.

oxinabox commented 5 years ago

Based on a recommendation from Curtis I've introduced a vardim kwarg to the Imputor constructors and imputation convenience functions. The vardim field in each Imputor is ignored unless we're operating on matrices.

Just for my edification why vardim and not obsdim? I always think in terms of obsdim. (as I said though I don't feel it matters one way or the other as lomg as it is configurable) I guess vardim because most imputation treat each variable as indipendeng? so that is the one that matters? (but on that basis it is also best for each var to stay as a column since that will be most cache friendly)

rofinn commented 5 years ago

I guess vardim because most imputation treat each variable as independent?

More because it's already been used in the stats ecosystem and it reads better to me than obsdim 🤷‍♂ ?

rofinn commented 5 years ago

Alright, I think this is sufficiently better that it warrants another review at some point. Apart from the requested changes from @oxinabox (thanks for review this massive thing!) most of the other changes were just to update the docs to be more consistent and more examples focused.

rofinn commented 5 years ago

API-wise i slightly worry there are too many different ways to do things, and I personally get very confused by |> syntax. I would be very happy with only impute!(::Imputor, data) and will be sad if I need any other syntax.

The main issue with impute!(::Imputor, data) is that it's verbose and the primary reason we're using types is to provide reasonable fallbacks for methods for matrices and tables in the univariate cases. I think it's easy for the |> syntax to be abused, but in this case it's a very explicit way of expressing lazy evaluation of imputation methods. NOTE: You could also write nocb!(locf!(interp(data; context=Context(...)))) ;p

rofinn commented 5 years ago

Since folks seem largely against it I'm just going to remove the matrix orientation deprecation. Any methods using Distributions.jl or MultivariateStats.jl can just use the vardim field to determine if we need to pivot the matrix.

rofinn commented 5 years ago

Okay, I've resolved and applied most of the recommendations. I've commented on things that should be handled in a separate PR and responded if I disagree (e.g., I think |> makes sense here and you can always use explicit function calls if you don't like that).

codecov[bot] commented 5 years ago

Codecov Report

Merging #20 into master will increase coverage by 5.73%. The diff coverage is 98.22%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #20      +/-   ##
==========================================
+ Coverage   92.07%   97.81%   +5.73%     
==========================================
  Files           9       10       +1     
  Lines         101      183      +82     
==========================================
+ Hits           93      179      +86     
+ Misses          8        4       -4
Impacted Files Coverage Δ
src/imputors/nocb.jl 100% <100%> (ø) :arrow_up:
src/imputors/interp.jl 100% <100%> (ø) :arrow_up:
src/Impute.jl 90% <100%> (+23.33%) :arrow_up:
src/imputors/chain.jl 100% <100%> (ø) :arrow_up:
src/deprecated.jl 100% <100%> (ø)
src/context.jl 100% <100%> (+6.25%) :arrow_up:
src/imputors/locf.jl 100% <100%> (ø) :arrow_up:
src/imputors.jl 100% <100%> (ø) :arrow_up:
src/imputors/fill.jl 100% <100%> (ø) :arrow_up:
src/imputors/drop.jl 92.5% <92.5%> (-7.5%) :arrow_down:
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 255ac57...051d6ce. Read the comment docs.

rofinn commented 5 years ago

@oxinabox Except having lots of comments makes it hard for github to load the page. If your comments can be added to a separate PR I'd appreciate if you made an issue. I also wasn't expecting people to review.

nickrobinson251 commented 5 years ago

I do not know how to approve on Github but if someone points me to the button i'll approve :)

rofinn commented 5 years ago

Alright, since I don't have as strong of an argument for using a pipe to the materializer I've opted to change that. I still have a strong preference for using julia's |> and operators for composing imputation pipelines though.