rofinn commented 5 years ago

Between changes in the Julia ecosystem and new requirements, we should probably refactor our design to address specific use cases:

[x] Simple Impute.fill(X), Impute.locf(X), Impute.chain(...) should stay simple
[x] Deprecate impute(X, :method) calls
[x] More functional interface for Chain. Maybe we can have the locf, fill, etc. methods default to returning a lazy function if data isn't passed in? This would allow us to write an imputation pipeline as:

data = Impute.interp(data; kwargs...) |> Impute.locf(; kwargs...) |> Impute.nocb(; kwargs...)

[x] Drop direct dependence on DataFrames by using the Tables interface (at the expense of an extra copy) #20
[x] Switch to using JuliaStats matrix orientation by default.
[ ] Introduce an IDataset type which stores the original values, a bitmask of the missing values, and a sparse array of imputed values.
[ ] Alternate API where we construct an IDataset from X and pass that to different methods (e.g., @chain, @multiply).
[x] Add support for dropping entire variables if there are too many missing values

NOTE: It's okay if certain imputation methods only work on certain types of data
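
A minimal sketch of the "lazy function when data isn't passed in" idea from the list above, limited to vectors. The `locf` and `nocb` definitions here are simplified stand-ins written for illustration, not the package's implementations, and keyword arguments are left out:

```julia
# Hypothetical stand-ins for Impute.locf / Impute.nocb (illustration only).

# Last observation carried forward; returns a new vector, input is untouched.
function locf(data::AbstractVector)
    result = collect(Union{Missing, eltype(data)}, data)
    for i in 2:length(result)
        if ismissing(result[i])
            result[i] = result[i - 1]
        end
    end
    return result
end

# Next observation carried backward; returns a new vector, input is untouched.
function nocb(data::AbstractVector)
    result = collect(Union{Missing, eltype(data)}, data)
    for i in (length(result) - 1):-1:1
        if ismissing(result[i])
            result[i] = result[i + 1]
        end
    end
    return result
end

# Called without data, each method returns a function that waits for the data.
locf() = data -> locf(data)
nocb() = data -> nocb(data)

data = [missing, 1.0, missing, 3.0, missing]

# locf fills everything except the leading gap, nocb then fills that gap.
result = data |> locf() |> nocb()
```

Because the zero-argument methods only return closures, the eager calls `locf(data)` and `nocb(data)` keep working unchanged, which is what would let the simple API stay simple.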

rofinn commented 5 years ago

I think the best way to handle dropping entire variables is to:

Construct Imputation methods with a Context.
Define DropVars/dropvars and DropObs/dropobs

This would allow you to implement a workflow like:

chain(DropVars(...), Interpolate(...), DropObs(...))

Before we make those changes, we should probably:

deprecate the impute(X, :method) functions first
switch the matrix orientation
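
One way the DropVars/DropObs workflow proposed above could look, as a rough sketch. Every definition here is an assumption made for illustration: the fields of `Context`, the `impute` function name, the `chain(data, steps...)` signature, and the convention that rows are observations and columns are variables. An `Interpolate` step is omitted to keep it short:

```julia
# Hypothetical sketch; none of these definitions are taken from Impute.jl.

# Settings shared by the imputation steps.
struct Context
    limit::Float64        # maximum tolerated fraction of missing values
    is_missing::Function  # predicate deciding what counts as missing
end

struct DropVars
    context::Context
end

struct DropObs
    context::Context
end

# Drop every variable (column) whose missing fraction exceeds the limit.
function impute(data::AbstractMatrix, step::DropVars)
    ctx = step.context
    keep = [count(ctx.is_missing, col) / length(col) <= ctx.limit for col in eachcol(data)]
    return data[:, keep]
end

# Drop every observation (row) that still contains a missing value.
function impute(data::AbstractMatrix, step::DropObs)
    ctx = step.context
    keep = [!any(ctx.is_missing, row) for row in eachrow(data)]
    return data[keep, :]
end

# Apply the steps left to right, feeding each result into the next step.
chain(data, steps...) = foldl(impute, steps; init=data)

ctx = Context(0.5, ismissing)
data = [1.0 missing 3.0; missing missing 6.0; 7.0 missing 9.0; 10.0 11.0 12.0]

# The second column is 75% missing and is dropped first, then the second row is dropped.
cleaned = chain(data, DropVars(ctx), DropObs(ctx))
```

Dropping sparse variables before dropping observations matters here: running `DropObs` first would discard three of the four rows.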

rofinn commented 5 years ago

As an extension to the changes proposed above, we may want to define a separate module for imputation iterators. This would address issues related to mutation inconsistency and Context usage by encapsulating most of the base behaviour in a collection of iterators that don't support mutation and have a reduced API. For more complex cases, we should just use a Dataset type which stores a mask of the missing values along with the original and imputed datasets. We can always provide helpful methods for testing missing data patterns, but those will likely require multiple passes over the data anyway.

Iterators

Only make one pass through the data (even with chaining)
Don't explicitly make a copy of the data, but also don't mutate the underlying data
Take an ismissing function
Take a limit value and error if there are too many missing values
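
To make the properties above concrete, here is a small sketch of a lazy LOCF iterator. The `LOCFIterator` type and `_advance` helper are hypothetical names, and the limit is assumed to be a count of missing values rather than a fraction, since a single pass does not know the total length up front:

```julia
# Hypothetical lazy LOCF iterator (illustration only).
struct LOCFIterator{I, F}
    itr::I         # wrapped data; never copied or mutated
    is_missing::F  # user-supplied missingness predicate
    limit::Int     # maximum number of missing values tolerated
end

Base.IteratorSize(::Type{<:LOCFIterator}) = Base.SizeUnknown()
Base.IteratorEltype(::Type{<:LOCFIterator}) = Base.EltypeUnknown()

# Shared step: replace a missing value with the previous observation.
function _advance(it::LOCFIterator, next, prev, nmissing)
    next === nothing && return nothing
    value, inner = next
    if it.is_missing(value)
        nmissing += 1
        nmissing > it.limit && error("more than $(it.limit) missing values")
        value = prev
    else
        prev = value
    end
    # State: (wrapped iterator state, last observed value, missing count so far)
    return value, (inner, prev, nmissing)
end

Base.iterate(it::LOCFIterator) = _advance(it, iterate(it.itr), missing, 0)

function Base.iterate(it::LOCFIterator, (inner, prev, nmissing))
    return _advance(it, iterate(it.itr, inner), prev, nmissing)
end

data = [1, missing, missing, 4, missing]

# Values are produced one at a time while walking the wrapped data once.
filled = collect(LOCFIterator(data, ismissing, 3))
```

Since the wrapper is itself iterable, another wrapper of the same shape could take it as its `itr`, and the underlying data would still be traversed only once.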

Datasets

Construct missingness masks for original dataset (w/ error conditions)
Impute values and store them in a sparse array
Support complete, merge and analyse operations
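
A rough sketch of the dataset idea for a single vector, assuming mean imputation as the example method. The field names, the `impute_mean!` function, and the `complete` signature are all hypothetical, and the merge and analyse operations are not covered:

```julia
using SparseArrays
using Statistics

# Hypothetical dataset type: original values, missingness mask, sparse imputed values.
struct IDataset{T}
    original::Vector{Union{Missing, T}}
    mask::BitVector                # true where the original value is missing
    imputed::SparseVector{T, Int}  # only stores entries that were imputed
end

function IDataset(x::AbstractVector)
    T = nonmissingtype(eltype(x))
    return IDataset{T}(collect(Union{Missing, T}, x), ismissing.(x), spzeros(T, length(x)))
end

# Fill the sparse array with the mean of the observed values (example method only).
function impute_mean!(ds::IDataset)
    fill_value = mean(skipmissing(ds.original))
    for i in findall(ds.mask)
        ds.imputed[i] = fill_value
    end
    return ds
end

# Combine original and imputed values into one dense vector without missings.
function complete(ds::IDataset{T}) where {T}
    return T[ds.mask[i] ? ds.imputed[i] : ds.original[i] for i in eachindex(ds.original)]
end

x = [1.0, missing, 3.0, missing, 5.0]
ds = impute_mean!(IDataset(x))
completed = complete(ds)
```

Keeping the mask separate from the sparse array means an imputed value of zero, which a sparse array does not store, is still distinguishable from an observed one.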

rofinn commented 4 years ago

The iterators API didn't work out. Ensuring reasonable performance for even the current list of imputation strategies was challenging, and I don't know how much benefit it's likely to have. I think any future effort would be better spent simplifying the current API, so folks can define their own methods more easily.

invenia / Impute.jl

Refactoring #17
