invenia / Impute.jl

Imputation methods for missing data in Julia
https://invenia.github.io/Impute.jl/latest/
Other

Refactoring #17

Closed rofinn closed 4 years ago

rofinn commented 5 years ago

Between changes in the Julia ecosystem and new requirements, we should probably refactor our design to address specific use cases:

data = Impute.interp(data; kwargs...) |> Impute.locf(; kwargs...) |> Impute.nocb(; kwargs...)

?

NOTE: It's okay if certain imputation methods only work on certain types of data
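
A rough sketch of how the piping style above could work in plain Julia: each method gets a zero-argument form that returns a function, so steps compose with `|>`. The `locf` and `nocb` functions here are hypothetical stand-ins written for illustration (no keyword arguments, vectors only); they are not the Impute.jl implementation.

```julia
# Hypothetical stand-ins for illustration; NOT the Impute.jl API.

# Last observation carried forward: fill a missing value with the previous one.
function locf(data::AbstractVector)
    result = copy(data)  # leave the input untouched
    for i in 2:length(result)
        if ismissing(result[i]) && !ismissing(result[i - 1])
            result[i] = result[i - 1]
        end
    end
    return result
end

# Next observation carried backward: fill a missing value with the following one.
function nocb(data::AbstractVector)
    result = copy(data)
    for i in (length(result) - 1):-1:1
        if ismissing(result[i]) && !ismissing(result[i + 1])
            result[i] = result[i + 1]
        end
    end
    return result
end

# Zero-argument methods return a function, so steps can be chained with |>.
locf() = data -> locf(data)
nocb() = data -> nocb(data)

data = [missing, 1.0, missing, missing, 4.0, missing]
filled = data |> locf() |> nocb()
```

In this sketch `locf` cannot fill the leading missing value, so the trailing `nocb` step picks it up, which is the kind of fallback the chained form is meant to express.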

rofinn commented 5 years ago

I think the best way to handle dropping entire variables is to:

  1. Construct Imputation methods with a Context.
  2. Define DropVars/dropvars and DropObs/dropobs

This would allow you to implement a workflow like:

chain(DropVars(...), Interpolate(...), DropObs(...))
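
One possible shape for this, sketched under assumptions: the type names come from the proposal, but the `limit` field, the `impute` function, the row-per-observation layout, and the choice to have `chain` return a function are all hypothetical. `Interpolate` and the `Context` are left out to keep the sketch short.

```julia
# Hypothetical sketch; fields and behaviour are assumptions, not the Impute.jl API.

abstract type Imputor end

# Drop observations (rows here) that contain any missing value.
struct DropObs <: Imputor end

# Drop variables (columns here) whose fraction of missing values exceeds `limit`.
struct DropVars <: Imputor
    limit::Float64
end

function impute(X::AbstractMatrix, ::DropObs)
    keep = [!any(ismissing, row) for row in eachrow(X)]
    return X[keep, :]
end

function impute(X::AbstractMatrix, imp::DropVars)
    keep = [count(ismissing, col) / length(col) <= imp.limit for col in eachcol(X)]
    return X[:, keep]
end

# Apply each step in order, feeding the output of one step into the next.
chain(steps::Imputor...) = X -> foldl(impute, steps; init=X)

X = [1.0 missing 3.0;
     4.0 missing missing;
     7.0 8.0     9.0]

# Column 2 is two-thirds missing and is dropped; row 2 still has a missing value and is dropped.
result = chain(DropVars(0.5), DropObs())(X)
```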

Before we make those changes we should probably:

  1. deprecate the impute(X, :method) functions first
  2. switch the matrix orientation

rofinn commented 5 years ago

As an extension to the changes proposed above, we may want to define a separate module for imputation iterators. This would address the mutation inconsistency and Context usage issues by encapsulating most of the base behaviour in a collection of iterators that don't support mutation and have a reduced API. For more complex cases, we should just use a Dataset type which stores a mask of the missing values along with the original and imputed datasets. We can always provide helpful methods for testing missing data patterns, but those will likely require multiple passes over the data anyway.

Iterators

  1. Only make one pass through the data (even with chaining)
  2. Don't explicitly copy the data, but also don't mutate the underlying data
  3. Take an ismissing function
  4. Take a limit value and error if there are too many missing values
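
To make the four points concrete, here is a minimal sketch of a lazy LOCF iterator built on Julia's `Base.iterate` protocol. The type name, field names, and the interpretation of `limit` as a maximum count of missing values are assumptions made for illustration; this is not the design that was actually prototyped.

```julia
# Hypothetical sketch of a lazy LOCF iterator; names and fields are assumptions.
struct LOCFIterator{I, F}
    itr::I          # wrapped data or iterator, never mutated or copied
    is_missing::F   # user-supplied predicate, e.g. `ismissing` or `isnan`
    limit::Int      # maximum number of missing values tolerated before erroring
end

LOCFIterator(itr; is_missing=ismissing, limit=typemax(Int)) =
    LOCFIterator(itr, is_missing, limit)

# The length is not declared, so `collect` grows the result as it goes.
Base.IteratorSize(::Type{<:LOCFIterator}) = Base.SizeUnknown()

Base.iterate(it::LOCFIterator) = _advance(it, iterate(it.itr), nothing, 0)

Base.iterate(it::LOCFIterator, (inner, last_seen, n_missing)) =
    _advance(it, iterate(it.itr, inner), last_seen, n_missing)

function _advance(it::LOCFIterator, next, last_seen, n_missing)
    next === nothing && return nothing  # wrapped iterator is exhausted
    value, inner = next
    if it.is_missing(value)
        n_missing += 1
        n_missing > it.limit && error("more than $(it.limit) missing values")
        # Carry the last observation forward; leading gaps stay as they are.
        if last_seen !== nothing
            value = last_seen
        end
    else
        last_seen = value
    end
    return value, (inner, last_seen, n_missing)
end

data = [missing, 1, missing, 3, missing]
filled = collect(LOCFIterator(data))
```

Because the wrapper accepts any iterator, wrappers of this kind can be nested, and the nested chain still traverses the underlying data only once.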

Datasets

  1. Construct missingness masks for original dataset (w/ error conditions)
  2. Impute values and store them in a sparse array
  3. Support complete, merge and analyse operations
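
A minimal sketch of what such a Dataset type might look like, using the `SparseArrays` standard library. The type name, fields, the `limit` keyword, and the column-mean imputation are hypothetical choices for illustration; only a `complete` operation is shown, and `merge` and `analyse` are omitted.

```julia
using SparseArrays  # standard library

# Hypothetical sketch; type, field, and function names are assumptions.
struct ImputedDataset{T}
    original::Matrix{Union{Missing, T}}
    mask::BitMatrix                    # true where the original value is missing
    imputed::SparseMatrixCSC{T, Int}   # imputed values, stored only at masked positions
end

# Build the missingness mask and error if too large a fraction is missing.
function ImputedDataset(X::Matrix{Union{Missing, T}}; limit=1.0) where T
    mask = ismissing.(X)
    frac = count(mask) / length(mask)
    frac > limit && error("fraction of missing values ($frac) exceeds limit ($limit)")
    return ImputedDataset{T}(X, mask, spzeros(T, size(X)...))
end

# Fill each masked position with the mean of the observed values in its column.
# Columns with no observed values are skipped.
function impute_mean!(ds::ImputedDataset)
    for j in axes(ds.original, 2)
        observed = collect(skipmissing(view(ds.original, :, j)))
        isempty(observed) && continue
        fill_value = sum(observed) / length(observed)
        for i in axes(ds.original, 1)
            if ds.mask[i, j]
                ds.imputed[i, j] = fill_value
            end
        end
    end
    return ds
end

# Combine the original and imputed values into a single dense matrix.
function complete(ds::ImputedDataset{T}) where T
    out = Matrix{T}(undef, size(ds.original))
    for idx in CartesianIndices(ds.original)
        out[idx] = ds.mask[idx] ? ds.imputed[idx] : ds.original[idx]
    end
    return out
end

X = [1.0 missing; 3.0 4.0; missing 6.0]
ds = impute_mean!(ImputedDataset(X))
completed = complete(ds)
```

Keeping the imputed values separate from the original data means the original matrix is never modified, and the mask records which entries of the completed matrix were imputed.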

rofinn commented 4 years ago

The iterators API didn't work out. Ensuring reasonable performance for even the current list of imputation strategies was challenging, and I don't know how much benefit it's likely to have. I think any future effort would be better spent simplifying the current API, so folks can define their own methods more easily.
