invenia / Impute.jl

Imputation methods for missing data in julia
https://invenia.github.io/Impute.jl/latest/
Other
76 stars 11 forks source link

API simplification #66

Open rofinn opened 3 years ago

rofinn commented 3 years ago

Overview

The current implementation has some nice features for handling iterative data and provides early exit conditions. Unfortunately, these features are harder to maintain as we need to handle more use cases and different data structures. A couple of examples of this include:

Proposed Changes

  1. Drop AbstractContext and maybe replace it with some or all of the below: - Impute.replace!: which will handle the allowmissing call and could support replacing values in multiple columns at once. - Impute.assert?: if you want to throw an error if some missing data threshold is reached [trivial in most cases] - Impute.mask?: will just give you a binary mask over your input data [trivial in most cases]
  2. Add an Impute.filter option which will filter observations base on some threshold. Along a dims would probably be more general. This is also probably more general than dropobs and dropvars?
  3. Drop public API support for in-place imputation rather than having undefined behaviour
  4. Add a Tables.jl like interface for describing whether imputation methods support vectors, matrices, tables or arbitrary iterators of some eltype. This would help us produce better error messages when someone uses an invalid imputation methods rather than giving a long traceback to a rather opaque method error.
  5. Add a MCAR test check
  6. Add a data generation module which would be helpful for tests, but could also be used for method comparisons.

Out of scope

Success Conditions

  1. Easier to write new imputation methods
  2. No undefined behaviour
  3. More composable API
  4. Shouldn't be slower than existing tooling (e.g., replace, filter)

Failure Conditions

  1. It isn't much easier to write new imputation methods (e.g., similar lines of code, about as readable) and we've also gained the trade-offs below
  2. Trade-offs are severe in even the simplest of cases. Existing test cases should be used as benchmarks for checking this.
  3. Most operations can be replace with an existing method
  4. Error conditions are harder to debug

Trade-offs

Related Issues & PRs

rofinn commented 3 years ago

I think proposed changes 3-6 shouldn't be breaking, so I'll bump the remainder of this issue to the 1.0 release.

bencottier commented 3 years ago

Following https://github.com/invenia/AxisSets.jl/pull/44#discussion_r614292489:

Ideally we would standardise how to handle dims and AxisSets.Patterns across Impute.jl, FeatureTransforms.jl, and AxisSets.jl, in which the former two are both supported.

I think the main difference in handling is that FeatureTransforms.jl supports dims=:, which means apply a transform element-wise over an array.

We should also use a consistent convention for what e.g. dims=1 means (outdated but relevant issue: https://github.com/invenia/FeatureTransforms.jl/issues/18)