invenia / Impute.jl

Imputation methods for missing data in julia
https://invenia.github.io/Impute.jl/latest/

Use `skipmissing` instead of `drop` in `fill`? #50

Closed nickrobinson251 closed 3 years ago

rofinn commented 5 years ago

skipmissing wouldn't work if we change the missingness function (e.g., isnan, x -> x == 999999). Many other datasets use different sentinel values.
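
To illustrate the point, here is a minimal sketch using only Base Julia. The data and the `is_absent` predicate are made up for illustration and are not part of Impute.jl.

```julia
# Data where 999999 and NaN both mean "no reading"; no element is `missing`.
readings = [1.0, 999999.0, 3.0, NaN]

# skipmissing only removes values that are `missing`, so nothing is dropped.
kept_by_skipmissing = collect(skipmissing(readings))

# A user-supplied missingness predicate (hypothetical name) covers both conventions.
is_absent(x) = isnan(x) || x == 999999

# filter with the negated predicate drops the sentinel and the NaN.
kept_by_predicate = filter(!is_absent, readings)
```

Here `skipmissing` keeps all four values, while the predicate-based filter keeps only `1.0` and `3.0`.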

nickrobinson251 commented 5 years ago

Ah, I didn't realise it was a goal of the package to impute non-missing values. If that's not documented, perhaps it'd be worth adding somewhere?

Feel free to close this :)

rofinn commented 5 years ago

It's only really documented for the Context type and isn't currently used in any examples. I'll leave this open till that's added. If we move in the direction of having an Impute.Iterators module then the behaviour of Impute.Iterators.drop and skipmissing should become almost identical in the base case. Might be a good thing to test against though :)
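
A rough sketch of the kind of test mentioned above, assuming a lazy `drop` that takes the missingness predicate as its first argument. This `drop` is a hypothetical stand-in written with `Iterators.filter`, not the package's implementation.

```julia
# Hypothetical stand-in for a lazy drop iterator; not Impute.jl's actual code.
drop(is_absent, itr) = Iterators.filter(!is_absent, itr)

data = [1, missing, 3, missing, 5]

# Base case: the missingness function is `ismissing`.
drop_result = collect(drop(ismissing, data))
skip_result = collect(skipmissing(data))

# The two approaches should yield the same values in the same order.
same_values = isequal(drop_result, skip_result)
```

In this sketch the values agree, though the collected element types may differ, which is one reason the behaviour is "almost" rather than exactly identical.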

nickrobinson251 commented 5 years ago

> skipmissing wouldn't work if we change the missingness function

The more I think about this, the less I like it.

Julia gives us `missing`, which is a monumentally useful thing, and I think it is reasonable to build for that case. If users need to `replace(X, 999999 => missing)`, then they can do that before imputing.
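
As a sketch of that workflow using only Base and the Statistics standard library (the data is made up, and the mean fill is written by hand here rather than through Impute.jl):

```julia
using Statistics  # standard library, for mean

X = [2.0, 999999.0, 4.0, 999999.0, 6.0]

# Convert the sentinel to `missing` up front.
Xm = replace(X, 999999.0 => missing)

# Everything downstream can now rely on `missing` and `skipmissing`.
fill_value = mean(skipmissing(Xm))

# Fill each missing entry with the mean of the observed values.
filled = coalesce.(Xm, fill_value)
```

The cost is an extra pass over the data to do the replacement before any imputation starts.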

rofinn commented 5 years ago

Agreed. That’s why I’d like to move the current behaviour to an iterators module and default to using a multipass approach with an Impute.Dataset type. I'll note that most of these design decisions were made when Missing and Nullable were both things, which is less relevant now that Julia provides `missing` by default.

A couple notes on how I think this should exist in the Impute.Iters API.

  1. If it isn't hard to continue supporting arbitrary missingness functions, I don't see a reason not to support them in the iterators interface, since that lets you perform these operations in a single pass through the data rather than requiring multiple passes.
  2. In the case of fill, the iterator interface probably shouldn't apply a function over all of the non-missing data; it should instead use something like an OnlineStat if a single pass is the desired behaviour. If you're willing to do multiple passes, then just manually create an Impute.Dataset type with a custom mask.
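
To make the single-pass idea in point 2 concrete, here is a minimal sketch of an incrementally updated mean that skips values flagged by a custom missingness function. The `running_mean` helper is hypothetical and written in plain Julia; it stands in for the kind of statistic an OnlineStat would provide and does not use OnlineStats.jl or Impute.jl.

```julia
# Single pass: update a running mean as values stream by, skipping absent ones.
# `is_absent` is any user-supplied missingness predicate (hypothetical name).
function running_mean(is_absent, itr)
    n = 0
    avg = 0.0
    for x in itr
        is_absent(x) && continue
        n += 1
        avg += (x - avg) / n   # incremental mean update, no stored history
    end
    # With no observed values there is nothing to report.
    return n == 0 ? missing : avg
end

stream = [2.0, 999999.0, 4.0, 999999.0, 6.0]
single_pass = running_mean(x -> x == 999999, stream)
```

Unlike computing a function over all of the collected non-missing data, this never materialises the observed values, which is what makes it compatible with a one-pass iterator.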

rofinn commented 3 years ago

That's exactly what's happening in the new Impute.substitute call introduced in #69

https://github.com/invenia/Impute.jl/blob/master/src/imputors/substitute.jl#L51