Closed rofinn closed 4 years ago
I think the best way to handle dropping entire variables is to:
DropVars/dropvars and DropObs/dropobs
This would allow you to implement a workflow like:
chain(DropVars(...), Interpolate(...), DropObs(...))
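To make the proposed workflow concrete, here is a minimal sketch using only Base Julia. The functions `dropvars`, `interpolate`, `dropobs` and `chain` below are hypothetical stand-ins written for this example, not the package's implementation, and the `limit` keyword and its semantics are assumptions.

```julia
# Hypothetical stand-ins for the proposed steps (not the package's code).
# Data is a Matrix with rows = observations and columns = variables.

# Drop variables (columns) whose fraction of missing values exceeds `limit`.
function dropvars(X::AbstractMatrix; limit=0.5)
    keep = [count(ismissing, col) / length(col) <= limit for col in eachcol(X)]
    return X[:, keep]
end

# Linearly interpolate interior gaps in each column.
# Leading and trailing gaps are left missing.
function interpolate(X::AbstractMatrix)
    Y = Matrix{Union{Missing, Float64}}(X)   # constructor makes a copy
    for col in eachcol(Y)
        last_seen = 0                        # index of the previous observed value
        for i in eachindex(col)
            ismissing(col[i]) && continue
            if last_seen > 0 && i - last_seen > 1
                lo, hi = col[last_seen], col[i]
                for j in (last_seen + 1):(i - 1)
                    col[j] = lo + (hi - lo) * (j - last_seen) / (i - last_seen)
                end
            end
            last_seen = i
        end
    end
    return Y
end

# Drop observations (rows) that still contain a missing value.
function dropobs(X::AbstractMatrix)
    keep = [!any(ismissing, row) for row in eachrow(X)]
    return X[keep, :]
end

# Apply the steps left to right.
chain(X, steps...) = foldl((data, step) -> step(data), steps; init=X)

X = [1.0      missing  missing;
     missing  missing  20.0;
     3.0      missing  missing;
     4.0      5.0      40.0]

result = chain(X, x -> dropvars(x; limit=0.5), interpolate, dropobs)
```

In this example the second column is dropped because three of its four values are missing, the interior gaps in the remaining columns are interpolated, and the first row is dropped because its leading gap cannot be interpolated.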
Before we make those changes we should probably:
impute(X, :method) functions first

As an extension to the above proposed changes we may want to define a separate module for imputation iterators. This would address issues related to mutation inconsistency and Context usage by encapsulating most of the base behaviour in a collection of iterators that don't support mutation and have a reduced API. For more complex cases, we should just use a Dataset type which stores a mask of the missing values with the original and imputed datasets. We can always provide helpful methods for testing missing data patterns, but those will likely require multiple passes over the data anyway.
ismissing function
limit value to error if there are too many missing values
complete, merge and analyse operations

The iterators API didn't work out. Ensuring reasonable performance for even the current list of imputation strategies was challenging, and I don't know how much benefit it's likely to have. I think any future effort would be better spent simplifying the current API, so folks can define their own methods more easily.
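For reference, the mask-carrying Dataset idea mentioned earlier in this thread could look roughly like the sketch below. Everything here is hypothetical: the field names, the `fill_dataset` helper, and the exact meaning given to `limit` and `complete` are assumptions made for illustration only.

```julia
# Hypothetical type; field and function names are illustrative only.
struct Dataset{T}
    original::Vector{Union{Missing, T}}   # data as received
    imputed::Vector{T}                    # data with gaps filled
    mask::BitVector                       # true where the original value was missing
end

# Build a Dataset by replacing missing values with `value`.
# Throws when the share of missing values exceeds `limit` (assumed semantics).
function fill_dataset(data::AbstractVector, value; limit=1.0)
    T = nonmissingtype(eltype(data))
    mask = BitVector(ismissing.(data))
    ratio = count(mask) / length(data)
    ratio > limit && throw(ArgumentError("$(ratio) of values are missing (limit: $(limit))"))
    imputed = T[ismissing(x) ? value : x for x in data]
    return Dataset{T}(data, imputed, mask)
end

# Assumed meaning of `complete`: return the imputed data.
complete(ds::Dataset) = ds.imputed

data = [1.0, missing, 3.0, missing]
ds = fill_dataset(data, 0.0; limit=0.5)
```

Keeping the mask next to both copies means later steps can tell imputed values from observed ones without another pass over the original data.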
Between changes in the Julia ecosystem and new requirements we should probably refactor our design to address specific use cases:
Impute.fill(X), Impute.locf(X), Impute.chain(...) should stay simple
impute(X, :method) calls
Chain. Maybe we can have the locf, fill, etc. methods default to returning a lazy function if data isn't passed in? This would allow us to write an imputation pipeline as?
X and pass that to different methods (e.g., @chain, @multiply).

NOTE: It's okay if certain imputation methods only work on certain types of data
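
The "return a lazy function when no data is passed" idea above can be sketched in plain Julia as follows. These definitions are hypothetical and only illustrate the pattern; they are not the package's methods. The fill step is named `fill_missing` here purely to avoid clashing with `Base.fill` in a standalone script.

```julia
# Hypothetical sketch of the pattern (not the package's definitions).

# Last observation carried forward; leading missings are left as-is.
function locf(data::AbstractVector)
    out = copy(data)
    for i in 2:length(out)
        if ismissing(out[i])
            out[i] = out[i - 1]
        end
    end
    return out
end

# Replace every missing value with `value`.
fill_missing(data::AbstractVector; value) = coalesce.(data, value)

# With no data, each method returns a function that waits for the data.
locf() = data -> locf(data)
fill_missing(; value) = data -> fill_missing(data; value=value)

# A pipeline defined independently of any data: locf first, then fill.
pipeline = fill_missing(value=0.0) ∘ locf()

X = [missing, 1.0, missing, 3.0, missing]
imputed = pipeline(X)
piped = X |> locf() |> fill_missing(value=0.0)
```

Because the pipeline is an ordinary function, it can be built once and handed to whatever applies it, and `X` is not mutated along the way.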