rofinn closed this 5 years ago
I think we should probably support taking `obsdim` as a kwarg. It's what StatsBase is moving towards, if I recall discussions with @nalimilan correctly.
Unfortunately I don't think we have clearly agreed on a standard keyword argument for this. `pairwise` in Distances uses `dims`, like `cov` and co. in Statistics, but that's not super obvious. So we could use `obsdim` or `vardim`, but we need to discuss that somewhere. One issue is that for `pairwise` the question isn't really where the observations are and where the variables are, but rather what you want to compute (both could make sense). But maybe that's a special function, and in most cases `obsdim` or `vardim` is OK.
We have at least settled that, where possible, it should be a kwarg rather than baked into the function itself (and mentioned in the docstring).
FWIW, CovarianceEstimation.jl also takes a `dims` kwarg for specifying observations.
That's true, but the annoying thing is that `dims` only makes sense when the input is a matrix. I'd also rather not need to pass kwargs to the internal `impute` functions, because our handy functions extract the kwargs into an imputor type internally. If there were some nice pattern for consuming and forwarding kwargs then that might work... even if it is a little inconsistent.
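To illustrate the pattern being described, one option is to have the convenience constructor consume all the kwargs into the imputor struct, so the internal methods only ever see the struct. This is just a sketch with hypothetical names (`Fill`, a simplified `impute!`), not Impute.jl's actual implementation:

```julia
# Hypothetical imputor type: the kwargs live here, not in the call chain.
struct Fill{T}
    value::T
    vardim::Int
end

# Convenience constructor consumes the kwargs into the type...
Fill(; value=0.0, vardim=2) = Fill(value, vardim)

# ...so the internal method never needs kwargs at all.
function impute!(imp::Fill, data::AbstractMatrix)
    for I in eachindex(data)
        # NaN stands in for the real missing-data predicate here
        isnan(data[I]) && (data[I] = imp.value)
    end
    return data
end

# Public convenience function: forward all kwargs to the constructor.
impute!(data; kwargs...) = impute!(Fill(; kwargs...), data)

M = [1.0 NaN; NaN 4.0]
impute!(M; value=-1.0)    # M is now [1.0 -1.0; -1.0 4.0]
```

The upside is that forwarding is a one-liner (`kwargs...` splats straight into the constructor); the downside, as noted above, is that matrix-only options like `vardim` end up stored on every imputor even when they're ignored.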
Based on a recommendation from Curtis I've introduced a `vardim` kwarg to the `Imputor` constructors and imputation convenience functions. The `vardim` field in each `Imputor` is ignored unless we're operating on matrices.
> Based on a recommendation from Curtis I've introduced a `vardim` kwarg to the `Imputor` constructors and imputation convenience functions. The `vardim` field in each `Imputor` is ignored unless we're operating on matrices.
Just for my edification, why `vardim` and not `obsdim`? I always think in terms of `obsdim`. (As I said though, I don't feel it matters one way or the other as long as it is configurable.)
I guess `vardim` because most imputation methods treat each variable as independent? So that is the dimension that matters? (But on that basis it is also best for each variable to stay as a column, since that will be most cache friendly.)
> I guess `vardim` because most imputation methods treat each variable as independent?
More because it's already been used in the stats ecosystem, and it reads better to me than `obsdim` 🤷‍♂️
Alright, I think this is sufficiently better that it warrants another review at some point. Apart from the requested changes from @oxinabox (thanks for reviewing this massive thing!) most of the other changes were just to update the docs to be more consistent and more example-focused.
API-wise I slightly worry there are too many different ways to do things, and I personally get very confused by `|>` syntax. I would be very happy with only `impute!(::Imputor, data)` and will be sad if I need any other syntax.
The main issue with `impute!(::Imputor, data)` is that it's verbose, and the primary reason we're using types is to provide reasonable method fallbacks for matrices and tables in the univariate cases. I think the `|>` syntax is easy to abuse, but in this case it's a very explicit way of expressing lazy evaluation of imputation methods. NOTE: You could also write `nocb!(locf!(interp(data; context=Context(...))))` ;p
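To make the equivalence concrete, the piped and nested spellings produce identical results. The sketch below uses simplified stand-in functions for last-observation-carried-forward and next-observation-carried-backward (not Impute.jl's actual `locf!`/`nocb!` methods):

```julia
# Simplified stand-ins: carry the last/next non-missing value over missings.
locf(xs) = accumulate((a, b) -> ismissing(b) ? a : b, xs)
nocb(xs) = reverse(accumulate((a, b) -> ismissing(b) ? a : b, reverse(xs)))

data = [missing, 1, missing, 3, missing]

# Pipe syntax: reads left-to-right in application order
a = data |> locf |> nocb

# Nested calls: same computation, written inside-out
b = nocb(locf(data))

a == b    # true: both give [1, 1, 1, 3, 3]
```

Whether `data |> locf |> nocb` or `nocb(locf(data))` reads better is exactly the taste question in this thread; the pipe at least keeps the steps in the order they run.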
Since folks seem largely against it, I'm just going to remove the matrix orientation deprecation. Any methods using Distributions.jl or MultivariateStats.jl can just use the `vardim` field to determine whether we need to pivot the matrix.
Okay, I've resolved and applied most of the recommendations. I've commented on things that should be handled in a separate PR and responded where I disagree (e.g., I think `|>` makes sense here, and you can always use explicit function calls if you don't like it).
Merging #20 into master will increase coverage by 5.73%. The diff coverage is 98.22%.
```diff
@@            Coverage Diff            @@
##           master      #20     +/-  ##
=========================================
+ Coverage   92.07%   97.81%   +5.73%
=========================================
  Files           9       10       +1
  Lines         101      183      +82
=========================================
+ Hits           93      179      +86
+ Misses          8        4       -4
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/imputors/nocb.jl | 100% <100%> (ø) | :arrow_up: |
| src/imputors/interp.jl | 100% <100%> (ø) | :arrow_up: |
| src/Impute.jl | 90% <100%> (+23.33%) | :arrow_up: |
| src/imputors/chain.jl | 100% <100%> (ø) | :arrow_up: |
| src/deprecated.jl | 100% <100%> (ø) | |
| src/context.jl | 100% <100%> (+6.25%) | :arrow_up: |
| src/imputors/locf.jl | 100% <100%> (ø) | :arrow_up: |
| src/imputors.jl | 100% <100%> (ø) | :arrow_up: |
| src/imputors/fill.jl | 100% <100%> (ø) | :arrow_up: |
| src/imputors/drop.jl | 92.5% <92.5%> (-7.5%) | :arrow_down: |
| ... and 1 more | | |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 255ac57...051d6ce. Read the comment docs.
@oxinabox Except having lots of comments makes it hard for GitHub to load the page. If your comments can be addressed in a separate PR, I'd appreciate it if you made an issue. I also wasn't expecting people to review.
I do not know how to approve on GitHub, but if someone points me to the button I'll approve :)
Alright, since I don't have as strong an argument for using a pipe to the materializer, I've opted to change that. I still have a strong preference for using Julia's `|>` and `∘` operators for composing imputation pipelines though.
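For readers less familiar with the two operators: `∘` builds a reusable composed function (applied right-to-left), while `|>` applies functions to a value left-to-right. A minimal illustration with placeholder steps (`step1`/`step2` are not Impute.jl API):

```julia
# Two placeholder pipeline stages.
step1(x) = x .+ 1
step2(x) = x .* 2

# ∘ composes right-to-left: pipeline(x) == step2(step1(x))
pipeline = step2 ∘ step1

# |> applies left-to-right on a concrete value.
pipeline([1, 2]) == ([1, 2] |> step1 |> step2)   # true: both give [4, 6]
```

So `∘` is the natural fit for building a named imputation pipeline up front, and `|>` for one-off application to data.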
Looks like we needed to do some cleanup of the `Context` type, but otherwise this transition was pretty smooth. It also looks like using the Tables interface has actually improved performance by minimizing data copies. Closes #21 #22 #24 #6

Dataset:
Original:
New:
TODO: Tag new releases of