invenia / Impute.jl

Imputation methods for missing data in julia
https://invenia.github.io/Impute.jl/latest/

KNN imputor fixes #83

Closed by rofinn 3 years ago

rofinn commented 3 years ago
  1. Fixes a bug where the fallback check for missing neighbors called ismissing on the imputed data, so it never identified any missing neighbors. Example: https://codecov.io/gh/invenia/Impute.jl/src/2f64a27010480692aff4077792e92ce5b5c01bc0/src/imputors/knn.jl
  2. Neighbors now count as missing only if they are missing the value that is of interest to us.
  3. Inverse distance weighting is simplified to the weighted mean using weights from StatsBase.
  4. Continues to build the KDTree from the full dataset, but restricts the points searched for to the observations that contain missing values.
  5. Closes #73 by running the simulated Iris missing patterns 1000 times (similar to the referenced paper) and using Welch's t-test to determine "significance" 🤞
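
To make items 2 and 3 concrete, here is a minimal sketch of an inverse-distance weighted mean that skips neighbors whose value is missing. It is not the code from this PR: `idw_mean` is a hypothetical helper, it uses only Base Julia, and the handling of zero distances is an assumption. With StatsBase loaded, the final weighted mean could instead be written as `mean(v, weights(w))`.

```julia
# Hypothetical helper (not part of Impute.jl): inverse-distance weighted mean
# of neighbor values, ignoring neighbors whose value is missing.
function idw_mean(values::AbstractVector, dists::AbstractVector{<:Real})
    length(values) == length(dists) ||
        throw(DimensionMismatch("values and dists must have the same length"))

    # Keep only the neighbors that actually observed the value of interest.
    keep = findall(!ismissing, values)
    isempty(keep) && return missing   # the caller decides on a fallback

    v = Float64[values[i] for i in keep]
    d = Float64[dists[i] for i in keep]

    # Assumption: an exact match (distance 0) would get an infinite weight,
    # so average the exact matches directly instead.
    exact = findall(iszero, d)
    isempty(exact) || return sum(v[exact]) / length(exact)

    w = 1 ./ d                     # inverse-distance weights
    return sum(w .* v) / sum(w)    # weighted mean
end

# Toy inputs for illustration only: the second neighbor is missing the value.
neighbors = [1.0, missing, 3.0]
distances = [1.0, 0.5, 3.0]
println(idw_mean(neighbors, distances))
```

Returning `missing` when no neighbor has the value leaves the fallback (for example, a column mean) to the caller, which keeps the helper independent of any particular fallback policy.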

The end result is that the iris dataset tests aren't much better, but the new code allocates less memory, is faster, and performs an order of magnitude better on the "Data match" tests that we expect it to do well on.

Before:

knn_nrmsd = 0.02351386140235716
mean_nrmsd = 0.025655653575507923
knn_nrmsd = 0.03454405384070915
mean_nrmsd = 0.036899643641953625
knn_nrmsd = 0.04029575884706388
mean_nrmsd = 0.042321570291156005
knn_nrmsd = 0.00270435447793449
mean_nrmsd = 0.008769967721453583

After:

knn_nrmsd = 0.034385695023072295
mean_nrmsd = 0.03848092187638184
knn_nrmsd = 0.04448270841439509
mean_nrmsd = 0.04616687082955499
knn_nrmsd = 0.05125571112635573
mean_nrmsd = 0.05261488365411982
knn_nrmsd = 0.00037746930884156415
mean_nrmsd = 0.004039809809703139
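
For readers unfamiliar with the significance check mentioned in item 5, the following is a rough sketch of Welch's t statistic for two samples of scores, such as repeated knn_nrmsd and mean_nrmsd values. It is not the test code from this PR: `welch_t` is a hypothetical helper, the inputs are toy numbers, and it stops at the statistic and the Welch-Satterthwaite degrees of freedom. A package such as HypothesisTests.jl (its `UnequalVarianceTTest`) would be the usual way to get a p-value.

```julia
using Statistics

# Hypothetical helper: Welch's t statistic and Welch-Satterthwaite degrees
# of freedom for two samples that may have unequal variances.
function welch_t(a::AbstractVector{<:Real}, b::AbstractVector{<:Real})
    na, nb = length(a), length(b)
    (na > 1 && nb > 1) ||
        throw(ArgumentError("each sample needs at least two values"))

    # Squared standard errors of the two means (var uses the n - 1 divisor).
    va = var(a) / na
    vb = var(b) / nb

    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb)^2 / (va^2 / (na - 1) + vb^2 / (nb - 1))
    return (t = t, df = df)
end

# Toy samples for illustration only (not results from this PR).
sample_a = [1.0, 2.0, 3.0, 4.0, 5.0]
sample_b = [2.0, 4.0, 6.0, 8.0, 10.0]
println(welch_t(sample_a, sample_b))
```

A negative t here means the first sample has the lower mean, which is the direction one would hope for if the first sample held the KNN errors.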

Before:

julia> @benchmark Impute.knn($X; k=4, dims=:rows)
BenchmarkTools.Trial:
  memory estimate:  200.77 MiB
  allocs estimate:  1095761
  --------------
  minimum time:     147.330 ms (12.08% GC)
  median time:      157.492 ms (18.28% GC)
  mean time:        168.883 ms (23.04% GC)
  maximum time:     323.553 ms (57.92% GC)
  --------------
  samples:          30
  evals/sample:     1

After:

julia> @benchmark Impute.knn($X; k=4, dims=:rows)
BenchmarkTools.Trial:
  memory estimate:  33.29 MiB
  allocs estimate:  711148
  --------------
  minimum time:     57.409 ms (0.00% GC)
  median time:      63.676 ms (5.84% GC)
  mean time:        65.398 ms (4.66% GC)
  maximum time:     90.936 ms (5.13% GC)
  --------------
  samples:          77
  evals/sample:     1

rofinn commented 3 years ago

@appleparan This may be relevant to you ☝️

rofinn commented 3 years ago

All review comments applied except for the @view/@views suggestions, since there wasn't a clear performance benefit during quick benchmarking. I'll merge when tests pass.
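
For context, the difference those suggestions target can be sketched as follows. This is a generic illustration with a toy matrix, not code from the PR: plain slicing copies data, while `@view` and `@views` reference the parent array, which can reduce allocations but does not always change runtime noticeably.

```julia
# Toy 3x4 matrix; its second column is 4.0, 5.0, 6.0.
X = reshape(collect(1.0:12.0), 3, 4)

col_copy = X[:, 2]          # plain slicing allocates a new vector
col_view = @view X[:, 2]    # a view shares memory with X

X[1, 2] = 100.0
println(col_copy[1])        # the copy still holds the old value
println(col_view[1])        # the view reflects the update

# @views turns every slice in the expression into a view.
total = @views sum(X[:, 1])
println(total)
```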

appleparan commented 3 years ago

That's good! Thanks!