invenia / Impute.jl

Imputation methods for missing data in Julia
https://invenia.github.io/Impute.jl/latest/

kNN Imputation #54

Closed · appleparan closed this 4 years ago

appleparan commented 4 years ago

Inspired by the SVD imputation (#16), I implemented kNN imputation, which closes #4.

Reference
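
For context, here is a rough sketch of the general kNN-imputation idea this PR is about: observations stored as columns, a KDTree from NearestNeighbors.jl, and a mean fill as the initial guess. This is only an illustration under those assumptions, not the PR's actual code; the function name and defaults are invented here.

```julia
# Rough illustration of kNN imputation; not the code in this PR.
using NearestNeighbors, Statistics

function knn_impute_sketch(data::Matrix{Union{Missing,Float64}}; k::Int = 3)
    # Initial guess: fill each variable (row) with its observed mean.
    # (Assumes every row has at least one observed value.)
    filled = Matrix{Float64}(undef, size(data))
    for i in axes(data, 1)
        m = mean(skipmissing(data[i, :]))
        filled[i, :] = coalesce.(data[i, :], m)
    end

    tree = KDTree(filled)                      # each column is one observation
    for j in axes(data, 2)
        miss = findall(ismissing, data[:, j])
        isempty(miss) && continue
        # Ask for k + 1 neighbours because the query point is its own nearest neighbour.
        idxs, _ = knn(tree, filled[:, j], k + 1, true)
        neighbours = setdiff(idxs, [j])[1:min(k, end)]
        # Replace the missing entries with the mean of the neighbours' values.
        filled[miss, j] = vec(mean(filled[miss, neighbours]; dims = 2))
    end
    return filled
end
```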

codecov[bot] commented 4 years ago

Codecov Report

Merging #54 into master will decrease coverage by 0.71%. The diff coverage is 88.88%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #54      +/-   ##
==========================================
- Coverage   96.69%   95.97%   -0.72%     
==========================================
  Files          12       13       +1     
  Lines         272      298      +26     
==========================================
+ Hits          263      286      +23     
- Misses          9       12       +3
Impacted Files        Coverage Δ
src/Impute.jl         57.14% <0%> (-6.02%) ↓
src/imputors.jl       100% <100%> (ø) ↑
src/imputors/knn.jl   95.83% <95.83%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 4546b9a...86d0597.

appleparan commented 4 years ago

I have no idea why the build fails in CI on Julia 1.0. Can anyone help me? Update: updating the Manifest solved this problem.

appleparan commented 4 years ago

In my opinion, the broken tests should be fixed in a separate PR.

rofinn commented 4 years ago

Since you seem to be comparing Impute.jl against R anyways, would you mind using RCall (or PyCall) to compare your code against an existing implementation (e.g., https://www.rdocumentation.org/packages/bnstruct/versions/1.0.6/topics/knn.impute, https://github.com/iskandr/fancyimpute)?
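
If it helps, here is a hedged sketch of what such a cross-check might look like via RCall against bnstruct's knn.impute. The R argument names follow the linked documentation page, and the `Impute.knn` call on the Julia side is only a guess at this PR's eventual API, so treat both as assumptions.

```julia
# Hedged sketch of an RCall-based comparison; both the bnstruct arguments and
# the `Impute.knn` call shape are assumptions, not verified against this PR.
using RCall, Impute, Test

X = [1.0 2.0     3.0;
     4.0 missing 6.0;
     7.0 8.0     missing;
     1.5 2.5     3.5]

@rput X                                   # `missing` entries arrive in R as NA
R"library(bnstruct)"
R"r_imputed <- knn.impute(as.matrix(X), k = 2)"
r_imputed = rcopy(R"r_imputed")

jl_imputed = Impute.knn(X; k = 2)         # hypothetical call shape
@test isapprox(jl_imputed, r_imputed; rtol = 0.2)
```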

appleparan commented 4 years ago

I will add RCall later, since it could apply not only to kNN but also to SVD. We should merge the SVD and kNN implementations first, then make the tests more structured, and add the RCall comparison last.

rofinn commented 4 years ago

Okay, I've merged the SVD implementation, if you want to rebase. I'm guessing most of the conflicts will get automerged.

appleparan commented 4 years ago

Okay, I merged with master. I will rebase it. Thank you so much for merging the other branches.

appleparan commented 4 years ago

I rebased it into a single commit. Once CI passes, you can merge it.

appleparan commented 4 years ago

I rewrote the code, but I found some problems.

  1. The tests are unreliable; they fail occasionally. To compensate, I ran the tests many times (x100) and averaged the results.
  2. Added tests based on the iris dataset.
    • Ref: P. Schmitt et al., "A comparison of six methods for missing data imputation"
  3. I changed the order of the original data and the imputed data in nrmsd so that the denominator used for normalising is always the same (see the sketch after this list).
  4. Tests with random variables: I'm not sure we need these. Because of the randomness, they don't guarantee that kNN imputation always beats mean imputation. What about using other datasets, such as some from RDatasets?
  5. Mean imputation sometimes doesn't fill all missing values, usually on the iris dataset. I still haven't figured out whether the problem is in my test code or in the mean (Fill) imputation itself.

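Regarding point 3, here is a minimal sketch of the nrmsd idea; the helper below is illustrative and not necessarily the exact function the tests use. Normalising by the range of the original data keeps the denominator identical for every imputation method being compared.

```julia
# Illustrative nrmsd helper (not necessarily the tests' exact implementation).
using Statistics

function nrmsd(imputed::AbstractArray, original::AbstractArray)
    rmsd = sqrt(mean((imputed .- original) .^ 2))
    # Normalise by the range of the *original* data so the denominator does not
    # depend on which imputation method produced `imputed`.
    return rmsd / (maximum(original) - minimum(original))
end
```
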
rofinn commented 4 years ago

Mean imputation sometimes doesn't fill all missing values, usually on the iris dataset. I still haven't figured out whether the problem is in my test code or in the mean (Fill) imputation itself.

Can you provide a reproducible example for this?

Tests with random variables: I'm not sure we need these. Because of the randomness, they don't guarantee that kNN imputation always beats mean imputation. What about using other datasets, such as some from RDatasets?

That's correct, the random-variable tests from SVD were there to confirm that poor low-rank approximations wouldn't cause the method to fail. In fact, the test confirms that poor performance is expected when there are just a couple of uncorrelated variables (e.g., @test knn_nrmsd > mean_nrmsd * 0.9; the 0.9 was just in case of runtime variability).
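
In other words, the check has roughly this shape (a placeholder sketch: `impute_knn`, `impute_mean`, and `nrmsd` are hypothetical stand-ins for whatever the actual test code calls):

```julia
# Placeholder sketch of the random-variable sanity check; the helper names are
# hypothetical stand-ins for the real test code.
using Test, Random

rng = MersenneTwister(137)
X = rand(rng, 100, 2)                        # a couple of uncorrelated variables
Xm = convert(Matrix{Union{Missing,Float64}}, X)
Xm[rand(rng, size(X)...) .< 0.2] .= missing  # knock out roughly 20% of the entries

knn_nrmsd  = nrmsd(impute_knn(Xm),  X)       # hypothetical helpers
mean_nrmsd = nrmsd(impute_mean(Xm), X)

# Poor kNN performance is expected on uncorrelated data; the 0.9 factor just
# absorbs run-to-run variability.
@test knn_nrmsd > mean_nrmsd * 0.9
```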

appleparan commented 4 years ago

Mean imputation sometimes doesn't fill all missing values, usually on the iris dataset. I still haven't figured out whether the problem is in my test code or in the mean (Fill) imputation itself.

I figured out why. Mean imputation applied to the transposed array is wrong. First, it gives the wrong result. Second, when imputing a dataset with only a few columns, like iris, and an entire column is missing, the mean ends up missing as well.
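
To make that concrete, here is a tiny plain-Julia illustration (not Impute.jl's API): taking "column" means on the transposed matrix averages over observations instead of variables, and a slice whose values are all missing has nothing to average at all.

```julia
# Plain-Julia illustration of the orientation problem; not Impute.jl code.
using Statistics

X = [1.0 missing;
     3.0 4.0]

per_variable    = [mean(skipmissing(X[:, j])) for j in axes(X, 2)]    # [2.0, 4.0]

Xt = permutedims(X)
per_observation = [mean(skipmissing(Xt[:, j])) for j in axes(Xt, 2)]  # [1.0, 3.5]

# If an entire column (variable) were missing, `skipmissing` would leave nothing
# to average, so a mean-based fill could not produce a value for it either.
```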

I fixed most of the minor things. As you pointed out, float.(collect()) is a bad part of my commit. I thought it would be needed for the KDTree, but your suggestion is right; I don't need it. I will fix that and commit again.

appleparan commented 4 years ago

Travis was down. Would you mind retriggering it?

If it looks okay to you, please let me know and I will rebase it.

appleparan commented 4 years ago

The test failures are due to #61.

rofinn commented 4 years ago

Yeah, we should also drop AppVeyor anyway.