invenia / Impute.jl

Imputation methods for missing data in julia
https://invenia.github.io/Impute.jl/latest/

SVD Imputation - Inexact Error #135

Closed · RaSi96 closed this issue 1 year ago

RaSi96 commented 1 year ago

Hi all,

Apologies for the bother, but I was quite curious about the SVD method provided by this package and decided to try it out on a seemingly harmless dataframe:

julia> df = DataFrame(:a => [2, 3, missing, 5], :b => [missing, 9, 16, 25])
4×2 DataFrame
Row │ a        b
    │ Int64?   Int64?
────┼──────────────────
1   │       2  missing
2   │       3        9
3   │ missing       16
4   │       5       25

Here :b is intended to be :a.^2 (the square of :a). After getting a couple of errors trying to use the raw dataframe, I came across #128 and tried a matrix-converted dataframe instead, but interestingly ended up with this error:

julia> Impute.svd(Matrix(df))
ERROR: InexactError: Int64(35.5287074347614)
Stacktrace:
[1] Int64
@ ./float.jl:788 [inlined]
[2] convert
@ ./number.jl:7 [inlined]
[3] setindex!
@ ./array.jl:966 [inlined]
[4] setindex!
@ ./subarray.jl:347 [inlined]
[5] copyto_unaliased!(deststyle::IndexLinear, dest::SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}, srcstyle::IndexLinear, src::Vector{Float64})
@ Base ./abstractarray.jl:1038
[6] copyto!
@ ./abstractarray.jl:1018 [inlined]
[7] copyto!
@ ./broadcast.jl:954 [inlined]
[8] copyto!
@ ./broadcast.jl:913 [inlined]
[9] materialize!
@ ./broadcast.jl:871 [inlined]
[10] materialize!
@ ./broadcast.jl:868 [inlined]
[11] impute!(data::Matrix{Union{Missing, Int64}}, imp::Impute.SVD; dims::Nothing)
@ Impute ~/.julia/packages/Impute/vw7rh/src/imputors/svd.jl:64
[12] impute!
@ ~/.julia/packages/Impute/vw7rh/src/imputors/svd.jl:33 [inlined]
[13] #impute#32
@ ~/.julia/packages/Impute/vw7rh/src/imputors/svd.jl:97 [inlined]
[14] impute
@ ~/.julia/packages/Impute/vw7rh/src/imputors/svd.jl:96 [inlined]
[15] svd(data::Matrix{Union{Missing, Int64}}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ Impute ~/.julia/packages/Impute/vw7rh/src/functional.jl:76
[16] svd(data::Matrix{Union{Missing, Int64}})
@ Impute ~/.julia/packages/Impute/vw7rh/src/functional.jl:74
[17] top-level scope
@ REPL[8]:1

From what I gleaned with a bit of quick Google-fu (two separate links), it seems there may be a missing round-off check somewhere in the chain of function calls? I suspect it might be due to the small size of my test dataframe, but I can't be sure, because the same data worked (to some degree of accuracy) when I used Turing.jl's Bayesian linear regression for imputation. Is there anything wrong here, or have I overlooked something?
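In case it helps, the failing convert at the top of the trace is easy to reproduce in isolation (a minimal sketch, unrelated to Impute.jl itself): writing a non-integer Float64 into Int64 storage raises the same InexactError:

```julia
julia> v = [2, 3, 0, 5]           # Vector{Int64}, like my matrix's element type

julia> v[3] = 35.5287074347614    # the value from the stacktrace
ERROR: InexactError: Int64(35.5287074347614)
```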

Thanks for your time!

rofinn commented 1 year ago

This is likely because Impute.jl is careful not to change your element type for two reasons:

  1. Some imputation methods may fail to impute all values (LOCF/NOCB at the ends)
  2. We don't want to accidentally change the precision on you (Impute.jl should respect if you want to operate on Float32s)

My guess is that Turing.jl is converting your ints to floats for you. With Impute.jl you just need to be explicit about what you want.

julia> Impute.svd(Matrix{Union{Float64, Missing}}(df))
4×2 Matrix{Union{Missing, Float64}}:
 2.0       9.43342
 3.0       9.0
 3.44377  16.0
 5.0      25.0
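If you'd rather keep working with the DataFrame, the same widening can be done column-wise before taking the Matrix (a sketch using DataFrames' mapcols; nothing Impute.jl-specific):

```julia
julia> using DataFrames

julia> widened = mapcols(col -> convert(Vector{Union{Float64, Missing}}, col), df);

julia> eltype(Matrix(widened))
Union{Missing, Float64}
```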

RaSi96 commented 1 year ago

Hi @rofinn, thanks for responding! Apologies for the late reply, I somehow completely missed being notified about this. Your explanation makes a lot of sense; I tried it again using floats only and it worked:

julia> a = DataFrame(:a => [2.0, 3.0, missing, 5.0], :b => [missing, 9.0, 16, 25.0])
4×2 DataFrame
Row │ a          b
    │ Float64?   Float64?
────┼──────────────────────
1   │      2.0   missing
2   │      3.0       9.0
3   │  missing      16.0
4   │      5.0      25.0

julia> Impute.svd(Matrix(a))
4×2 Matrix{Union{Missing, Float64}}:
 2.0       9.43342
 3.0       9.0
 3.44377  16.0
 5.0      25.0

I also tried it on the different dataset I originally intended to use this for, and it worked without a problem. Drawing on what you've said, that's probably because all of that data was read in as floats by default. Thanks for your help; I'll close this issue.
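For anyone landing here later, a quick guard is to check the element type before imputing; `nonmissingtype` from Base strips `Missing` out of the union (just a base-Julia sketch, not an Impute.jl API):

```julia
julia> data = Union{Int64, Missing}[2 missing; 3 9; missing 16; 5 25];

julia> nonmissingtype(eltype(data))   # Int64, so imputed floats won't fit
Int64

julia> data = convert(Matrix{Union{Float64, Missing}}, data);  # widen before Impute.svd

julia> nonmissingtype(eltype(data))
Float64
```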