invenia / Impute.jl

Imputation methods for missing data in julia
https://invenia.github.io/Impute.jl/latest/
Other
76 stars 11 forks source link

handle interp for integers by casting and rounding (#71) #142

Closed kpa28-git closed 3 months ago

kpa28-git commented 4 months ago

This is a simple solution for #71.

rofinn commented 3 months ago

I hope you don't mind. I opted to add the commit to your PR to retain the commit history. The force push was to trigger the CI as it was disabled. Adding a rounding mode and dispatching on the increment calculation mostly resolves #71 without introducing a breaking change. Feel free to check the new tests in case I missed something.

kpa28-git commented 3 months ago

Nope, I don't mind. Thanks! Looks good, I'll have a chance to test it later. I don't think I'll have anything to add.

kpa28-git commented 3 months ago

I found what looks to be a problem with interpolate going past a lower boundary, here are some cases:

ints = [8, fill(missing, 10)..., 2]
uints = [0x08, fill(missing, 10)..., 0x02]
Impute.impute(ints, Impute.Interpolate(r=RoundNearest);)   # [8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, 2]
Impute.impute(uints, Impute.Interpolate(r=RoundNearest);) # InexactError: trunc(UInt8, -1)

Making the gap between the top and bottom value smaller (in this example, it was 5 or lower) makes it not go past the bottom value, but gives RoundUp instead of RoundNearest:

ints_sm = [7, fill(missing, 10)..., 2]
uints_sm = [0x07, fill(missing, 10)..., 0x02]
Impute.impute(ints_sm, Impute.Interpolate(r=RoundNearest);)     # [7, 7, 7, ..., 7, 2]
Impute.impute(uints_sm, Impute.Interpolate(r=RoundNearest);)   # [0x07, 0x07, 0x07, ..., 0x07, 0x02]
rofinn commented 3 months ago

Right, forgot rounding would violate the increment assumption from before. I've added a fix:

julia> ints = [8, fill(missing, 10)..., 2];

julia> uints = [0x08, fill(missing, 10)..., 0x02];

julia> Impute.impute(ints, Impute.Interpolate(r=RoundNearest);)
12-element Vector{Union{Missing, Int64}}:
 8
 7
 6
 5
 4
 3
 2
 2
 2
 2
 2
 2

julia> Impute.impute(uints, Impute.Interpolate(r=RoundNearest);)
12-element Vector{Union{Missing, UInt8}}:
 0x08
 0x07
 0x06
 0x05
 0x04
 0x03
 0x02
 0x02
 0x02
 0x02
 0x02
 0x02
rofinn commented 3 months ago

Alright if there aren’t any comments I’m gonna go ahead and merge/tag this.

kpa28-git commented 3 months ago

I don't mind if you merge it.

This is for future reference (maybe another issue), but my only problem is that the code doesn't handle partial increments as expected. For example:

Impute.impute([08, fill(missing, 12)..., 02], Impute.Interpolate(r=RoundNearest))

[8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 2]

round.(Int, Impute.impute([08., fill(missing, 12)..., 02.], Impute.Interpolate()), RoundNearest)

[8, 8, 7, 7, 6, 6, 5, 5, 4, 4, 3, 3, 2, 2] # what I expect

The user can always cast their data to floats, but then it would be the same as before.

For discrete types that might require a variable increment between values in gaps (e.g. integers), maybe we can dispatch _impute! or have a branch of the main function to do imputation for gaps each gap at a time. That way, each imputed value can be independent from the previous value.

This would allocate intermediate vectors but it would be better than having to cast the entire input to floats.

Any ideas or thoughts?

kpa28-git commented 3 months ago

Using a generator and setting the gap slices doesn't allocate and gives results that are more even over larger gaps Example (I don't know if there is a built-in way to set slices over generators so I made a genset! function):

_calculate_increment(a, b, n, ::Nothing) = (b - a) / n

function interpolate(a::T, b::T, n::Integer, r::RoundingMode) where {T<:Integer}
    inc = _calculate_increment(float(a), float(b), n, nothing)
    (round(T, a + inc*i, r) for i=1:n)
end

function genset!(v::AbstractVector, st, g)
    for (i, val) in enumerate(g)
       v[i+st] = val
    end
    v
end

v = [8, fill(missing, 12)..., 2]
v2 = [0x08, fill(missing, 12)..., 0x02]
@benchmark genset!(v, 1, interpolate(8, 2, 12, RoundNearest)) # no alloc
@benchmark genset!(v2, 1, interpolate(0x08, 0x02, 12, RoundNearest)) # no alloc
rofinn commented 3 months ago

Sure. If we don't care about the increment type, then I think just rounding/clamping on the insert could make sense.

kpa28-git commented 3 months ago

Well I have something here that might work. It does the interpolation in floating point space, but uses a generator for to avoid allocation. I haven't benchmarked it on large inputs yet, but on small veectors it has about the same alloc as the integer increment version and maybe a bit faster. It passes all the tests, added some more in this commit too.

kpa28-git commented 3 months ago

The second commit is just to ignore RoundingMode for floats, probably don't want to have the same RoundingMode apply to different vectors in a mixed dataset. Anyway, we usually don't want to round floats.

rofinn commented 3 months ago

Sure, makes sense to me 👍🏻 I just added a commit to minimize the changes. Rather than creating a generator with duplicated logic to insert values, we can just dispatch on two special cases:

  1. Calculating an increment value with an Unsigned
  2. Rounding when inserting a non-integer value into an integer array if a RoundingMode is specified.
kpa28-git commented 3 months ago

I pushed what I had now (based on your feedback + some benchmarking) in case I can't get back to you quick enough. I don't know if you're interested, but I figure sharing what I have is better.