Closed kpa28-git closed 3 months ago
I hope you don't mind. I opted to add the commit to your PR to retain the commit history. The force push was to trigger the CI as it was disabled. Adding a rounding mode and dispatching on the increment calculation mostly resolves #71 without introducing a breaking change. Feel free to check the new tests in case I missed something.
Nope, I don't mind. Thanks! Looks good, I'll have a chance to test it later. I don't think I'll have anything to add.
I found what looks to be a problem with interpolate going past a lower boundary, here are some cases:
ints = [8, fill(missing, 10)..., 2]
uints = [0x08, fill(missing, 10)..., 0x02]
Impute.impute(ints, Impute.Interpolate(r=RoundNearest);) # [8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, 2]
Impute.impute(uints, Impute.Interpolate(r=RoundNearest);) # InexactError: trunc(UInt8, -1)
Making the gap between the top and bottom value smaller (in this example, it was 5 or lower) makes it not go past the bottom value, but gives RoundUp instead of RoundNearest:
ints_sm = [7, fill(missing, 10)..., 2]
uints_sm = [0x07, fill(missing, 10)..., 0x02]
Impute.impute(ints_sm, Impute.Interpolate(r=RoundNearest);) # [7, 7, 7, ..., 7, 2]
Impute.impute(uints_sm, Impute.Interpolate(r=RoundNearest);) # [0x07, 0x07, 0x07, ..., 0x07, 0x02]
Right, forgot rounding would violate the increment assumption from before. I've added a fix:
julia> ints = [8, fill(missing, 10)..., 2];
julia> uints = [0x08, fill(missing, 10)..., 0x02];
julia> Impute.impute(ints, Impute.Interpolate(r=RoundNearest);)
12-element Vector{Union{Missing, Int64}}:
8
7
6
5
4
3
2
2
2
2
2
2
julia> Impute.impute(uints, Impute.Interpolate(r=RoundNearest);)
12-element Vector{Union{Missing, UInt8}}:
0x08
0x07
0x06
0x05
0x04
0x03
0x02
0x02
0x02
0x02
0x02
0x02
Alright if there aren’t any comments I’m gonna go ahead and merge/tag this.
I don't mind if you merge it.
This is for future reference (maybe another issue), but my only problem is that the code doesn't handle partial increments as expected. For example:
Impute.impute([08, fill(missing, 12)..., 02], Impute.Interpolate(r=RoundNearest))
[8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 2]
round.(Int, Impute.impute([08., fill(missing, 12)..., 02.], Impute.Interpolate()), RoundNearest)
[8, 8, 7, 7, 6, 6, 5, 5, 4, 4, 3, 3, 2, 2] # what I expect
The user can always cast their data to floats, but then it would be the same as before.
For discrete types that might require a variable increment between values in gaps (e.g. integers), maybe we can dispatch _impute!
or have a branch of the main function to do imputation for gaps each gap at a time. That way, each imputed value can be independent from the previous value.
This would allocate intermediate vectors but it would be better than having to cast the entire input to floats.
Any ideas or thoughts?
Using a generator and setting the gap slices doesn't allocate and gives results that are more even over larger gaps
Example (I don't know if there is a built-in way to set slices over generators so I made a genset!
function):
_calculate_increment(a, b, n, ::Nothing) = (b - a) / n
function interpolate(a::T, b::T, n::Integer, r::RoundingMode) where {T<:Integer}
inc = _calculate_increment(float(a), float(b), n, nothing)
(round(T, a + inc*i, r) for i=1:n)
end
function genset!(v::AbstractVector, st, g)
for (i, val) in enumerate(g)
v[i+st] = val
end
v
end
v = [8, fill(missing, 12)..., 2]
v2 = [0x08, fill(missing, 12)..., 0x02]
@benchmark genset!(v, 1, interpolate(8, 2, 12, RoundNearest)) # no alloc
@benchmark genset!(v2, 1, interpolate(0x08, 0x02, 12, RoundNearest)) # no alloc
Sure. If we don't care about the increment type, then I think just rounding/clamping on the insert could make sense.
Well I have something here that might work. It does the interpolation in floating point space, but uses a generator for to avoid allocation. I haven't benchmarked it on large inputs yet, but on small veectors it has about the same alloc as the integer increment version and maybe a bit faster. It passes all the tests, added some more in this commit too.
The second commit is just to ignore RoundingMode for floats, probably don't want to have the same RoundingMode apply to different vectors in a mixed dataset. Anyway, we usually don't want to round floats.
Sure, makes sense to me 👍🏻 I just added a commit to minimize the changes. Rather than creating a generator with duplicated logic to insert values, we can just dispatch on two special cases:
Unsigned
RoundingMode
is specified.I pushed what I had now (based on your feedback + some benchmarking) in case I can't get back to you quick enough. I don't know if you're interested, but I figure sharing what I have is better.
This is a simple solution for #71.