invenia / Impute.jl

Imputation methods for missing data in julia
https://invenia.github.io/Impute.jl/latest/

Unexpected behavior when using Impute.locf within groupby-combine procedure #140

Open BeitianMa opened 8 months ago

BeitianMa commented 8 months ago

As the following code shows, I want to forward-fill missing values using the Impute.locf function, but only within the same :id:

using DataFramesMeta, Impute

df = DataFrame(id = repeat(1:3, 2), value = [1,missing,3,4,missing,missing])

combine(groupby(df, :id), :value => (x -> Impute.locf(x)) => :value)

Unexpectedly, it raises

ERROR: AssertionError: !(all(ismissing, data))

This is clearly because all values under :id = 2 are missing. However, the following code

df = DataFrame(id = repeat(1:3, 2), value = [missing,missing,missing,missing,missing,missing])

transform(df, :value => (x -> Impute.locf(x)) => :value)

completes with no error. It just leaves all values missing, which is the desired result:

Row  │ id     value   
     │ Int64  Missing 
─────┼────────────────
   1 │     1  missing 
   2 │     2  missing 
   3 │     3  missing 
   4 │     1  missing 
   5 │     2  missing 
   6 │     3  missing

My questions are:

  1. Is this a bug or a feature (for reasons I'm not aware of)?
  2. How do I get the (grouped) results? Of course, the simpler the code, the better.

Thanks in advance!

nilshg commented 8 months ago

As I explained on Discourse, this has nothing to do with groupby:

julia> locf([missing])
1-element Vector{Missing}:
 missing

julia> locf(Union{Float64, Missing}[missing])
ERROR: AssertionError: !(all(ismissing, data))
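
The two calls above differ only in the element type of the input, not in its contents. A minimal sketch using only Base Julia (the variable names are illustrative, not from the thread) shows how the same all-missing data can carry either element type, and how broadcasting identity narrows it:

```julia
# Same contents, different element types
plain = [missing, missing]                         # literal of only missings
wide  = Union{Float64, Missing}[missing, missing]  # eltype still allows Float64

@show eltype(plain)
@show eltype(wide)

# Broadcasting identity re-infers the eltype from the values actually present
narrowed = identity.(wide)
@show eltype(narrowed)
```

A group taken from a column that contains some non-missing values keeps the column's wider element type even when the group itself is all missing, which would explain why the grouped call hits the assertion.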

rofinn commented 8 months ago

Hmm, I believe this was introduced to avoid having LOCF silently fail to impute any values. Perhaps we should support a flag or something... If you're positive that you don't want the error to be raised, then the easiest solution would probably be something like this:

combine(groupby(df, :id), :value => (x -> Impute.locf(identity.(x))) => :value)

If everything in a group is missing, then this will reallocate that group's array as a Vector{Missing}. Depending on your data, using something like ResultTypes.jl with a condition on the error case would allocate less memory, but be slightly more verbose.
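
Another way to avoid the assertion, sketched here as an assumption rather than anything Impute.jl provides, is to check for the all-missing case before calling Impute.locf. The helper name locf_or_skip is hypothetical:

```julia
using DataFrames, Impute

# Hypothetical helper (not part of Impute.jl): return an all-missing
# group unchanged, otherwise forward-fill it with Impute.locf.
locf_or_skip(x) = all(ismissing, x) ? x : Impute.locf(x)

df = DataFrame(id = repeat(1:3, 2), value = [1, missing, 3, 4, missing, missing])

# Forward-fill within each :id; the all-missing group (:id = 2) stays missing
result = combine(groupby(df, :id), :value => locf_or_skip => :value)
```

This avoids the extra allocation from identity.(x), at the cost of one pass over each group to check for non-missing values.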