invenia / Impute.jl

Imputation methods for missing data in julia
https://invenia.github.io/Impute.jl/latest/
Other
77 stars 11 forks source link

Suprising behaviour of Impute.replace #137

Open ueliwechsler opened 1 year ago

ueliwechsler commented 1 year ago

I was struggle a bit with the Impute.replace functionality, as at first I did not realize that replacing 0 is not the same as replacing 0.0.

See the following minimal example:

julia> using Impute
julai> using DataFrames
julia> df_test = DataFrame(a = [missing, 1.2, 2.0, 3.5])
julia> transform!(df_test, "a" => Impute.replace(values=0) => "a")
4×1 DataFrame
 Row │ a         
     │ Float64?  
─────┼───────────
   1 │ missing   
   2 │       1.2
   3 │       2.0
   4 │       3.54×1 DataFrame
julia> transform!(df_test, "a" => Impute.replace(values=0.0) => "a")
 Row │ a        
     │ Float64? 
─────┼──────────
   1 │      0.0
   2 │      1.2
   3 │      2.0
   4 │      3.5

I know it is written in the docs

If the input data is of a different type then the no replacement will be performed.

But it does not feel intuitive to me to not replace the value (and not provide a warning), especially if it could be directly converted into the type of column values.

Is there something that could be done to not "silently fail" if the value is not replaced?

rofinn commented 1 year ago

This is an intentional design choice in order support passing tuples of values.

# From https://github.com/invenia/Impute.jl/blob/master/test/imputors/replace.jl#L23
imp = Replace(; values=(DateTime(0), -9999, NaN, ""))
df_table = DataFrame(
    :time => [missing, DateTime(2020, 02, 02), DateTime(2121, 12, 12)],
    :loc => [12, -5, missing],
    :val => [1.5, missing, 3.0],
    :desc => ["foo", "bar", missing],
)
df_expected = DataFrame(
    :time => [DateTime(0), DateTime(2020, 02, 02), DateTime(2121, 12, 12)],
    :loc => [12, -5, -9999],
    :val => [1.5, NaN, 3.0],
    :desc => ["foo", "bar", ""],
)

Simply adding a type conversion would break this functionality and adding a warning would be very noisy. My only thought is that you could probably use typejoin and either iteratively/recursively call supertype to identify the closest match. Though I don't think the added complexity and overhead would be worth it.

ueliwechsler commented 1 year ago

Thank you for the explanation. My example was only considering replacing the values for one column. In the more general context of replacing values for more than one column, the design makes sense to me.

Would you mind me creating a PR to update the documentation such that the example shows a case where the values are not replaced and maybe add a caveat box to highlight the behavior?