queryverse / Query.jl

Query almost anything in julia
Other
394 stars 50 forks source link

@groupby failed on CategoricalString with missing value #278

Open AnselmJeong opened 5 years ago

AnselmJeong commented 5 years ago

Let's say, we try to @groupby on a CategoricalString column

df = DataFrame(
    fruit    =  ["Apple", "Banana", "Cherry", "Apple", "Banana", "Cherry"],
    price    =  [1.2, 2.0, 0.4, 1.2, 2.0, 0.4])

categorical!(df, :fruit)
df |> @groupby(_.fruit) |> @map({key=key(_), m=mean(_.price)})

The result is exactly as expected.

3×2 DataFrame
│ Row │ key          │ m       │
│     │ Categorical… │ Float64 │
├─────┼──────────────┼─────────┤
│ 1   │ Apple        │ 1.2     │
│ 2   │ Banana       │ 2.0     │
│ 3   │ Cherry       │ 0.4     │

Next, we make the same column contain some missing values.

df = DataFrame(
    fruit    =  ["Apple", missing, "Cherry", "Apple", "Banana", "Cherry"],
    price    =  [1.2, 2.0, 0.4, 1.2, 2.0, 0.4])

categorical!(df, :fruit)
df |> @groupby(_.fruit) |> @map({key=key(_), m=mean(_.price)})

Then, if failed with an error message stating somewhat like this.

ERROR: MethodError: convert(::Type{DataValue{CategoricalString{UInt32}}}, ::CategoricalString{UInt32}) is ambiguous

However, this behavior is weird since String column with similar missing values does not issue any complaints.

df = DataFrame(
    fruit    =  ["Apple", missing, "Cherry", "Apple", "Banana", "Cherry"],
    price    =  [1.2, 2.0, 0.4, 1.2, 2.0, 0.4])

df |> @groupby(_.fruit) |> @map({key=key(_), m=mean(_.price)})
key      │ m
─────────┼────
"Apple"  │ 1.2
#NA      │ 2.0
"Cherry" │ 0.4
"Banana" │ 2.0

This limitation may have to be fixed, since it is natural that String columns are recoded as Categorical columns before beginning any analysis.

Thanks.

AnselmJeong commented 5 years ago

Similar behavior found on @filter missing values.

df = DataFrame(
    fruit    =  ["Apple", missing, "Cherry", "Apple", "Banana", "Cherry"],
    price    =  [1.2, 2.0, 0.4, 1.2, 2.0, 0.4])

categorical!(df, :fruit)
df |> @filter(!isna(_.fruit)) |> DataFrame

It fails with the same ERROR message:

LoadError: MethodError: convert(::Type{DataValue{CategoricalString{UInt32}}}, ::CategoricalString{UInt32}) is ambiguous

This example works without any error if String colums were not converted into Categorical columns or if there were not any missing values.

The same kind of error persists with simpler types of filtering.

df |> @filter(_.fruit == "Banana") |> DataFrame

Also fails with the same ERROR message.

Conclusively, @filter or @groupby seems not work with Categorical variables with missing value.

Thanks.

greimel commented 5 years ago

Same for

df |> @select(:price)

or

df |> @filter(true)

(with the above DataFrame).

So it seems that the problem occurs when transforming the DataFrame to a Query.jl iterator.

davidanthoff commented 5 years ago

This is actually unrelated to Query.jl, one can get the same error with:

using IteratorInterfaceExtensions

it = getiterator(df)

first(it)

The implementation for that lives in Tables.jl.

The root problem though seems to be that both DataValues.jl and CategoricalArrays.jl add methods to convert that don't play well together.

I'm a bit lost what to do here, maybe @quinnj has an idea? Also CCing @nalimilan, the resident categorical expert. Is this a case that needs special casing in Tables.jl?

nalimilan commented 5 years ago

I haven't been able to reproduce locally (even after updating all packages). Is there anything special I need to do? I have DataValues 0.4.12.

davidanthoff commented 5 years ago

Here is what I'm using:

(repo1) pkg> st
    Status `C:\Users\david\.julia\environments\repo1\Project.toml`
  [a93c6f00] DataFrames v0.19.2
  [e7dc6d0d] DataValues v0.4.12
  [82899510] IteratorInterfaceExtensions v1.0.0

And the code:

julia> using DataFrames, IteratorInterfaceExtensions, DataValues

julia> df = DataFrame(fruit= ["Apple", missing, "Cherry", "Apple", "Banana", "Cherry"], price= [1.2, 2.0, 0.4, 1.2, 2.0, 0.4]);

julia> categorical!(df, :fruit);

julia> it = getiterator(df);

julia> first(it)
ERROR: MethodError: convert(::Type{DataValue{CategoricalString{UInt32}}}, ::CategoricalString{UInt32}) is ambiguous. Candidates:
  convert(::Type{S}, x::T) where {S, T<:(Union{CategoricalString{R}, CategoricalValue{T,R} where T} where R)} in CategoricalArrays at C:\Users\david\.julia\packages\CategoricalArrays\xjesC\src\value.jl:91
  convert(::Type{DataValue{T}}, x::T) where T in DataValues at C:\Users\david\.julia\packages\DataValues\XQWvG\src\scalar\core.jl:31
Possible fix, define
  convert(::Type{DataValue{T<:(Union{CategoricalString{R}, CategoricalValue{T,R} where T} where R)}}, ::T<:(Union{CategoricalString{R}, CategoricalValue{T,R} where T} where R))
Stacktrace:
 [1] macro expansion at C:\Users\david\.julia\packages\Tables\FXXeK\src\tofromdatavalues.jl:102 [inlined]
 [2] iterate(::Tables.DataValueRowIterator{NamedTuple{(:fruit, :price),Tuple{DataValue{CategoricalString{UInt32}},Float64}},Tables.RowIterator{NamedTuple{(:fruit, :price),Tuple{CategoricalArray{Union{Missing, String},1,UInt32,String,CategoricalString{UInt32},Missing},Array{Float64,1}}}}}, ::Tuple{}) at C:\Users\david\.julia\packages\Tables\FXXeK\src\tofromdatavalues.jl:96 (repeats 2 times)
 [3] first(::Tables.DataValueRowIterator{NamedTuple{(:fruit, :price),Tuple{DataValue{CategoricalString{UInt32}},Float64}},Tables.RowIterator{NamedTuple{(:fruit, :price),Tuple{CategoricalArray{Union{Missing, String},1,UInt32,String,CategoricalString{UInt32},Missing},Array{Float64,1}}}}}) at .\abstractarray.jl:342
 [4] top-level scope at REPL[11]:1 

Make sure you load DataValues, otherwise Tables.jl will use a different code path.

nalimilan commented 5 years ago

Still no error with these versions:

(v1.2) pkg> st
    Status `~/.julia/environments/v1.2/Project.toml`
  [6e4b80f9] BenchmarkTools v0.4.2
  [336ed68f] CSV v0.5.0 [`~/.julia/dev/CSV`]
  [324d7699] CategoricalArrays v0.5.5 [`~/.julia/dev/CategoricalArrays`]
  [a93c6f00] DataFrames v0.19.2
  [e7dc6d0d] DataValues v0.4.12
  [82899510] IteratorInterfaceExtensions v1.0.0
  [2dfb63ee] PooledArrays v0.5.2+ [`~/.julia/dev/PooledArrays`]
  [1a8c2f83] Query v0.12.1
  [295af30f] Revise v1.0.2
  [bd369af6] Tables v0.2.6
davidanthoff commented 5 years ago

@nalimilan: here is a reproducer repo that you can run with this: Binder

nalimilan commented 5 years ago

Thanks. So this is indeed a legitimate ambiguity which cannot be fixed separately in either package. I think @quinnj's https://github.com/JuliaLang/julia/pull/32613 should help fixing this, but that won't work for current Julia versions. Maybe we should define that unwrap method in Compat or in DataAPI. But first we should bump that PR so that at least it's merged in master.

Do you think this is a new problem? I still don't understand why I didn't observe it in my original tests.