queryverse / Query.jl

Query almost anything in julia
Other
396 stars 50 forks source link

`@where` does not preserve CategoricalArray #157

Open alyst opened 7 years ago

alyst commented 7 years ago

Julia v0.6, DataFrames, CategoricalArray and Query masters

julia> using DataFrames, Query

julia> df = DataFrame(a = categorical(["a", "b", "a", "c"]))
4×1 DataFrames.DataFrame
│ Row │ a │
├─────┼───┤
│ 1   │ a │
│ 2   │ b │
│ 3   │ a │
│ 4   │ c │

julia> typeof(df[:a])
CategoricalArrays.CategoricalArray{String,1,UInt32,String,Union{}}

julia> dff = df |> @where(_.a != "b") |> DataFrame
3×1 DataFrames.DataFrame
│ Row │ a │
├─────┼───┤
│ 1   │ a │
│ 2   │ a │
│ 3   │ c │

julia> typeof(dff[:a])
Array{CategoricalArrays.CategoricalValue{String,UInt32},1}

IIUC Array{CategoricalValue} is suboptimal in terms of storage. Also CategoricalArrays API like levels(dff[:a]) is not available.

nalimilan commented 7 years ago

Indeed Array{CategoricalValue} isn't supposed to be used. It's kind of annoying that we don't have a way to specify that a given element type should always be stored in a given container type.

alyst commented 7 years ago

I have some minimal solution to this, by it still involves moving+extending Enumerable abstract type definitions from QueryOperators.jl to TableTraits.jl and using them from IterableTables.jl. So it's pretty invasive. I can open PRs, but I suppose the core JuliaData contributors already have some strategy how to fix this and the other design problems (e.g. IterableTables.jl being dependent on every data-related package).

davidanthoff commented 7 years ago

The proper way to implement this would be to make sure that at collection into a DataFrame any column of type CategoricalValue uses the proper container type. Essentially just another elseif case here. Having said that, I think anything like that would require a bit more thinking about larger implications. I was a couple of times close to just adding this, but in the end I'm just not yet sure whether I want this in the iterable tables interface or not (main tradeoff is complexity, I'm trying to keep iterable tables as simple as possible, and I'm not sure this would still fall into that category).

davidanthoff commented 7 years ago

It's kind of annoying that we don't have a way to specify that a given element type should always be stored in a given container type.

YES, I completely agree. I think it would be absolutely fantastic if there was a standard way in base to say "for element type T, the default array type should be X", and if things like [a,b,c] (where a, b and c are instances of T) constructed an X, and and fill etc. all used that. I think a simple function like getdefaultarraytype(::Type{T}) where T = Array{T} in base would probably do it, where custom Ts could add a method and the various base array creation methods would rely on it.

I suggested something like that in https://github.com/JuliaLang/julia/issues/22630, but probably doomed all chances of success of that proposal by embedding it in a crazy rename idea ;) I think a much less ambitious proposal along the lines of the above paragraph would still be a great idea.

nalimilan commented 7 years ago

Maybe we should just override similar(::Array, ::Type{<:CategoricalValue})?

davidanthoff commented 6 years ago

Where would that be picked up?

I guess I was so far mostly worrying about [a,b,c] when a, b and c are of type CategoricalValue (or in my case DataValue) and how one can intercept that. I think we can already intercept CategoricalValue[a,b,c], and this line seems key to intercepting all cases.

I kind of feel that if someone is asking for an Array specifically, he/she should get that specific type, right? Not sure...

nalimilan commented 6 years ago

I kind of feel that if someone is asking for an Array specifically, he/she should get that specific type, right? Not sure...

In this case, the caller wouldn't require an Array, just an object "similar" to the input Array, so I'd say it wouldn't be absurd to return a CategoricalArray if the element type is a categorical value.