queryverse / Query.jl

Query almost anything in julia

Column types get obliterated by Query.jl #313

Open samuela opened 4 years ago

samuela commented 4 years ago

I have a DataFrame df with correct column types (String, Float64, etc). However after processing with Query.jl I'm getting only Any column types. Here's the suspect code snippet:

brand_code_df =
    df |>
    @groupby((_.Brand, _.Product_Code)) |>
    @map({
        Brand = key(_)[1],
        Product_Code = key(_)[2],
        WAP = sum(_.Unit_Price .* _.Units) / sum(_.Units),
        WAC = sum(_.Unit_Cost .* _.Units) / sum(_.Units),
        total_sales = sum(_.Sales),
        gross_margin = sum(_.Margin),
        GMP = sum(_.Margin) / sum(_.Sales) * 100,
        total_code_units = sum(_.Units),
        weight = sum(_.Units) / brand_total_units[key(_)[1]],
        unique_prices = unique_prices(_),
    }) |>
    DataFrame

Now, brand_code_df will have only Any column types.

OTOH I've found that doing ... |> collect |> DataFrame does in fact retain the correct column types.
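
A minimal sketch of that comparison, assuming a small made-up table (the `Brand` and `Units` values below are hypothetical, not my real data). It runs the same query twice, once piped straight into DataFrame and once through collect first, and prints the column element types of each result so the two can be compared:

```julia
using DataFrames
using Query

# Hypothetical sample data, only for illustration.
df = DataFrame(Brand = ["a", "a", "b"], Units = [1, 2, 3])

# A grouped query similar in shape to the one above, but much smaller.
q = df |>
    @groupby(_.Brand) |>
    @map({Brand = key(_), total_units = sum(_.Units)})

# Materialize directly, and via an intermediate Vector of NamedTuples.
direct = q |> DataFrame
collected = q |> collect |> DataFrame

# Print the element type of every column for both variants.
println(eltype.(eachcol(direct)))
println(eltype.(eachcol(collected)))
```

Whether the direct variant degrades to Any on this tiny example may depend on the Julia and package versions, so no output is shown here.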

davidanthoff commented 4 years ago

Is there a chance that you could a) post a short snippet that creates a DataFrame with the correct columns and just 1-2 rows of sample data, literally something like DataFrame(Brand=["asdf", "lij"], Product_Code=[3, 4]), so that I can reproduce this, and b) post the code for unique_prices?

samuela commented 4 years ago

Hey @davidanthoff ! Yeah, let me see if I can come up with some mock data that have the same effect...

extradosages commented 3 years ago

I've observed this when using @mutate.

davidanthoff commented 3 years ago

@extradosages @mutate uses @map under the hood. Any more data you could provide to replicate this would be helpful.

i-aki-y commented 3 years ago

Hi, @davidanthoff I encountered the same problem. I hope this small example helps you somewhat.

julia> using DataFrames
julia> using Query

julia> struct Item
           value::Union{Missing, Float64}
       end

julia> df = DataFrame(:x => [Item(1.0)])
1×1 DataFrame
 Row │ x
     │ Item
─────┼───────────
   1 │ Item(1.0)

julia> df |> @mutate(y = _.x.value) |> DataFrame
1×2 DataFrame
 Row │ x          y
     │ Any        Any
─────┼────────────────
   1 │ Item(1.0)  1.0

i-aki-y commented 3 years ago

I have examined this problem further.

The problem seems to arise in the return type estimation of the map function defined in QueryOperators.

function map(source::Enumerable, f::Function, f_expr::Expr)
    TS = eltype(source)
    T = Base._return_type(f, Tuple{TS,})
    S = typeof(source)
    Q = typeof(f)
    return EnumerableMap{T,S,Q}(source, f)
end

cf. https://github.com/queryverse/QueryOperators.jl/blob/fd7534405a5f2db2d555f4dd9e796205d7711cde/src/enumerable/enumerable_map.jl#L12

Although I'm not sure what Base._return_type does, since it is undocumented, it seems to estimate the return type of the function f that builds a NamedTuple from the argument of @mutate. That estimation fails for some kinds of input.
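
To look at the estimation in isolation, here is a rough sketch that calls the same undocumented Base._return_type directly, without QueryOperators. The struct names ItemA and ItemB are made up for this sketch. It only prints what inference reports and whether that type is concrete; my assumption is that a non-concrete result is what ends up as Any columns, and the exact types printed may differ between Julia versions.

```julia
# Hypothetical structs: one with a Union-typed field, one with a plain field.
struct ItemA
    value::Union{Missing, Int64}
end

struct ItemB
    value::Int64
end

# Same shape as the function that @mutate / @map builds: it returns a NamedTuple.
f = item -> (v = item.value,)

# Base._return_type is internal and undocumented; it asks inference for
# the return type of f given the argument types.
ta = Base._return_type(f, Tuple{ItemA})
tb = Base._return_type(f, Tuple{ItemB})

println(ta, " concrete: ", isconcretetype(ta))
println(tb, " concrete: ", isconcretetype(tb))
```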

This is an example.

using DataFrames
using QueryOperators

struct Item1
    value::Union{Missing, Int64}
end

struct Item2
    value::Int64
end

QueryOperators.map(QueryOperators.query([Item1(1.0)]), item -> (v = item.value, ), :()) |> DataFrame |> println
#1×1 DataFrame
# Row │ v   
#     │ Any 
#─────┼─────
#   1 │ 1

QueryOperators.map(QueryOperators.query([Item2(1.0)]), item -> (v = item.value, ), :()) |> DataFrame |> println
#1×1 DataFrame
# Row │ v     
#     │ Int64 
#─────┼───────
#   1 │     1

Sorry, I'm not sure why this happens, or whether it is a limitation of the language's type inference or some kind of bug. With my limited knowledge, it is difficult for me to investigate the cause any further right now.

Anyway, I hope this will help.

tlamadon commented 3 years ago

I am having the same problem when grouping on multiple columns. However, ... |> collect |> DataFrame works, so thanks for that suggestion!