invenia / Wrangling.jl

Wrangle your data into shape. Deals with Columns and Files and Lags and Cattle.
MIT License
1 stars 0 forks source link

Horizontally concatenate two narrow columns, then unstack on the concatenation #10

Open oxinabox opened 4 years ago

oxinabox commented 4 years ago

(for my reference this was #8 on the initial list of features)

Example by Ian Goddarrd

julia> using DataFrames,DataFramesMeta

julia> df = DataFrame(region = repeat(["North","North","South","South"],2),
             fuel_type = repeat(["gas","coal"],4),
             load = rand(8),
             time = [1,1,1,1,2,2,2,2],
             )
8×4 DataFrame
│ Row │ region │ fuel_type │ load     │ time  │
│     │ String │ String    │ Float64  │ Int64 │
├─────┼────────┼───────────┼──────────┼───────┤
│ 1   │ North  │ gas       │ 0.120039 │ 1     │
│ 2   │ North  │ coal      │ 0.108079 │ 1     │
│ 3   │ South  │ gas       │ 0.377439 │ 1     │
│ 4   │ South  │ coal      │ 0.96102  │ 1     │
│ 5   │ North  │ gas       │ 0.356801 │ 2     │
│ 6   │ North  │ coal      │ 0.259941 │ 2     │
│ 7   │ South  │ gas       │ 0.303486 │ 2     │
│ 8   │ South  │ coal      │ 0.889403 │ 2     │

julia> df = @transform(df,region_fuel_type = :region.*"_".*:fuel_type)
8×5 DataFrame
│ Row │ region │ fuel_type │ load     │ time  │ region_fuel_type │
│     │ String │ String    │ Float64  │ Int64 │ String           │
├─────┼────────┼───────────┼──────────┼───────┼──────────────────┤
│ 1   │ North  │ gas       │ 0.120039 │ 1     │ North_gas        │
│ 2   │ North  │ coal      │ 0.108079 │ 1     │ North_coal       │
│ 3   │ South  │ gas       │ 0.377439 │ 1     │ South_gas        │
│ 4   │ South  │ coal      │ 0.96102  │ 1     │ South_coal       │
│ 5   │ North  │ gas       │ 0.356801 │ 2     │ North_gas        │
│ 6   │ North  │ coal      │ 0.259941 │ 2     │ North_coal       │
│ 7   │ South  │ gas       │ 0.303486 │ 2     │ South_gas        │
│ 8   │ South  │ coal      │ 0.889403 │ 2     │ South_coal       │

julia> unstack(df,:time,:region_fuel_type,:load)
┌ Warning: `T` is deprecated, use `nonmissingtype` instead.
│   caller = compacttype(::Type, ::Int64) at show.jl:39
└ @ DataFrames ~/.julia/packages/DataFrames/0Em9Q/src/abstractdataframe/show.jl:39
2×5 DataFrame
│ Row │ time  │ North_coal │ North_gas │ South_coal │ South_gas │
│     │ Int64 │ Float64⍰   │ Float64⍰  │ Float64⍰   │ Float64⍰  │
├─────┼───────┼────────────┼───────────┼────────────┼───────────┤
│ 1   │ 1     │ 0.108079   │ 0.120039  │ 0.96102    │ 0.377439  │
│ 2   │ 2     │ 0.259941   │ 0.356801  │ 0.889403   │ 0.303486  │

@cbdavis suggests we could extend unstack from DataFrames (might move this issue there. cc @bkamins)

function DataFrames.unstack(df::DataFrame, 
                            rowkey::Union{Symbol,AbstractVector{Symbol}}, 
                            colkeys::AbstractVector{Symbol}, value::Symbol)
    new_colkey = Symbol(join(string.(colkeys), "_"))
    df[!,new_colkey] = [join(x, "_") for x in eachrow(df[:,colkeys])]
    return unstack(df, rowkey, new_colkey, value)
end

unstack(df,:time,[:region, :fuel_type],:load)
bkamins commented 4 years ago

If I understand things correctly you want to unstack on several columns and join their values to form target column names - right? I think it is OK to make a PR/issue to DataFrames.jl with this change if you want this functionality. It is non-breaking so we could add it in 1.x release in the worst case.

oxinabox commented 4 years ago

Yes, kind of like how you can join on multiple columns.

bkamins commented 4 years ago

see https://github.com/JuliaData/DataFrames.jl/issues/2148