rafaqz / DimensionalData.jl

Named dimensions and indexing for Julia arrays and other data
https://rafaqz.github.io/DimensionalData.jl/stable/
MIT License

Feature Requests for DimStack: 1) @rtransform 2) enum as dimension 3) Common dimensions 4) to DataFrame #410

Closed Lincoln-Hannah closed 7 months ago

Lincoln-Hannah commented 1 year ago

1 @rtransform

@rtransform in DataFramesMeta.jl supports row-level expressions that create new columns in an existing DataFrame. Example:

using Chain, DataFrames, DataFramesMeta

@chain begin
     DataFrame(  A = 1:5 )

     @rtransform @astable begin
             :B = 2 * :A
             :C = :B + 2  
             :D = log(  :C )
      end
end

Could a similar macro, @layer_transform, add new layers to an existing DimStack? The DimStack documentation example has three layers. With such a macro it could be created as:

A       = [1.0 2.0 3.0; 4.0 5.0 6.0]
dimz    = ( X([:a, :b]), Y(10.0:10.0:30.0) )

sXY = @chain begin
           DimStack( DimArray( A, dimz; name=:one)  )

           @layer_transform begin
                  :two     = 2 * :one
                  :three   = 3 * :one
           end
end

2 enum dimensions

Could an enum be used as a dimension? Example:

@enum X a b
dimz = ( X,  Y(10.0:10.0:30.0) )

The advantage of this is an enum can then be used as a structure field type. e.g.

@with_kw  struct  MyStruct
     x1::X
     x2::X
end

The possible values of x1 and x2 are then restricted to the values of the dimension. (A Primary Key - Foreign Key constraint).

3 Automatic joining on common dimensions.

Consider the DimStack above, and another DimStack sX defined over dimension X with layers l1 and l2:

sX = @chain begin
        DimStack( DimArray( [1,2], X([:a, :b]); name=:l1)  )

        @layer_transform begin
           :l2 =  2 * :l1
       end
end

Since sX has dimension X, which is common to sXY, could the layers of sX be used within sXY? Something like:

@chain sXY begin

   @layer_transform begin
          :new_layer   =   sX.l1     +   2
          :new_layer2  =   sX.l2     +  :three
   end

end

And could the layers of sXY be used in sX within aggregation statements, with the result aggregated over dimension Y?

@chain sX  begin
       @layer_transform begin
          :new  =  sum( sXY.one )
          :new2  = maximum( sXY.two )
       end
end

4 to and from DataFrame

The DataFrames stack and unstack functions pivot columns to rows and vice versa. Similarly, could there be a function like

toDimStack(   df::DataFrame,  dimensions )

that takes a DataFrame and a list of columns that become dimensions, with the remaining columns becoming layers? And

toDataFrame(  ds::DimStack )

That converts all dimensions and layers to DataFrame columns with the number of rows being the product of the dimensions.

rafaqz commented 1 year ago

Thanks, these are interesting suggestions!

There is already a DimTable object and DataFrame(dimstack) should just work already. It will wrap dimensions and layers of different sizes to match the length of the largest layer. Converting dataframes back to stacks is not so easy yet (see #327), although I think DimArray will automatically reshape column vectors for you based on the dims argument lengths.
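As a rough sketch of the conversion described above, assuming DimensionalData.jl and DataFrames.jl are both installed: the layer `:two` and the variable names `da_one`, `da_two` and `st` are made up for illustration, and the exact column layout may differ between package versions.

```julia
using DimensionalData, DataFrames

A    = [1.0 2.0 3.0; 4.0 5.0 6.0]
dimz = (X([:a, :b]), Y(10.0:10.0:30.0))

# Two layers over the same X and Y dimensions (illustrative names).
da_one = DimArray(A, dimz; name=:one)
da_two = DimArray(2 .* A, dimz; name=:two)
st     = DimStack(da_one, da_two)

# Goes through the Tables.jl interface: one row per (X, Y) cell,
# with the dimensions and the layers as columns.
df = DataFrame(st)
```

With a 2×3 stack this should give one row for each of the six cells.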

It should be easier to add layers to dimstacks, and your macro ideas are pretty cool. But personally I would prefer the macro-free syntax was better before diving into macros.

So, as a plan of action:

  1. write out the macro-free code to do everything you are doing above
  2. see where it should be improved and implement it, using NamedTuple/pairs style syntax as much as possible (e.g. defining merge(::NamedTuple, ::AbstractDimStack) would get us a long way; it would be nice if newstack=DimStack((; oldstack..., newlayer=x)) worked).
  3. finally if it is still too verbose, implement these macros.

How does that sound?

(As for enums, I'm not sure how that would work... the dimension object is pretty tightly integrated here, and the fact they are type wrappers is how everything compiles away. There is also the constraint that a dimension used in a DimArray must have a length matching the axis length it represents)

rafaqz commented 1 year ago

Ultimately I would like to extend something like these methods for DimStack, and I've been waiting for it to get into base:

https://github.com/JuliaLang/julia/pull/46453

But we could already make setindex work, and make merge work when mixed with NamedTuple.

Lincoln-Hannah commented 1 year ago

I should have tried DataFrame(dimstack) before writing the first note. It works perfectly, thank you :)

On enums: I can write

using Lazy
@enum Currency AUD USD GBP EUR
dCurrency = @> Currency instances collect X

I wanted a way to have a struct field type linked to a dimension by having similar name and restricting possible values to those of the dimension (PK-FK). This seems a good way. Unless you know something better?
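The enum half of this idea needs only Base Julia. A minimal sketch follows; the struct `Trade` and its field names are hypothetical, and wrapping the collected values in a dimension such as `X(...)` is left out because it depends on DimensionalData.jl.

```julia
@enum Currency AUD USD GBP EUR

# All enum values in declaration order; this is the vector that would be
# wrapped in a dimension.
currencies = collect(instances(Currency))

# Hypothetical struct: fields typed as Currency can only hold enum values,
# which gives the PK-FK style restriction.
struct Trade
    from::Currency
    to::Currency
end

t = Trade(AUD, USD)

# Anything that is not a Currency is rejected at construction time.
rejected = try
    Trade(AUD, :USD)
    false
catch
    true
end
```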

macro-free syntax - suggestions

Allow new layers to be created: s[:four] = 4s[:one]

Shortcut syntax for existing layers (like DataFrame columns): s.one

Orthogonal dimensions: could the dimensions X, Y, Z (and any others) be orthogonal so that

dX = DimArray(  1:2, X(1:2)  )
dY = DimArray(  1:2, Y(1:2)  )
dX  .+   dY  

is equivalent to DimArray( (1:2) .+ (1:2)', (X(1:2),Y(1:2)) )
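For reference, the plain-array half of that equivalence can be checked in Base Julia without any dimension wrappers. This is only a sketch of the underlying broadcast; the names `x`, `y` and `grid` are illustrative, and it says nothing about how DimArray broadcasting itself behaves.

```julia
x = collect(1:2)   # values along a hypothetical X axis
y = collect(1:2)   # values along a hypothetical Y axis

# Transposing y turns it into a 1×2 row, so broadcasting expands the two
# vectors into a 2×2 grid where grid[i, j] == x[i] + y[j].
grid = x .+ y'
```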

More generally for DA1 ... DA4 defined over some combination of X,Y,Z an expression like V = @. DA1 + DA2 * DA3 / DA4 defines V over X,Y,Z.

More difficult: an expression like DimStck.New = sum( V ) aggregates over any dimensions in V but not in DimStck, and repeats values for dimensions in DimStck but not in V.

rafaqz commented 1 year ago

This is imperative:

s[:four] = 4s[:one]

DimensionalData.jl is written largely in a functional style because, besides array indexing and metadata, everything is immutable. This means it will work on GPU, which is a core design goal. But you can't directly change any of the objects like that.

What you can already do is this (I think?):

s2 = merge(s, rebuild(4s[:one]; name=:four))
# or
s2 = merge(s, DimStack((; four=4s[:one])))

But there is probably an easier way.

What we need to move this issue forward is getting your ideas written out as they currently work, so we can point out the real weaknesses in the existing syntax and take small, actionable steps towards improvement. To be very clear, if you want these features you will need to do this work (but it will be very much appreciated by me and other users of this package).

This is one of over 30 packages for me, and bugfixes and core functionality have to take priority over features, so I won't personally have time to write this out until I have a direct need for it.

Lincoln-Hannah commented 1 year ago

Thanks Rafael. I'll have a go.

One last comment (then I'll shut up). Re GPUs - I'm a novice but a common practice seems to be declaring arrays with blank data before populating them. With a DimStack you can set a cell-tuple e.g. s[At(1),At(3)] = (one=10,two=20) but not a layer s[:one] = d1. If it was the other way round, you could create a DimStack with all layers defined by just name and type. Then gradually populate them with s[:layer] = ...

rafaqz commented 1 year ago

No worries, don't shut up, this is a super useful discussion.

I just want to be clear that more comes of this if you put in the time to map out a plan with clear pointers to the current shortfalls than to post big ideas far from the current implementation for me to implement, because I'm really unlikely to have time to think through the design for that. But I can fix merge and setindex to work better for DimStack in a few minutes.

But about the stack. A stack is essentially a NamedTuple of AbstractArrays. You can't set fields of a NamedTuple. But it has to be a NamedTuple rather than a Dict so that indexing like you do there is super fast, and so the whole object can be used as a GPU kernel argument.
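The NamedTuple behaviour can be seen with plain arrays in Base Julia. This is only an analogy for a stack, not DimStack code, and the names `layers`, `layers2` and `layers3` are illustrative.

```julia
# A NamedTuple of arrays, standing in for a stack of layers.
layers = (one = [1.0, 2.0, 3.0],)

# merge builds a new NamedTuple; `layers` itself is left unchanged.
layers2 = merge(layers, (four = 4 .* layers.one,))

# The splatting form constructs the same thing.
layers3 = (; layers..., four = 4 .* layers.one)
```

Adding a layer therefore always means binding a new object, which is why the functional forms return a new stack rather than modifying the old one.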

Arrays are the privileged mutable part of this package. What you really seem to want is a MutableDimStack, that is backed by a Dict rather than a NamedTuple. This will be much slower to index into. Like 100x slower, because the NamedTuple indexing compiles away. But it could be useful to have?

We can instead make the immutable syntax better, like this really should work and is a tiny fix:

s = merge(s, (one=d1,))

Lincoln-Hannah commented 1 year ago

You can't change fields of a NamedTuple, but here (I think) the fields are pointers / references to the underlying Arrays. With your example s2 = merge(s, DimStack((; four=4s[:one]))), if you change a cell in s, the same cell changes in s2. So if you can change one cell in an Array, then why not the whole array (without altering the NamedTuple that points to it)? Will try to set up an example tomorrow, and will look at the things you suggest. Bed time over here :)

rafaqz commented 1 year ago

Well you can always update a whole array with a broadcast if it already exists:

s[:one] .= d1

But this is attempting to change a pointer to point to a different array, by running setindex! on an immutable NamedTuple:

s[:one] = d1

That's just not possible: we need to make a new NamedTuple with a pointer to the new array, which is what merge does.
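The difference between the two forms can be reproduced with a plain NamedTuple of arrays in Base Julia. Again this is an analogy rather than DimStack code; the names `s`, `s2` and `rebind_failed` are illustrative.

```julia
s  = (one = [1, 2, 3],)
s2 = merge(s, (two = [10, 20, 30],))

# In-place broadcast writes into the existing array. Both NamedTuples
# reference that same array, so the change is visible through s2 as well.
s.one .= [7, 8, 9]
shared = s2.one === s.one

# Rebinding the field would change the NamedTuple itself, which is immutable,
# so this throws.
rebind_failed = try
    s.one = [0, 0, 0]
    false
catch
    true
end
```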

Lincoln-Hannah commented 1 year ago

Would you consider allowing broadcast_dims to be applied to Arrays and NamedTuples, with dimension indices coming from the array or tuple values? Something like

using DimensionalData, NamedTupleTools
using DimensionalData: DimArray, broadcast_dims

function LH_Dimension(x) 
    dimname = Symbol(join(x))
    Dict( dimname => x ) |> NamedTuple
end

DimArray(x::Array)      = DimArray( x,      LH_Dimension(x) )
DimArray(x::NamedTuple) = DimArray( [x...], [keys(x)...] |> LH_Dimension )

u = Union{AbstractDimArray,Array,NamedTuple}

LH_broadcast_dims( f, x::u, y::u ) = broadcast_dims( f, DimArray(x), DimArray(y) )

Use case: create a DimArray to hold connections across different Servers and Databases.

Servers = ( Prod = "Prod_string", Dev  = "Dev_string", UAT = "UAT_string" )
DBs     = [:Load, :Rates, :Warehouse] 

Connect(Server,DB) = "$Server $DB"  #would actually be ODBC...

x = LH_broadcast_dims( Connect, Servers, DBs )

x[At(:Prod),At(:Rates)]

The LH_Dimension function is messy because I have to create a name for each dimension, which has to be a variable name in a NamedTuple. (maybe there's a way around this I'm not aware of).

rafaqz commented 7 months ago

Closing this as too long and complicated to be actionable. If you have any single contained feature requests, please write them up one at a time.

Closing for now.