Closed Lincoln-Hannah closed 7 months ago
Thanks, these are interesting suggestions!
There is already a DimTable object, and DataFrame(dimstack) should just work already. It will wrap dimensions and layers of different sizes to match the length of the largest layer. Converting dataframes back to stacks is not so easy yet (see #327), although I think DimArray will automatically reshape column vectors for you based on the dims argument lengths.
It should be easier to add layers to dimstacks, and your macro ideas are pretty cool. But personally I would prefer to improve the macro-free syntax before diving into macros.
So, as a plan of action: merge(::NamedTuple, ::AbstractDimStack) would get us a long way. It would be nice if newstack=DimStack((; oldstack..., newlayer=x)) worked. How does that sound?
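For reference, here is what the proposed syntax already does for an ordinary NamedTuple in Base Julia. This is only a sketch of the analogy: the names oldlayers, x, viasplat and viamerge are made up for illustration, and nothing here shows current DimStack behaviour.

```julia
# Plain Base Julia: a NamedTuple stands in for the layers of a stack.
oldlayers = (one = [1.0, 2.0], two = [3.0, 4.0])
x = [5.0, 6.0]

# Splatting the old fields into a new NamedTuple literal...
viasplat = (; oldlayers..., newlayer = x)

# ...builds the same thing as merging in a one-field NamedTuple.
viamerge = merge(oldlayers, (newlayer = x,))

keys(viasplat)          # (:one, :two, :newlayer)
viasplat == viamerge    # true
```

Both forms return a new NamedTuple and leave oldlayers untouched, which is the behaviour the plan above would extend to stacks.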
(As for enums, I'm not sure how that would work... the dimension object is pretty tightly integrated here, and the fact they are type wrappers is how everything compiles away. There is also the constraint that a dimension used in a DimArray must have a length matching the axis length it represents.)
Ultimately I would like to extend something like these methods for DimStack, and I've been waiting for it to get into Base:
https://github.com/JuliaLang/julia/pull/46453
But we could already make setindex work, and make merge work when mixed with NamedTuple.
DataFrame(dimstack): I should have tried before writing the first note. Works perfectly, thank you :)
On enums: I can write
using Lazy
@enum Currency AUD USD GBP EUR
dCurrency = @> Currency instances collect X
I wanted a way to link a struct field type to a dimension, by giving it a similar name and restricting its possible values to those of the dimension (PK-FK). This seems a good way, unless you know something better?
Allow new layers to be created. s[:four] = 4s[:one]
Shortcut syntax for existing layers (like DataFrame columns) s.one
Orthogonal dimensions: could the dimensions X, Y, Z (and any others) be orthogonal so that
dX = DimArray( 1:2, X(1:2) )
dY = DimArray( 1:2, Y(1:2) )
dX .+ dY
is equivalent to DimArray( (1:2) .+ (1:2)', (X(1:2),Y(1:2)) )
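The right-hand side of that equivalence relies on ordinary Base broadcasting, which can be checked without any dimension objects. A minimal sketch, with x and y as illustrative stand-ins for the X and Y index values:

```julia
x = 1:2    # values along a hypothetical X axis
y = 1:2    # values along a hypothetical Y axis

# Transposing y makes it a 1×2 row, so broadcasting expands the sum to 2×2:
# element [i, j] is x[i] + y[j].
xy = x .+ y'

xy == [2 3; 3 4]    # true
```

The request is for the transpose to be implied by the dimension names, so that an X-only array and a Y-only array combine this way automatically.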
More generally for DA1 ... DA4 defined over some combination of X,Y,Z an expression like
V = @. DA1 + DA2 * DA3 / DA4
defines V over X,Y,Z.
More difficult
An expression like DimStck.New = sum( V ) aggregates over any dimensions in V but not in DimStck, and repeats values for dimensions in DimStck but not in V.
This is imperative:
s[:four] = 4s[:one]
DimensionalData.jl is written largely in a functional style because, besides array indexing and metadata, everything is immutable. This means it will work on GPU, which is a core design goal. But you can't directly change any of the objects like that.
What you can already do is this (I think?):
s2 = merge(s, rebuild(4s[:one]; name=:four))
# or
s2 = merge(s, DimStack((; four=4s[:one])))
But there is probably an easier way.
What we need to move this issue forward is to get your ideas written out in terms of how they currently work, so we can point out the real weaknesses in the existing syntax and take small, actionable steps towards improvement. To be very clear: if you want these features you will need to do this work (but it will be very much appreciated by me and other users of this package).
This is one of over 30 packages for me, and bugfixes and core functionality have to take priority over features, so I won't personally have time to write this out until I have a direct need for it.
Thanks Rafael. I'll have a go.
One last comment (then I'll shut up). Re GPUs: I'm a novice, but a common practice seems to be declaring arrays with blank data before populating them. With a DimStack you can set a cell-tuple, e.g. s[At(1),At(3)] = (one=10,two=20), but not a layer, e.g. s[:one] = d1. If it was the other way round, you could create a DimStack with all layers defined by just name and type, then gradually populate them with s[:layer] = ...
No worries, don't shut up, this is a super useful discussion.
I just want to be clear that more will come of this if you put in the time to map out a plan with clear pointers to the current shortfalls than if you post big ideas far from the current implementation for me to implement, because I'm really unlikely to have time to think through the design for that. But I can fix merge and setindex to work better for DimStack in a few minutes.
But about the stack. A stack is essentially a NamedTuple of AbstractArrays. You can't set fields of a NamedTuple. But it has to be a NamedTuple rather than a Dict so that indexing like you do there is super fast, and so the whole object can be used as a GPU kernel argument.
Arrays are the privileged mutable part of this package. What you really seem to want is a MutableDimStack that is backed by a Dict rather than a NamedTuple. This will be much slower to index into. Like 100x slower, because the NamedTuple indexing compiles away. But it could be useful to have?
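The trade-off can be seen in Base Julia alone. The sketch below uses made-up names (layers_nt, layers_dict, nt_rebind_failed) and plain vectors instead of a real stack. It shows only the mutability difference and makes no attempt to measure the speed difference.

```julia
layers_nt   = (one = zeros(2), two = ones(2))
layers_dict = Dict(:one => zeros(2), :two => ones(2))

# A Dict accepts a new key or a replacement value at any time.
layers_dict[:three] = fill(3.0, 2)

# A NamedTuple does not: rebinding a field throws an error.
nt_rebind_failed = try
    layers_nt.one = ones(2)
    false
catch
    true
end

# The NamedTuple's type records every field name and every field type,
# so a lookup like `layers_nt.one` can be resolved at compile time.
typeof(layers_nt)
```

A Dict only knows its keys at run time, which is why a Dict-backed stack would give up the compile-time lookup.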
We can instead make the immutable syntax better. For example, this really should work, and is a tiny fix:
s = merge(s, (one=d1,))
You can't change fields of a NamedTuple but here (I think) the fields are pointers / references to the underlying Arrays.
With your example s2 = merge(s, DimStack((; four=4s[:one]))), if you change a cell in s, the same cell changes in s2.
So if you can change one cell in an Array, then why not the whole array (without altering the NamedTuple that points to it)?
Will try to set up an example tomorrow, and will look at the things you suggest. Bedtime over here :)
Well, you can always update a whole array with a broadcast if it already exists:
s[:one] .= d1
But this is attempting to change a pointer to point to a different array, by running setindex! on an immutable NamedTuple:
s[:one] = d1
That's just not possible: we need to make a new NamedTuple with a pointer to the new array, which is what merge does.
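The same distinction can be reproduced with a plain NamedTuple of vectors in Base Julia. This is a sketch of the analogy only: a, nt, nt2 and nt3 are illustrative names, and no DimStack is involved.

```julia
a  = [1, 2, 3]
nt = (one = a, two = [4, 5, 6])

# merge builds a new NamedTuple, but the arrays themselves are shared.
nt2 = merge(nt, (three = [7, 8, 9],))
nt2.one === nt.one    # true: the same array object

# Broadcast assignment overwrites the contents in place, so both see it.
nt.one .= [10, 20, 30]
nt2.one               # [10, 20, 30]

# Pointing :one at a different array means building another NamedTuple;
# the original `nt` is left untouched.
nt3 = merge(nt, (one = [0, 0, 0],))
nt.one                # still [10, 20, 30]
nt3.one               # [0, 0, 0]
```

So the contents of an existing layer can always be overwritten, while swapping in a different array requires a new container.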
Would you consider allowing broadcast_dims to be applied to Arrays and NamedTuples, with the dimension indices coming from the array or tuple values? Something like
using DimensionalData, NamedTupleTools
using DimensionalData: DimArray, broadcast_dims
function LH_Dimension(x)
    dimname = Symbol(join(x))
    Dict( dimname => x ) |> NamedTuple
end
DimArray(x::Array) = DimArray( x, LH_Dimension(x) )
DimArray(x::NamedTuple) = DimArray( [x...], [keys(x)...] |> LH_Dimension )
u = Union{AbstractDimArray,Array,NamedTuple}
LH_broadcast_dims( f, x::u, y::u ) = broadcast_dims( f, DimArray(x), DimArray(y) )
Use case: create a DimArray to hold connections across different Servers and Databases.
Servers = ( Prod = "Prod_string", Dev = "Dev_string", UAT = "UAT_string" )
DBs = [:Load, :Rates, :Warehouse]
Connect(Server,DB) = "$Server $DB" #would actually be ODBC...
x = LH_broadcast_dims( Connect, Servers, DBs )
x[At(:Prod),At(:Rates)]
The LH_Dimension function is messy because I have to create a name for each dimension, which has to be a valid field name in a NamedTuple (maybe there's a way around this that I'm not aware of).
Closing this as too long and complicated to be actionable. If you have any single contained feature requests, please write them up one at a time.
Closing for now.
1 @rtransform
@rtransform in DataFramesMeta.jl supports row level expressions that create new columns in an existing dataframe. Example
Could a similar macro @layer_transform add new layers to an existing DimStack? The DimStack documentation example has three layers. With such a macro it could be created as:
2 enum dimensions
Could an enum be used as a dimension? Example
The advantage of this is an enum can then be used as a structure field type. e.g.
The possible values of x1 and x2 are then restricted to the values of the dimension. (A Primary Key - Foreign Key constraint).
3 Automatic joining on common dimensions.
Consider the DimStack above, and another DimStack sX defined over dimension X with layers l1 and l2.
Since sX has dimension X, which is common to sXY, could the layers of sX be used within sXY? Something like:
And could the layers of sXY be used in sX within aggregation statements, as the result would be aggregated over dimension Y?
4 to and from DataFrame
DataFrame stack and unstack pivot columns to rows and vice versa. Similarly, could there be a function that takes a DataFrame and a list of columns that become dimensions, with the remaining columns becoming layers? And one that converts all dimensions and layers to DataFrame columns, with the number of rows being the product of the dimension lengths?