Open davidanthoff opened 7 years ago
I'm mainly still struggling with the design for this... But I kind of have an idea now, and it would be great to hear your feedback on that. And yes, I could generally use help with the whole package/ecosystem, so that would be most welcome and I would not mind at all explaining things. Are you coming to JuliaCon this year? That might be an efficient way.
Ok, here is my idea. For the case where you want to summarize a grouped result, you can today use the following syntax:
@from i in df begin
    @group i by i.state into g
    @select {age=mean(map(j->j.age,g)), oldest=maximum(map(j->j.age,g))}
    @collect DataFrame
end
Some of these reduction functions in base allow you to pass a function that transforms things before the reduction happens, e.g. there is mean(f::Function, v), so one could rewrite the @select statement as
@select {age=mean(j->j.age,g), oldest=maximum(map(j->j.age,g))}
That is a bit better, but many of the reduction functions in base don't support this, and I find it still clunky.
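To make the difference between the two forms concrete, here is a minimal sketch outside of any query. The vector `g` is a hypothetical stand-in for one group (a plain vector of named tuples, not an actual Query.jl grouping), and the variable names are made up for illustration.

```julia
using Statistics  # provides mean; maximum lives in Base

# Hypothetical stand-in for one group `g`: a vector of named tuples.
g = [(state = "CA", age = 31), (state = "CA", age = 45), (state = "CA", age = 62)]

# Form 1: materialize the column with map, then reduce.
age_via_map = mean(map(j -> j.age, g))

# Form 2: pass the transformation function directly to the reduction.
age_via_f = mean(j -> j.age, g)
oldest = maximum(j -> j.age, g)
```

Both forms give the same result; the second one just skips the intermediate vector that `map` allocates.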
I think there are two ways out of this:
1. .a is shorthand for x->x.a in some languages. That in combination with a systematic attempt to add methods that take a transformation function to all reduction functions in base (like the mean function) would generally allow us to write things like @select {age=mean(.age, g), oldest=maximum(.age, g)}. I think that would be nice. But .a doesn't parse right now, so this would require a change in base.
2. x..a is shorthand for map(i->i.a, x). So that would allow us to write things like @select {age=mean(g..age), oldest=maximum(g..age)}. I think I like that syntax best, actually. Benefit of this one is that a..b parses currently, so we could implement that transformation as part of the @from macro, i.e. we could do this right now. And maybe someday that syntax will make its way into base, which would be great, of course.
I thought for a while that the story for summarizing a whole query is more tricky. I could add a @summarize statement that can be used inside a query, but in general I'm not happy how I'm terminating queries these days, and this would be another statement that terminates a query, and I'm just not super happy with the whole design there. But, I just merged an initial version of a piping syntax (the goal is to add another full dplyr like user API to the package eventually). And with this piping syntax I think one could move the summarize functionality outside of the query itself, and could have something like this:
df |> @query(i, begin
    @select i
end) |>
@summarize(age=mean(age), oldest=maximum(age))
This whole piping syntax works already on master, the only thing missing is the @summarize macro here. The caveat would be that @summarize only works with table sources. My thinking right now is that in general I'll make the dplyr interface to only work with tables, and only the existing LINQ style interface would support all the other, non table sources and targets that it supports right now.
One general question is what @summarize returns. I was thinking right now to just return a named tuple. But that is different from dplyr, where it returns a table with one row. The equivalent in my system would be that @summarize would return an iterator with one row that returns a named tuple. My gut feeling is that returning just a named tuple is easier, but I'm not sure...
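The two return-type options can be compared with a small sketch. `summarize_sketch` below is a hypothetical plain function, not the proposed @summarize macro and not part of Query.jl; it assumes the table is a vector of named tuples and that each keyword maps an output name to a `column => reduction` pair.

```julia
using Statistics  # for mean

# Hypothetical helper, not part of Query.jl. Each keyword maps an output
# name to `column => reduction`; the result is a single named tuple.
function summarize_sketch(rows; specs...)
    map(values(specs)) do spec
        # Feed the chosen column lazily into the chosen reduction.
        spec.second(getproperty(r, spec.first) for r in rows)
    end
end

people = [(name = "a", age = 31), (name = "b", age = 45), (name = "c", age = 62)]

# Option A: return the named tuple directly.
result = summarize_sketch(people; age = :age => mean, oldest = :age => maximum)

# Option B (dplyr-like): wrap it as a one-row table, here a one-element vector.
one_row = [result]
```

With option A the caller can write `result.age` right away; with option B the result can be piped into anything that expects a table, at the cost of one extra `first(...)` to get at the values.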
For the grouped summary story, see #121.
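The x..a rewrite from option 2 above could look roughly like the following expression walk. This is only a sketch under the assumption that `g..age` parses as a call to the `..` operator; `rewrite_dotdot` is a hypothetical name and this is not how the @from macro is actually implemented.

```julia
using Statistics  # for mean in the usage example

# Hypothetical sketch, not Query.jl's translation code.
# Replace every `x..name` in an expression with `map(row -> row.name, x)`.
function rewrite_dotdot(ex)
    ex isa Expr || return ex                  # symbols and literals pass through
    args = map(rewrite_dotdot, ex.args)       # rewrite nested expressions first
    if ex.head == :call && length(args) == 3 &&
       args[1] == Symbol("..") && args[3] isa Symbol
        row = gensym("row")                   # fresh name to avoid capturing variables
        getter = :($row -> getproperty($row, $(QuoteNode(args[3]))))
        return :(map($getter, $(args[2])))
    end
    return Expr(ex.head, args...)
end

# Usage: rewrite a quoted expression, then evaluate it at top level.
g = [(age = 31,), (age = 45,), (age = 62,)]
rewritten = rewrite_dotdot(:(mean(g..age)))
average_age = eval(rewritten)
```

A macro would call the same function on its argument and return the escaped result, so the rewrite would happen at macro expansion time instead of through `eval`.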
hi @davidanthoff so i finally got round to looking at this. on the upside: i'm able to run the tests. on the downside, I don't even know where to start with the code. :-( It's very advanced with metaprogramming, maybe a bit too much for me - I'd like to learn but not sure it's worth your time, as I said. (not at JuliaCon unfortunately)
So I find both the piping and your solution number 2 above appealing. Number 2 seems the right thing for summaries within a query. So just to get the main setup right:
1. @from.
2. translate_query on the expression body so constructed. I suspect this is where you unpick the expression and figure out what to do?
3. a..b to that translation phase. I'm sure there's a good reason why there are 7 phases.
I wonder if it wouldn't be possible to turn a vector of named tuples into a named tuple of vectors.
Figuring out some story about ungroup would be useful too.
> I wonder if it wouldn't be possible to turn a vector of named tuples into a named tuple of vectors.
Hm, that would imply yet another allocation, right? Unless it would be a named tuple of vector views...
> Figuring out some story about ungroup would be useful too.
That should be easily done via a nested @from clause that flattens groups.
I guess what you really might want is generators (row.name for row in i).
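The allocation trade-off discussed above can be sketched in plain Julia. Both helper names are hypothetical, and the sketch assumes a non-empty vector of named tuples that all share the same fields.

```julia
rows = [(a = 1, b = 10), (a = 2, b = 20), (a = 3, b = 30)]

# Eager version: allocates one new vector per field.
function columns_eager(rows)
    fields = keys(first(rows))   # assumes `rows` is non-empty
    NamedTuple{fields}(map(n -> [getproperty(r, n) for r in rows], fields))
end

# Lazy version: one generator per field, nothing is copied up front.
function columns_lazy(rows)
    fields = keys(first(rows))
    NamedTuple{fields}(map(n -> (getproperty(r, n) for r in rows), fields))
end

eager = columns_eager(rows)
lazy = columns_lazy(rows)

# Reductions consume the generators directly, without an intermediate vector.
total_b = sum(lazy.b)
```

The lazy version keeps the row-oriented storage and only walks it when a reduction asks for values, which is the generator idea in the comment above.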
I tried out the generators in LazyQuery, and they seem to be working fine on master. I was hoping you could help me out with the ungroup. Say for example I use LazyQuery to do something like this:
@chain @evaluate begin
    DataFrame(
        a = [1, 1, 2, 2],
        b = [1, 2, 3, 4],
        c = [4, 3, 2, 1]
    )
    query(it)
    @group it a
    @make_from it a d = collect(b) / sum(b) e = collect(c) / sum(c)
    collect(it, DataFrame)
end
I end up with nested vectors in d and e. How would I ungroup them? If you want you can send back query syntax and I can macroexpand my way through it.
Something like this:
@from i in df begin
    @group i by i.a into g
    @select {g.key, some_avg = mean(j->j.b, g), group = g} into i
    @from j in i.group
    @select {i.key, i.some_avg, j.b, j.c}
    @collect DataFrame
end
Not a perfect match, but it shows the general idea.
One problematic aspect here is that this won't work if you have more than one vector in the group that you want to unroll. I.e. in my example, only group is a vector that I want to unroll (but it is a vector of named tuples). In your example you have two vectors you want to unroll (d and e), and that doesn't work with the machinery we have right now.
Right, so then the solution would be to take non-grouping columns, zip them back up into a vector of named tuples, unnest, then unzip them out again?
Yeah... Not ideal...
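For reference, the zip / unnest / unzip workaround described above can be written out in plain Julia. This sketch uses made-up values and ordinary comprehensions rather than Query.jl or LazyQuery calls, so it only illustrates the data movement, not either package's API.

```julia
# Hypothetical grouped result: one row per group, with two nested vectors.
grouped = [
    (a = 1, d = [0.25, 0.75], e = [0.5, 0.5]),
    (a = 2, d = [0.4, 0.6], e = [0.8, 0.2]),
]

# Step 1: zip the nested columns of each group into one vector of named tuples.
zipped = [(a = r.a, group = [(d = d, e = e) for (d, e) in zip(r.d, r.e)])
          for r in grouped]

# Step 2: unnest, emitting one flat row per element of each group.
flat = [(a = r.a, d = x.d, e = x.e) for r in zipped for x in r.group]

# Step 3: unzip the flat rows back into columns.
columns = (a = [r.a for r in flat],
           d = [r.d for r in flat],
           e = [r.e for r in flat])
```

Each step allocates a new collection, which is part of why this round trip is not ideal.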
Ok, well, I've decided to add an additional DataFrames backend for LazyQuery to fully support grouped operations. It seems to me that the named tuple row approach isn't really compatible with grouping/ungrouping.
hi @davidanthoff are you looking for some help with that? the cost for you is of course that you would lose some time managing and explaining what I do. of course assuming that this does not require a fundamental change of your setup. cheers