Provide an equivalent to dplyr's summarise function

floswald commented 7 years ago

hi @davidanthoff are you looking for some help with that? the cost for you is of course that you would lose some time managing and explaining what I do. of course assuming that this does not require a fundamental change of your setup. cheers

davidanthoff commented 7 years ago

I'm mainly still struggling with the design for this... But I kind of have an idea now, and it would be great to hear your feedback on it. And yes, I could generally use help with the whole package/ecosystem, so that would be most welcome and I would not mind at all explaining things. Are you coming to JuliaCon this year? That might be an efficient way.

Ok, here is my idea. For the case where you want to summarize a grouped result, you can today use the following syntax:

@from i in df begin
    @group i by i.state into g
    @select {age=mean(map(j->j.age,g)), oldest=maximum(map(j->j.age,g))}
    @collect DataFrame
end

Some of these reduction functions in base allow you to pass a function that transforms things before the reduction happens, e.g. there is mean(f::Function, v), so one could rewrite the @select statement as

    @select {age=mean(j->j.age,g), oldest=maximum(map(j->j.age,g))}

That is a bit better, but many of the reduction functions in base don't support this, and I find it still clunky.

I think there are two ways out of this:

1. @JeffBezanson mentioned in julialang/julia#21875 that .a is shorthand for x->x.a in some languages. That, in combination with a systematic attempt to add methods that take a transformation function to all reduction functions in base (like the mean function), would generally allow us to write things like @select {age=mean(.age, g), oldest=maximum(.age, g)}. I think that would be nice. But .a doesn't parse right now, so this would require a change in base.
2. Another idea (that I think I saw @JeffBezanson make somewhere else) would be that x..a is shorthand for map(i->i.a, x). So that would allow us to write things like @select {age=mean(g..age), oldest=maximum(g..age)}. I think I like that syntax best, actually. The benefit of this one is that a..b parses currently, so we could implement that transformation as part of the @from macro, i.e. we could do this right now. And maybe someday that syntax will make its way into base, which would be great, of course.
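As an illustration of the second idea, here is a minimal sketch in plain Julia 1.x of how an `a..b` expression could be rewritten by a macro. It is not Query.jl's actual implementation: the names `rewrite_dotdot` and `@dotdot` are hypothetical, and the group `g` is stood in for by a plain vector of named tuples.

```julia
using Statistics  # provides mean on Julia 1.x

# Hypothetical helper: recursively rewrite `x..a` into `map(i -> i.a, x)`.
function rewrite_dotdot(ex)
    ex isa Expr || return ex
    args = map(rewrite_dotdot, ex.args)
    # `x..a` parses as a call to the `..` operator with two arguments.
    if ex.head == :call && length(args) == 3 &&
       args[1] == Symbol("..") && args[3] isa Symbol
        i = gensym("i")
        field = QuoteNode(args[3])
        return :(map($i -> getproperty($i, $field), $(args[2])))
    end
    return Expr(ex.head, args...)
end

# Hypothetical macro that applies the rewrite to a whole expression.
macro dotdot(ex)
    esc(rewrite_dotdot(ex))
end

# Stand-in for a group: a vector of named tuples.
g = [(age = 30, state = "CA"), (age = 50, state = "CA")]

@dotdot mean(g..age)      # 40.0
@dotdot maximum(g..age)   # 50
```

Because the rewrite only touches the expression tree, the same walk could in principle run inside a larger macro such as @from before the rest of the query is translated.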

I thought for a while that the story for summarizing a whole query is more tricky. I could add a @summarize statement that can be used inside a query, but in general I'm not happy with how I'm terminating queries these days, and this would be another statement that terminates a query, and I'm just not super happy with the whole design there. But I just merged an initial version of a piping syntax (the goal is to add another full dplyr-like user API to the package eventually). And with this piping syntax I think one could move the summarize functionality outside of the query itself, and could have something like this:

df |> @query(i, begin
        @select i
    end) |>
    @summarize(age=mean(age), oldest=maximum(age))

This whole piping syntax works already on master; the only thing missing is the @summarize macro here. The caveat would be that @summarize only works with table sources. My thinking right now is that in general I'll make the dplyr interface work only with tables, and only the existing LINQ-style interface would support all the other, non-table sources and targets that it supports right now.

One general question is what @summarize returns. I was thinking right now to just return a named tuple. But that is different from dplyr, where it returns a table with one row. The equivalent in my system would be that @summarize would return an iterator with one row that returns a named tuple. My gut feeling is that returning just a named tuple is easier, but I'm not sure...

davidanthoff commented 7 years ago

For the grouped summary story, see #121.

floswald commented 7 years ago

hi @davidanthoff so i finally got round to looking at this. on the upside: i'm able to run the tests. on the downside, I don't even know where to start with the code. :-( It's very advanced with metaprogramming, maybe a bit too much for me - I'd like to learn but not sure it's worth your time, as I said. (not at JuliaCon unfortunately)

So I find both the piping and your solution number 2 above appealing. number 2 seems the right thing for summaries within a query. So just to get the main setup right:

you first construct an expression with a macro, for example @from.
then you call translate_query on the expression body so constructed. I suspect this is where you unpick the expression and figure out what to do?
so what you did in #121 is to add a..b to that translation phase. i'm sure there's a good reason why there are 7 phases.
is that enough? I mean in terms of making this work, is that all that needs to be done? (amazing!)

bramtayl commented 7 years ago

I wonder if it wouldn't be possible to turn a vector of namedtuples into a named tuple of vectors.

bramtayl commented 7 years ago

I think I've got a solution here

bramtayl commented 7 years ago

Figuring out some story about ungroup would be useful too.

davidanthoff commented 7 years ago

> I wonder if it wouldn't be possible to turn a vector of namedtuples into a named tuple of vectors.

Hm, that would imply yet another allocation, right? Unless it would be a named tuple of vector views...

> Figuring out some story about ungroup would be useful too.

That should be easily done via a nested @from clause that flattens groups.

bramtayl commented 7 years ago

I guess what you really might want is generators (row.name for row in i).
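To make the trade-off concrete, here is a small sketch in plain Julia 1.x using Base named tuples. The helper `columns` is hypothetical and is not part of Query.jl or LazyQuery; it only illustrates the two options discussed above.

```julia
using Statistics  # provides mean on Julia 1.x

rows = [(a = 1, b = 1.0), (a = 1, b = 2.0), (a = 2, b = 3.0)]

# Hypothetical helper: vector of named tuples -> named tuple of vectors.
# This allocates one new vector per column.
function columns(rows)
    names = keys(first(rows))
    return NamedTuple{names}(map(n -> getproperty.(rows, n), names))
end

cols = columns(rows)
mean(cols.b)                  # 2.0

# Generator: walks the rows lazily, without building an intermediate vector.
mean(row.b for row in rows)   # 2.0
```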

bramtayl commented 7 years ago

I tried out the generators in LazyQuery; they seem to be working fine on master. I was hoping you could help me out with the ungroup. Say for example I use LazyQuery to do something like this:

@chain @evaluate begin
    DataFrame(
        a = [1, 1, 2, 2],
        b = [1, 2, 3, 4],
        c = [4, 3, 2, 1]
    )
    query(it)
    @group it a
    @make_from it a d = collect(b) / sum(b) e = collect(c) / sum(c)
    collect(it, DataFrame)
end

I end up with nested vectors in d and e. How would I ungroup them? If you want you can send back query syntax and I can macroexpand my way through it.

davidanthoff commented 7 years ago

Something like this:

@from i in df begin
    @group i by i.a into g
    @select {g.key, some_avg = mean(j->j.b, g), group = g} into i
    @from j in i.group
    @select {i.key, i.some_avg, j.b, j.c}
    @collect DataFrame
end

Not a perfect match, but it shows the general idea.

One problematic aspect here is that this won't work if you have more than one vector in the group that you want to unroll. I.e. in my example, only group is a vector that I want to unroll (but it is a vector of named tuples). In your example you have two vectors you want to unroll (d and e), and that doesn't work with the machinery we have right now.

bramtayl commented 7 years ago

Right, so then the solution would be to take non-grouping columns, zip them back up into a vector of named tuples, unnest, then unzip them out again?
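In plain Julia 1.x, outside of any query machinery, the zip-and-unnest step could look roughly like the sketch below. The `groups` vector is a hand-built stand-in for the grouped result of the LazyQuery example above, not actual output from either package.

```julia
# Stand-in for the grouped result: one named tuple per group, with two
# vector-valued columns d and e of equal length.
groups = [
    (a = 1, d = [1, 2] / 3, e = [4, 3] / 7),
    (a = 2, d = [3, 4] / 7, e = [2, 1] / 3),
]

# Zip the vector columns together and flatten to one row per element.
flat = [(a = g.a, d = d, e = e) for g in groups for (d, e) in zip(g.d, g.e)]

length(flat)   # 4
```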

davidanthoff commented 7 years ago

Yeah... Not ideal...

bramtayl commented 7 years ago

Ok, well, I've decided to add an additional DataFrames backend for LazyQuery to fully support grouped operations. It seems to me that the namedtuples row approach isn't really compatible with grouping/ungrouping.

queryverse / Query.jl

Provide an equivalent to dplyr's summarise function #84