Handy functions from dplyr

bramtayl commented 5 years ago

Going through the dplyr manual, I see several functions that might be worth adding to Query.jl. These include sample, bind_rows, bind_cols, rename, mutate, slice, n, and top_n. I'm not sure they are all necessary, but some of them might be nice, and I could pitch in here.

bramtayl commented 5 years ago

Oh, and all the different joins too.

davidanthoff commented 5 years ago

YES! I think that is actually the area where we could add the most value right now to Query.jl.

I have thought a lot about mutate and to some degree select, and not at all about the others. Here is my current thinking:

First, I think we should try to implement all the mutate and select variants in the front end only. It should be feasible for all of them to end up as @map calls under the hood, so we wouldn't have to add anything to QueryOperators.jl or do any work on the backends.

Then, as a first step, we could probably add features like that as new functions that manipulate NamedTuples, so that they can be used from within @map, before we start adding helpers like @mutate and @select.

I think for starters, a type-stable merge function for NamedTuples would go a long way. Say merge((a=1,b=2),(c=3,))==(a=1,b=2,c=3). Once we have that, we could add some syntax to {} to make it easier to use. For example, {a..., b..., x=3} could be translated to merge(a, b, (x=3,)) in the various Query.jl macros.
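
To make the merge idea concrete, here is a minimal sketch. It assumes Julia 1.0 or later, where Base already provides merge for the built-in NamedTuple type; the variable names and the `total` column are only illustrative, and the `{...}` expansion shown in the comment is the proposal from above, not something Query.jl is known to do.

```julia
# Assumes Julia >= 1.0, where Base.merge accepts NamedTuples.
left = (a = 1, b = 2)
right = (c = 3,)                  # the trailing comma makes this a one-field NamedTuple

merged = merge(left, right)       # (a = 1, b = 2, c = 3)

# What the proposed {left..., right..., x = 3} syntax might expand to
# (hypothetical translation, written out by hand here):
expanded = merge(left, right, (x = 3,))

# A mutate-style step expressed as a plain map over rows:
# each row keeps its existing fields and gains a computed one.
rows = [(a = 1, b = 2), (a = 3, b = 4)]
mutated = map(r -> merge(r, (total = r.a + r.b,)), rows)
```

If a front-end mutate really can be lowered to something like the last line, it would only need @map plus a merge helper, which matches the "front end only" idea above.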

Another area would be selecting subsets of columns. We could either have something like startswith((foo1=1, bar=2, foo2=3), :foo)==(foo1=1,foo2=3), or something like (foo1=1, bar=2, foo2=3)[startswith(:foo)]==(foo1=1,foo2=3). I'm not sure which of these is better. In a query it might look like @map(startswith(_, :foo)) or @map(_[startswith(:foo)]). I think I like the first one better, but not sure... The second approach would be more in line with this, which probably would also be worthwhile... In general I think we need a lot more features to select columns, but we probably should iterate a bit with various designs?
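
A rough sketch of the first variant, again assuming Julia 1.0 or later. The name `select_startswith` is hypothetical (chosen here to avoid overloading Base.startswith) and is not an existing Query.jl function. Note that this version filters the field names at run time, so it is probably not type-stable as written; a real implementation would likely need a generated function or similar.

```julia
# Hypothetical helper: keep only the fields whose name starts with `prefix`.
function select_startswith(nt::NamedTuple, prefix::Symbol)
    # keys(nt) is a tuple of Symbols; filter on a tuple returns a tuple.
    keep = filter(k -> startswith(String(k), String(prefix)), keys(nt))
    # NamedTuple{names}(nt) builds a new NamedTuple with just those fields.
    return NamedTuple{keep}(nt)
end

row = (foo1 = 1, bar = 2, foo2 = 3)
selected = select_startswith(row, :foo)   # (foo1 = 1, foo2 = 3)
```

Inside a query this would correspond to something like @map(select_startswith(_, :foo)).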

Maybe as a first step I should create queryverse/NamedTupleHelpers.jl, where we could play with some of these methods, and where they could have their home?

Handy functions from dplyr #192