tomjaguarpaw / haskell-opaleye

Other
601 stars 115 forks source link

Implement ordering within aggregations #153

Closed hesselink closed 8 years ago

hesselink commented 8 years ago

I'm trying to tackle ordering within aggregations. This is relevant for some aggregations like arrayAgg and stringAgg. The postgres documentation states that it is "usually" enough to just aggregate an ordered subquery to get the result you want. However, I just ran into a case where it isn't, and it depends on the data, not the query. Additionally, there's no clear reason visible to me: adding one record messes up the ordering of another. So I've been looking into what postgres recommends as the fail-safe solution: ordering within aggregations.

The SQL syntax for this is something like array_agg(col_name ORDER BY other_col DESC, third_col ASC). I added this to the PrimQuery and Sql types without issue. However, I run into problems near the surface syntax.

Initially I considered adding something like orderAggregate :: Order ? -> Aggregator (Column a) (Column b) -> Aggregator (Column a) (Column b). However, the type variable at the question mark is the problem. Just allowing it to be Column a obviously isn't enough, since you usually want to order by a different column.

I've now got something with the signature orderAggregate :: Order c -> Aggregator c d -> Aggregator c d. This means that you can use all columns, however, the order then applies to all aggregation operations in the Aggregator. I've got this working, and indeed that's what it generates.

Do you have any ideas how I could have this work per aggregation, but still have access to all the columns to use for ordering?

hesselink commented 8 years ago

You can see my code here.

tomjaguarpaw commented 8 years ago

What does it even mean to have each aggregator work with a different ordering? Does it scan the table multiple times?

hesselink commented 8 years ago

I think so, it has to right? I haven't looked at any plans though, just read the documentation.

tomjaguarpaw commented 8 years ago

It's times like these I just want to give up and write my own relational database engine.

tomjaguarpaw commented 8 years ago

What's the problem with orderAggregate :: Order c -> Aggregator c d -> Aggregator c d? I don't see why it would apply to all the columns in the aggregator.

hesselink commented 8 years ago

Say the aggregator is p2 (arrayAgg, arrayAgg) :: Aggregator (Column PGInt4, Column PGInt4) (Column (PGArray PGInt4), (PGArray PGInt4)), for example. If I now apply asc fst to this, which of the aggregations would it order? The only sensible thing to do is both, right?

tomjaguarpaw commented 8 years ago

Oh, I completely misunderstood. In my head you were proposing something like

orderArrayAgg :: Order a -> (a -> Column b) -> Aggregator a (Column b)

Would that work?

[EDIT: garbled the first one]

hesselink commented 8 years ago

What kind of aggregation would that create?

tomjaguarpaw commented 8 years ago

orderArrayAgg o f would aggregate the Column picked out by f into an array whose order is given by o.

hesselink commented 8 years ago

Hmm, ok, so that would mean that something like orderArrayAgg (asc fst) fst :: Aggregator (Column PGInt4, Column PGInt4) (Column (PGArray PGInt4), (PGArray PGInt4)) would do an array aggregate ordered by column 1 on column 1. But what would it do on column two? It says it's an Aggregator for both columns, so does the second one also do an array aggregate? What if I want a different aggregate there? Or a different order?

tomjaguarpaw commented 8 years ago

I don't understand your type. The type should be

orderArrayAgg (asc fst) fst :: Aggregator (Column PGInt4, Column PGInt4) (Column (PGArray PGInt4))
tomjaguarpaw commented 8 years ago

I got my original type wrong. Perhaps that confused you. It should be

orderArrayAgg :: Order a -> (a -> Column b) -> Aggregator a (Column (PGArray b))
tomjaguarpaw commented 8 years ago

(And to answer your question: It doesn't do anything with the second column. The second column is ignored.)

hesselink commented 8 years ago

Ah, I see, then I misunderstood. So it does a combination of aggregation and projection. That's very different to what Opaleye does now, if I understand correctly. Now, if you have a tuple of columns, when you aggregate you'll always get a tuple of aggregated columns back. With this function, you'll aggregate and you'll get a single column. How would you combine these then to get aggregates of both columns? I see that Aggregate has an Applicative instance, will that do the right thing?

tomjaguarpaw commented 8 years ago

That's very different to what Opaleye does now

No, not at all. It's exactly the same. lmap does projection before the aggregation takes place.

I see that Aggregate has an Applicative instance, will that do the right thing?

Indeed.

hesselink commented 8 years ago

Ah yes, I don't think of Profunctor as something I can actually use :) I'll try to implement the function with your signature and see where I end up. It does mean duplicating all aggregators where ordering makes a difference, although currently there are only two, so that's not too bad.

tomjaguarpaw commented 8 years ago

With an aggregator of this form it may be difficult to use p2, p3, pWhatever, etc.. It may turn out easier to use the pattern presented in https://github.com/tomjaguarpaw/haskell-opaleye/blob/master/Doc/Tutorial/TutorialBasicMonomorphic.lhs in the definition of aggregateWidgets.

tomjaguarpaw commented 8 years ago

I don't think of Profunctor as something I can actually use :)

:)

It does mean duplicating all aggregators where ordering makes a difference

Yeah, that's a bit annoying.

hesselink commented 8 years ago

Ah yes, that example is great, and looks exactly like what I imagined when I thought of using the Applicative instance.

hesselink commented 8 years ago

I realized that using lmap and Applicative, I can actually apply different orderings to different parts of the aggregation using my initial code! You do the aggregation on a single column, lmap to the larger type including the column you want to order on, then order the aggregation. Finally you combine many of these together using the Applicative combinators.

So I think I'll keep my signature since it's more compositional and doesn't require duplicating the aggregation function. If you agree, I can send a PR against opaleye master.

tomjaguarpaw commented 8 years ago

What do you mean "then order the aggregation"? I thought this was exacly what we determined didn't work:

it is "usually" enough to just aggregate an ordered subquery to get the result you want. However, I just ran into a case where it isn't

hesselink commented 8 years ago

I'm using my original orderAggregate :: Order a -> Aggregate a b -> Aggregate a b, but with the code mentioned above, I'm able to apply a different ordering to each aggregate (which I thought I couldn't before). So you get something like:

x :: Aggregator (Column a, Column b) (Column (PGArray a), Column (PGArray a))
x = (,) <$> orderAggregate (asc snd) (lmap fst arrayAggGrouped)
        <*> orderAggregate (desc snd) (lmap fst arrayAggGrouped)

This would generate code like:

SELECT array_agg(a ORDER BY b ASC), array_agg(a ORDER BY b DESC)
FROM (SELECT a, b FROM ...)
tomjaguarpaw commented 8 years ago

I see. Can we convince ourselves that this always leads to syntactically valid SQL? For example, what does it mean to apply your orderAggregate to something that contains two separate aggregations?

hesselink commented 8 years ago

What do you mean "two separate aggregations"? If you mean this, it works and applies the order to all aggregations in the Aggregator:

x :: Aggregator (Column a, Column b) (Column (PGArray a), Column (PGArray b))
x = orderAggregate (asc snd) $ p2 (arrayAggGrouped, arrayAggGrouped)

This would generate code like:

SELECT array_agg(a ORDER BY b ASC), array_agg(b ORDER BY b ASC)
FROM (SELECT a, b FROM ...)
hesselink commented 8 years ago

As for producing syntactically (and semantically) value SQL: applying it repeatedly will overwrite the order, just like orderBy will. And all columns should be in scope as enforced by the first type argument of Aggregator. So there aren't any issues that I can see.

tomjaguarpaw commented 8 years ago

I see. I must say that I am fairly uncomfortable with overwriting order. It means that I can't write an abstract Aggregator whose implementation is hidden. Instead, users can poke around inside it by applying different orderAggregates from the outside.

hesselink commented 8 years ago

The same is true of Query, right? I see that as an advantage: it's composable.

tomjaguarpaw commented 8 years ago

No, aggregation on a Query doesn't affect anything on the inside.

tomjaguarpaw commented 8 years ago

... I mean ordering on a Query ...

hesselink commented 8 years ago

I mean you can apply ordering to a Query after the fact and change its behaviour.

hesselink commented 8 years ago

What do you mean with 'on this inside'? On the inside of what?

tomjaguarpaw commented 8 years ago

Can we take a step back? Can you explain your use case for two different orders in the same aggregation?

hesselink commented 8 years ago

I don't have one. I just implemented a way to allow specifying an order for an aggregation. I did this as a combinator, and it turns out that you can use it to specify multiple different orders for the different aggregations in a single query. Since this matches the power that postgres gives you, I thought that was a very nice result: a simple combinator, and all the power of the underlying SQL. I'm now offering to also apply this code to opaleye master, but if you disagree with this approach I'm happy to keep it on our branch only, since it does what I need and the implementation is clean.

tomjaguarpaw commented 8 years ago

Ah yes, of course. Without it we don't have a way of forcing the aggregation order at all.

Keep it on your branch for now since I'm a bit uneasy about how Postgres handles this, but I'll think about it further.

tomjaguarpaw commented 8 years ago

By the way, I would be perfectly happy to merge

orderAggregate :: Order a -> Aggregator a b -> Query a -> Query b

if you wanted a compromise.

hesselink commented 8 years ago

That wouldn't allow multiple orders though.

tomjaguarpaw commented 8 years ago

Correct, which is why I'm more comfortable merging it :)

tomjaguarpaw commented 8 years ago

It seems like that would satisfy me whilst meeting all your current use cases. If you really need multiple orders in the future then we can reopen the discussion.

hesselink commented 8 years ago

I still don't understand your objection to multiple orders. What's the difference in safety of opaleye between generating array_agg(a ORDER BY b ASC), array_agg(a ORDER BY b ASC) and array_agg(a ORDER BY b ASC), array_agg(a ORDER BY b DESC)?

tomjaguarpaw commented 8 years ago

I'm not objecting to safety at this point (I wouldn't be surprised if it were safe, or if it weren't safe). However, the semantics are a bit odd, and I'd rather avoid thinking about them.

hesselink commented 8 years ago

It all feels a bit hand-wavy to me. I'll hold off on sending the PR then, since changing the code for no reason apparent to me feels like wasted effort, since I believe the current version is strictly better (and I trust postgres that if they document this, it will work sanely).

tomjaguarpaw commented 8 years ago

I am trying to be incredibly cautious with the denotational semantics of Opaleye. Being able to "order" an Aggregator complicates its meaning considerably.

hesselink commented 8 years ago

All right, it seems our goals are a bit different then: I'm mainly using it as a convenient, type checked SQL DSL. Since the (postgres) SQL allows ordering aggregates, it seems perfectly obvious to me to support this in opaleye as well. Be sure to let me know when/if you've worked out the semantics, and I can send the PR. Or you can just pick off the stuff from our branch.

tomjaguarpaw commented 8 years ago

I'm mainly using it as a convenient, type checked SQL DSL

Well yes, but it's a question of what you expect your types to mean :)

Anyway, here's a second possible compromise: We put your function Order a -> Aggregate a b -> Aggregate a b in Opaleye.Internal.Aggregate and we put my, less general, function Order a -> Aggregator a b -> Query a -> Query b in Opaleye.Aggregate. Then at least the functionality you have written is in the repo. Your branch can just reexport your version from Opaleye.Aggregate. I promise not to touch your version without telling you :)

How does that sound? (Of course I could just do this unilaterally, but if you think it's a good idea I'm likely to do it sooner.)

hesselink commented 8 years ago

That sounds very reasonable. Note that I don't need to do anything on my branch, since it's based on 0.3, not current master.

tomjaguarpaw commented 8 years ago

OK, sounds like a good plan then. What should I do? Copy your code across from your branch by hand?

hesselink commented 8 years ago

Sure, or you can cherry-pick the commit. Be sure to take the latest one since I force updated it. There's a decent chance it will apply cleanly.

Or I can do it in a couple of hours when I get home.

tomjaguarpaw commented 8 years ago

I don't think I'll do it in the next couple of hours, but whichever of us gets around to it first can do it.

hesselink commented 8 years ago

OK, I'm working on this now.

tomjaguarpaw commented 8 years ago

Fixed by https://github.com/tomjaguarpaw/haskell-opaleye/issues/155