tomjaguarpaw / haskell-opaleye

Other
599 stars 115 forks source link

Rewrite only references in aggregation, not all values #585

Closed shane-circuithub closed 4 months ago

shane-circuithub commented 5 months ago

In #576 I added preliminary support for ordered set aggregation functions. However, in the vast majority of cases this generates invalid SQL as things stand.

This PR contains two commits: the first adds a failing test demonstrating the problem, the second one fixes the problem.

To recap, recall PostgreSQL does not allow lateral references in aggregation functions, e.g., the following SQL is invalid:

SELECT
  result
FROM
  (
    VALUES
      (1),
      (2),
      (3)
  ) _(x),
  LATERAL (
    SELECT
      sum(x) AS result
  ) __

It fails with the error message ERROR: aggregate functions are not allowed in FROM clause of their own query level. Opaleye works around this limitation by rewriting the above query as follows:

SELECT
  result
FROM
  (
    VALUES
      (1),
      (2),
      (3)
  ) _(x),
  LATERAL (
    SELECT
      sum(inner1) AS result
    FROM
      (
        SELECT
          x AS inner1
      ) _
  ) __

The current implementation of this rewriting rewrites all arguments of aggregation functions regardless of whether they contain (potentially lateral) references or not. However, the same effect could be achieved if we only rewrote references and not all expressions. That's what this PR does.

This change is beneficial when using set aggregation functions such as percentile_cont. With the current rewriting scheme, Opaleye will generate:

SELECT
  result
FROM
  (
    VALUES
      (1),
      (2),
      (3)
  ) _(x),
  LATERAL (
    SELECT
      percentile_cont(inner1) WITHIN GROUP (ORDER BY inner2) AS result
    FROM
      (
        SELECT
          0.5 AS inner1,
          x AS inner2
      ) _
  ) __

Which fails with the error message:

ERROR:  column "_.inner1" must appear in the GROUP BY clause or be used in an aggregate function
LINE 12:       percentile_cont(inner1) WITHIN GROUP (ORDER BY inner2)...
                               ^
DETAIL:  Direct arguments of an ordered-set aggregate must use only grouped columns.

After the change in this PR, this instead becomes:

SELECT
  result
FROM
  (
    VALUES
      (1),
      (2),
      (3)
  ) _(x),
  LATERAL (
    SELECT
      percentile_cont(0.5) WITHIN GROUP (ORDER BY inner1) AS result
    FROM
      (
        SELECT
          x AS inner1
      ) _
  ) __

Which works.

Incidentally this is the exactly the same fix as the one proposed in #578, which I understand you consider a less than ideal fix for the problem it purports to solve, but I think the API changes that would result from your proposed solution in the comments (although it's correct) are bigger than I want to take on right now. In any case, this is the problem that I am eager to solve at this time, so would you consider this fix as a solution to the problem described above, even if you're not satisfied that it truly solves #578? It shouldn't break any existing code as far as I can see.

tomjaguarpaw commented 5 months ago

Thanks. This post contains a lot of repeated bits. I guess that's copy-paste error? If so, could you tidy it up?

shane-circuithub commented 5 months ago

Apologies for that, it should be fixed now.

tomjaguarpaw commented 5 months ago

Thanks. I'm very short on time and energy but I'd like to get this to a stage where you are at least unblocked for what you need to do.

I think the API changes that would result from your proposed solution in the comments (although it's correct) are bigger than I want to take on right now

When you say "take on" do you mean that the implementation of the changes to Opaleye (described in https://github.com/tomjaguarpaw/haskell-opaleye/pull/578#issuecomment-1782720333) are too big to take on, or once such changes to Opaleye are made, the changes required on the rel8 side are too big? I'm not particularly worried about the former. I think I can probably do it myself given a free weekend afternoon. On the other hand, I don't know if the new API will be useable by rel8 at all, in the way you want to use it.

tomjaguarpaw commented 5 months ago

Oh, and I incorporated your tests as "pending" into the test suite. Can you rebase your branch (or maybe all of them)?

shane-circuithub commented 5 months ago

When you say "take on" do you mean that the implementation of the changes to Opaleye (described in #578 (comment)) are too big to take on, or once such changes to Opaleye are made, the changes required on the rel8 side are too big? I'm not particularly worried about the former. I think I can probably do it myself given a free weekend afternoon. On the other hand, I don't know if the new API will be useable by rel8 at all, in the way you want to use it.

Yeah, I just mean on the Rel8 side of things. I actually don't care so much about the ability to use DISTINCT and ORDER BY together. I was aware of the problem but it was the ordered set aggregation that I was trying to solve, I just realised that my solution "solved" both problems but maybe I put too much emphasis on the former.

I've rebased this PR.

tomjaguarpaw commented 5 months ago

OK, I'll have a closer look at the weekend.

tomjaguarpaw commented 5 months ago

After refreshing my memory on this issue, the biggest objection I have is that it's still very easy to write crashing queries after this PR. The proposed change feels too invasive for a partial solution.

By way of alternative, I suggest excluding the Aggregator itself from what is rebound. I don't see any reason why we need to include it. I'll work on that now.

tomjaguarpaw commented 5 months ago

By way of alternative, I suggest excluding the Aggregator itself from what is rebound.

This is it. @duairc, please take a look. It fixes the particular test cases you wrote, at least. Does it work for your general use cases?

https://github.com/tomjaguarpaw/haskell-opaleye/pull/586

shane-circuithub commented 5 months ago

I've just tested this out. It did require some changes to Rel8 and to CircuitHub code that uses Rel8, but they were relatively tolerable. I would have preferred not to incur a Table Expr i constraint on aggregate but your solution does make a lot of sense and it isn't really the end of the world. So I'm happy to go with this if you're happy with it.

tomjaguarpaw commented 5 months ago

Yeah, it's unfortunate to pick up that extra constraint but it does seem like it plays a genuine role. The API is still not safe (lmapping Aggregators can lead to crashing queries) but at least now it's possible to write the queries you want, so it seems like the right first step.