tomjaguarpaw / haskell-opaleye

Other
599 stars 115 forks source link

How to aggregate on selected fields or joining table fields? #544

Closed stevemao closed 2 years ago

stevemao commented 2 years ago

The example of aggregation in the docs works well

aggregateWidgets = aggregate (pWidget Widget { style    = groupBy
                                              , color    = groupBy
                                              , location = count
                                              , quantity = arrayAgg
                                              , radius   = avg })
                              (selectTable widgetTable)

I want to extend the example to select some fields in the widgetTable, or joining other tables and select fields on those tables too. For simplicity, I'm selecting the original 5 fields just to demonstrate the problem.

aggregateWidgets = do
    widget <- selectTable widgetTable

    aggregate (pWidget Widget { style    = groupBy
                                              , color    = groupBy
                                              , location = count
                                              , quantity = arrayAgg
                                              , radius   = avg })
                              (return widget)

This does not aggregate properly anymore. It seems that return isn't the right choice here. What's the right way to do this?

tomjaguarpaw commented 2 years ago

Hmm, I'm not sure what you mean. Could you give an example with sample data perhaps? That is, write out some example data for the widget table, and the result that you want from your aggregation operation?

stevemao commented 2 years ago

yes, let me give more details here.

The table is the same as the example in the docs

-- Sequence and defined type
CREATE SEQUENCE IF NOT EXISTS "widgetTable_id_seq";

-- Table Definition
CREATE TABLE "public"."widgetTable" (
    "id" int4 NOT NULL DEFAULT nextval('"widgetTable_id_seq"'::regclass),
    "style" text NOT NULL,
    "color" text NOT NULL,
    "location" text NOT NULL,
    "quantity" int4 NOT NULL,
    "radius" float8 NOT NULL,
    PRIMARY KEY ("id")
);

And I added some sample data

INSERT INTO "public"."widgetTable" ("id", "style", "color", "location", "quantity", "radius") VALUES
('1', 'style', 'red', 'Syd', '2', '5'),
('2', 'style', 'red', 'Melb', '1', '10');

If I copy the aggregation code from the example

aggregateWidgets :: Select (Widget (Field SqlText) (Field SqlText) (Field SqlInt8)
                                   (Field (SqlArray SqlInt4)) (Field SqlFloat8))
aggregateWidgets = do
    aggregate (pWidget Widget { style    = groupBy
                                              , color    = groupBy
                                              , location = count
                                              , quantity = arrayAgg
                                              , radius   = avg })
                              (selectTable widgetTable)

it generates the right SQL

SELECT
"result0_2" as "result1_3",
"result1_2" as "result2_3",
"result2_2" as "result3_3",
"result3_2" as "result4_3",
"result4_2" as "result5_3"
FROM (SELECT
      *
      FROM (SELECT
            "inner0_2" as "result0_2",
            "inner1_2" as "result1_2",
            COUNT("inner2_2") as "result2_2",
            ARRAY_AGG("inner3_2") as "result3_2",
            AVG("inner4_2") as "result4_2"
            FROM (SELECT
                  "style0_1" as "inner0_2",
                  "color1_1" as "inner1_2",
                  "location2_1" as "inner2_2",
                  "quantity3_1" as "inner3_2",
                  "radius4_1" as "inner4_2",
                  *
                  FROM (SELECT
                        *
                        FROM (SELECT
                              "style" as "style0_1",
                              "color" as "color1_1",
                              "location" as "location2_1",
                              "quantity" as "quantity3_1",
                              "radius" as "radius4_1"
                              FROM "widgetTable" as "T1") as "T1") as "T1") as "T1"
            GROUP BY "inner0_2",
                     "inner1_2") as "T1") as "T1";

And it returns the expected result

style   red 2   {2,1}   7.5

However, if I want to only select 3 of the fields, or if I want to also select a field that's from another table with a join, I write something like this

aggregateWidgets :: Select (Widget (Field SqlText) (Field SqlText) (Field SqlInt8)
                                   (Field (SqlArray SqlInt4)) (Field SqlFloat8))
aggregateWidgets = do
    widget <- selectTable widgetTable
    aggregate (pWidget Widget { style    = groupBy
                                              , color    = groupBy
                                              , location = count
                                              , quantity = arrayAgg
                                              , radius   = avg })
                              (return (Widget (style widget) (color widget) (location widget) (quantity widget) (radius widget)))

Note I'm still selecting the same fields just for simplicity. I would create another Widget type that excludes the fields I don't want or include the fields from other tables.

This doesn't generate the right SQL

SELECT
"result0_2" as "result1_3",
"result1_2" as "result2_3",
"result2_2" as "result3_3",
"result3_2" as "result4_3",
"result4_2" as "result5_3"
FROM (SELECT
      *
      FROM (SELECT
            "style" as "style0_1",
            "color" as "color1_1",
            "location" as "location2_1",
            "quantity" as "quantity3_1",
            "radius" as "radius4_1"
            FROM "widgetTable" as "T1") as "T1",
           LATERAL
           (SELECT
            "inner0_2" as "result0_2",
            "inner1_2" as "result1_2",
            COUNT("inner2_2") as "result2_2",
            ARRAY_AGG("inner3_2") as "result3_2",
            AVG("inner4_2") as "result4_2"
            FROM (SELECT
                  "style0_1" as "inner0_2",
                  "color1_1" as "inner1_2",
                  "location2_1" as "inner2_2",
                  "quantity3_1" as "inner3_2",
                  "radius4_1" as "inner4_2",
                  *
                  FROM (SELECT
                        0) as "T1") as "T1"
            GROUP BY "inner0_2",
                     "inner1_2") as "T2") as "T1";

And returns wrong result

style   red 1   {2} 5
style   red 1   {1} 10

What is the right way to achieve this?

tomjaguarpaw commented 2 years ago

I'm still not entirely sure what you're trying to do, but do you want to project the columns after the aggregation has happened?

https://github.com/tomjaguarpaw/haskell-opaleye/blob/5fc848b4d64e430260d7674be0bc4dd9ddcdd056/Doc/Tutorial/TutorialBasic.lhs#L603-L612

You can also project the columns before if you like, but note, the select from the table has to be inside the aggregation.

https://github.com/tomjaguarpaw/haskell-opaleye/blob/5fc848b4d64e430260d7674be0bc4dd9ddcdd056/Doc/Tutorial/TutorialBasic.lhs#L614-L618

tomjaguarpaw commented 2 years ago

One rule of thumb to get to the correct solution is that the Select monad behaves like the list monad. That might help you understand where to place selects for the desired outcome.

stevemao commented 2 years ago

Thanks! I think the example of "project the columns before" would work. Let me try it tomorrow.

stevemao commented 2 years ago

Hi @tomjaguarpaw , it seems to work with fields from other tables too

I slightly modified the example.

personAndBirthday ::
   Select (Field (SqlArray SqlText), Field SqlInt4, Field (SqlArray SqlText), Field SqlDate)
personAndBirthday = do
   aggregate (p4 (arrayAgg, groupBy, arrayAgg, groupBy))
            $ do
               (name, age, address) <- personSelect
               birthday             <- birthdaySelect

               where_ $ name .== bdName birthday
               pure (name, age, address, bdDay birthday)

This works and I think it's worth to add it to the docs.

the select from the table has to be inside the aggregation

I missed the inner Select completely.

One rule of thumb to get to the correct solution is that the Select monad behaves like the list monad.

I still don't understand what it means to have an inner select. Perhaps we can compare 2 examples

Example 1. This is working as expected

personAndBirthday ::
   Select (Field (SqlArray SqlText), Field SqlInt4, Field (SqlArray SqlText), Field SqlDate)
personAndBirthday = do
   aggregate (p4 (arrayAgg, groupBy, arrayAgg, groupBy))
            $ do
               (name, age, address) <- personSelect
               birthday             <- birthdaySelect

               where_ $ name .== bdName birthday
               pure (name, age, address, bdDay birthday)

Example 2. I cannot see a use case. But what does it mean when we move the select out side of the aggregation?

personAndBirthday ::
   Select (Field (SqlArray SqlText), Field SqlInt4, Field (SqlArray SqlText), Field SqlDate)
personAndBirthday = do
   (name, age, address) <- personSelect
   birthday             <- birthdaySelect

   where_ $ name .== bdName birthday

   aggregate (p4 (arrayAgg, groupBy, arrayAgg, groupBy))
            $ do
               pure (name, age, address, bdDay birthday)
stevemao commented 2 years ago

I think this is how I understand it. Monad carries the context. When we do aggregate, we need to carry the context (where_ and all the Selects) to it so that it can generate the correct aggregation script.

tomjaguarpaw commented 2 years ago

It's simpler than that. Think of Select as the list monad. Then you will have the intuition you need to understand where the select needs to go. For example

{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE MultiParamTypeClasses #-}

module Example where

import Opaleye

import Control.Monad (guard)
import Database.Postgres.Temp as Temp
import Database.PostgreSQL.Simple as PGS

listSum :: [Int] -> [Int]
listSum = pure . Prelude.sum

listInner :: [Int]
listInner = do
  x <- [1, 2, 3, 4, 5]
  guard (x <= 3)
  listSum (pure x)

listOuter :: [Int]
listOuter = do
  listSum $ do
    x <- [1, 2, 3, 4, 5]
    guard (x <= 3)
    pure x

opaleyeInner :: Select (Field SqlInt4)
opaleyeInner = do
  x <- values [1, 2, 3, 4, 5]
  where_ (x .<= 3)
  aggregate Opaleye.sum (pure x)

opaleyeOuter :: Select (Field SqlInt4)
opaleyeOuter = do
  aggregate Opaleye.sum $ do
    x <- values [1, 2, 3, 4, 5]
    where_ (x .<= 3)
    pure x

runExamples :: IO ()
runExamples = do
  _ <- Temp.with $ \db -> do
    conn <- PGS.connectPostgreSQL (Temp.toConnectionString db)
    putStrLn $ "listInner: " ++ show listInner
    putStrLn $ "listOuter: " ++ show listOuter
    putStr "opaleyeInner: "
    print =<< runSelectI conn opaleyeInner
    putStr "opaleyeInner: "
    print =<< runSelectI conn opaleyeOuter

  pure ()
> runExamples 
listInner: [1,2,3]
listOuter: [6]
opaleyeInner: [1,2,3]
opaleyeInner: [6]
tomjaguarpaw commented 2 years ago

And to put it a bit more directly, pure creates a single row, just like pure in the list monad creates a singleton list. If you do the select outside the aggregate and you have just pure inside the aggregate then you are aggregating over a single row, which is probably not what you want!

stevemao commented 2 years ago

Thanks for the detailed explanation!