tomjaguarpaw / haskell-opaleye

Other
602 stars 115 forks source link

Add support for window functions #559

Closed tomjaguarpaw closed 2 years ago

tomjaguarpaw commented 2 years ago

Most of the implementation is due to Shane O'Brien (@duairc)

This is an API for window functions that I am comfortable supporting. The downside is that it is more "heavyweight" than the API proposed by @duairc, in the same way that the Opaleye aggregation API is more heavyweight than the rel8 aggregation API. We use Profunctors whilst they use Applicatives. The up side is that I can understand what the semantics are supposed to be. rel8 can use the internals to support its preferred API style, naturally.

Notable differences from @duairc's PR are

I would welcome comments, particularly from @duairc. I plan to merge this in September, hopefully the first half thereof.

tomjaguarpaw commented 2 years ago

I'm becoming more convinced that I should restore orderPartitionBy and drop the Order a argument from over.

tomjaguarpaw commented 2 years ago

This would close https://github.com/tomjaguarpaw/haskell-opaleye/issues/149

tomjaguarpaw commented 2 years ago

Hmm, no I don't want to restore orderPartitionBy because I am confused by the meaning of

((w1 <> orderPartitionBy o1)
  <*> (w2 <> orderPartitionBy o2))
  <* orderPartitionBy o

I suppose it has to mean

(w1 <> orderPartitionBy o1 <> orderPartitionBy o)
  <*> (w2 <> orderPartitionBy o2 <> orderPartitionBy o)

but I don't find that transformation particularly domain-relevant, so I would tend to prefer it not be expressible.

tomjaguarpaw commented 2 years ago

I believe that rel8 can recover the original API by defining

type Rel8Window a = (WindowFunction () a, Rel8Partition)
type Rel8Partition = (Window (), Order ())

rel8Over :: Rel8Window a -> Rel8Partition -> Rel8Window a
rel8Over (w, p1) p2 = (w, p1 <> p2)

nthValue :: Field_ n a -> Field SqlInt4 -> Rel8Window a
nthValue value n = (contramap (const value) (nthValue n), mempty)

rel8Window =
  -- some combination of over, runWindows and windowsApply
  -- similar to Rel8.Table.Opaleye.aggregator
shane-circuithub commented 2 years ago

Did you understand my comment on the original PR?

The reason that Opaleye has a Profunctor based interface for Aggregator makes sense. We considered doing the same in Rel8. The key reason motivating our consideration was:

  • Prevents non-aggregated columns. The only thing you can pure is columns outside the aggregation - which is fine.

That makes sense for aggregation because you need all columns in the aggregation to be either aggregated or GROUP BY'd.

But windows are different to aggregations. There's no concept of "grouping". It's perfectly valid to have columns that are neither grouped nor the result of a window function. The input query and the output query have the exact same number of rows — there is a 1-to-1 correlation between them. So it's valid to want to be able to pure columns from inside the "aggregation".

All this is to say, the Profunctor interface for window functions doesn't really make sense. A plain Applicative is actually the correct abstraction here. Note that this is also different to Rel8's interface for aggregation (we had tried to use an Applicative originally but couldn't make it work because it was the wrong abstraction).

tomjaguarpaw commented 2 years ago

Did you understand https://github.com/tomjaguarpaw/haskell-opaleye/pull/552#issuecomment-1183262072 on the original PR?

Yes I did. My original discomfort with the Applicative API was a suspicion that it would lead to scoping errors in the absence of LATERAL, but after reading your comment and having thought about it more I realise that's not the case.

Now I realise that my discomfort is that I don't understand the denotation of the Applicative API. For the Profunctor API I can imagine a denotation where

I am comfortable with the Profunctor API because this denotation reassures me that the API is coherent. It also gives me something to test in the property tests. Is there a similar denotation that would make me similarly comfortable with the Applicative API?

shane-circuithub commented 2 years ago

Maybe instead of thinking of window functions as WindowFunction a b or a -> Window b, we could think of them as Window a -> b. I agree with your intuition of a Window as an equivalence relation + an ordering that is used to select a sublist of the original [a]. Window functions are functions which take one of these sublists as input rather than a single row, but the thing that they return is a single row. This is not inconsistent with the Fold a b idea because Fold a b is basically [a] -> b.

We could actually have an API like the following:

over :: Partition a -> Ordering a -> Windows a (Window a)
identity :: Windows a a
cumulative :: Aggregator a b -> Window a -> b
runWindows :: Windows a b -> Select a -> Select b

And using it would look like:

{-# language ApplicativeDo #-}
{-# language OverloadedRecordDot #-}

data Test = Test
  { foo :: Field SqlText
  , bar :: Field SqlText
  , x :: Field SqlInt4
  , y :: Field SqlInt4
  }

windowTest :: Select Test -> Select Test
windowTest = runWindows $ do
  input <- identity
  overFoo <- over (partitionBy (.foo)) mempty
  overBar <- over (partitionBy (.bar)) mempty
  pure Test
    { foo = input.foo
    , bar = input.bar
    , x = cumulative sum $ (.x) <$> overFoo
    , y = cumulative sum $ (.y) <$> overBar
    }

What do you think about this as a Profunctor based API?

shane-circuithub commented 2 years ago

Am I correct in saying that this is how I would do it with the current API?

windowTest = Select Test -> Select Test
windowTest = runWindows $ do
  input <- over noWindowFunction mempty mempty
  x <- over (aggregatorWindowFunction sum (.x)) (partitionBy (.foo)) mempty
  y <- over (aggregatorWindowFunction sum (.y)) (partitionBy (.bar)) mempty
  pure Test
    { foo = input.foo
    , bar = input.bar
    , x = x
    , y = y
    }
tomjaguarpaw commented 2 years ago

Am I correct in saying that this is how I would do it with the current API?

Yes, that looks right.

cumulative :: Aggregator a b -> Window a -> b

I don't understand the denotation of this. A Window a is an equivalence relation on a where each equivalence class is ordered. Aggregator a b can cumulatively aggregate over each equivalence class. So how come it returns only a single b? It should return one b for each element of the equivalence class. (That's basically what WindowFunction a b denotes). Worse, what is it suppose to denote when a has no rows?

tomjaguarpaw commented 2 years ago

To be clear, I have nothing against you reproducing your original API in rel8. The only internal feature you'd have to use to do that is windowsApply, which I'm happy to support. On the other hand if you're trying to help me design a better API for Opaleye with the kind of semantics I'm comfortable with, then thanks!

shane-circuithub commented 2 years ago

Worse, what is it suppose to denote when a has no rows?

How could a have no rows?

shane-circuithub commented 2 years ago
cumulative :: Aggregator a b -> Window a -> b

I don't understand the denotation of this.

To me, the denotation of this is simply ([a] -> b) -> [a] -> b. To me, Window a is [a]. I get what you're saying about ordering and an equivalence class, but that's a means to the end, not the end itself. The end is a (possibly reordered) sublist of rows from the original query.

shane-circuithub commented 2 years ago

Here's a very rough translation of what I mean into Haskell:

import Control.Applicative (liftA2)
import Data.Foldable (find)
import Data.Function (on)
import Data.List.NonEmpty (groupBy, head, sortBy, takeWhile)
import Prelude hiding (head, takeWhile)

newtype Windows a b = Windows ([a] -> a -> b)

instance Functor (Windows a) where
  fmap f (Windows a) = Windows $ fmap (fmap f) a

instance Applicative (Windows a) where
  pure = Windows . pure . pure
  liftA2 f (Windows a) (Windows b) = Windows $ liftA2 (liftA2 f) a b

over :: (Eq partition, Ord ordering)
  => (a -> partition) -> (a -> ordering) -> Windows a [a]
over partitionBy orderBy = Windows $ \rows row ->
  let
    partitions =
      sortBy (compare `on` orderBy) <$> groupBy ((==) `on` partitionBy) rows
    partition = find ((== partitionBy row) . partitionBy . head) partitions
    subrows = maybe [row] (takeWhile ((<= orderBy row) . orderBy)) partition
   in
     subrows

runWindows :: Windows a b -> [a] -> [b]
runWindows (Windows f) rows = f rows <$> rows
tomjaguarpaw commented 2 years ago

Worse, what is it suppose to denote when a has no rows?

How could a have no rows?

Ah, I misunderstood what you meant by Window a. In your formulation each Window a is an ordered (non-empty) collection of rows, and Windows (Window a) is a collection of such.

tomjaguarpaw commented 2 years ago

Ok, nice! I think this works. This looks very promising. So we have the following denotation:

This is much nicer than my API. I'll convert my code over to this API (unless you fancy doing it in the meantime).

shane-circuithub commented 2 years ago

For what it's worth, the Haskell code above doesn't quite work because the groupBy in base only considers the equivalence of successive elements. Here's a fully worked version that uses Map from containers to do the grouping:

{-# language ApplicativeDo #-}
{-# language OverloadedRecordDot #-}

import Control.Applicative (liftA2)
import Data.Foldable (find, fold)
import Data.Function (on)
import Data.List.NonEmpty (groupBy, head, sortBy, takeWhile)
import Data.Maybe (fromMaybe)
import Prelude hiding (head, takeWhile)

import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

newtype Windows a b = Windows ([a] -> a -> b)

instance Functor (Windows a) where
  fmap f (Windows a) = Windows $ fmap (fmap f) a

instance Applicative (Windows a) where
  pure = Windows . pure . pure
  liftA2 f (Windows a) (Windows b) = Windows $ liftA2 (liftA2 f) a b

identity :: Windows a a
identity = Windows $ const id

over :: (Ord partition, Ord ordering)
  => (a -> partition) -> (a -> ordering) -> Windows a [a]
over partitionBy orderBy = Windows $ \rows ->
  let
    partitions = toMultiMapOn orderBy <$> toMultiMapOn partitionBy rows
  in
    \row ->
      let
        partition = Map.lookup (partitionBy row) partitions
        window = fromMaybe [row] $ partition >>= lookupLTE (orderBy row)
      in
        window

toMultiMapOn :: Ord k => (a -> k) -> [a] -> Map k [a]
toMultiMapOn f = Map.fromListWith (++) . map (liftA2 (,) f pure)

lookupLTE :: Ord k => k -> Map k [a] -> Maybe [a]
lookupLTE k m = case Map.splitLookup k m of
  (as, Nothing, _) | Map.null as -> Nothing
  (as, a, _) -> Just (fold as ++ fold a)

runWindows :: Windows a b -> [a] -> [b]
runWindows (Windows f) rows = f rows <$> rows

data Test = Test
  { foo :: String
  , bar :: String
  , x :: Int
  , y :: Int
  }
  deriving Show

windowTest :: [Test] -> [Test]
windowTest = runWindows $ do
  input <- identity
  overFoo <- over (.foo) (.bar)
  overBar <- over (.bar) (.foo)
  pure Test
    { foo = input.foo
    , bar = input.bar
    , x = sum $ (.x) <$> overFoo
    , y = product $ (.y) <$> overBar
    }

tests :: [Test]
tests =
  [ Test "a" "a" 1 2
  , Test "a" "b" 2 3
  , Test "a" "c" 4 5
  , Test "b" "a" 8 7
  , Test "b" "b" 16 11
  , Test "b" "c" 32 13
  , Test "c" "a" 64 17
  , Test "c" "b" 128 19
  , Test "c" "c" 256 23
  ]
tomjaguarpaw commented 2 years ago

I think this works.

Hmm, but then what should happen if we try to use runWindows at b ~ Window something?

runWindows :: Windows a b -> Select a -> Select b

Is it really meaningful to end up with a Select (Window something)?

shane-circuithub commented 2 years ago

Ah, you're right, I'm pretty sure that's nonsense. Sorry, I didn't mean to waste your time proposing a nonsensical API, it's more that your questions got me thinking about the denotation in a way that I hadn't really thought about before and I was thinking out loud in the comments. I think we're both better able to understand where the other is coming from now though. I can at least see how to implement Rel8's API on top of what you have here. I also still suspect that there's a simpler API that's possible that has a reasonable denotation. I'll think about this more over the coming days and let know if I come up with something solid.

shane-circuithub commented 2 years ago

Having said that, I think you might be able to do some sort of ST type trick to salvage the API above, but I don't know if it's worth it.

tomjaguarpaw commented 2 years ago

No need to apologise. I appreciate your contributions and your probing questions! I'd love to what your thoughts are after giving it more consideration. The Profunctor API is really quite cumbersome so improving it is desirable.

shane-circuithub commented 2 years ago

I haven't tried this but I think this should stop you being able to return a Window from runWindows:

over :: Partition a -> Ordering a -> Windows s a (Window s a)
identity :: Windows s a a
cumulative :: Aggregator a b -> Window s a -> b
runWindows :: (forall s. Windows s a b) -> Select a -> Select b
tomjaguarpaw commented 2 years ago

Yes, that looks promising. It might also work for aggregation. The "ST trick" is quite a digression from the current Opaleye API style (and might even make it more difficult for rel8 to consume) so I think I'll stick with what we have in this PR and consider "ST tricks" as a major API overhaul at a later date.

tomjaguarpaw commented 2 years ago

This was merged by https://github.com/tomjaguarpaw/haskell-opaleye/commit/b8999e752b793f9d9b2a7b08bc6a2e2404a038cd