hydromatic / morel

Standard ML interpreter, with relational extensions, implemented in Java
Apache License 2.0
294 stars 15 forks source link

`into` clause #171

Closed segeljakt closed 8 months ago

segeljakt commented 2 years ago

Have you thought about adding something like an into clause for piping a relation into a function and getting the output? Maybe it could be nice to break up queries into multiple subqueries 🤔

let clean items =
    from item in items
    where item.data >= 0

let process items =
    from item in items
    group item.key
    compute sum of item.data

let pipeline items =
    from item in items
    into clean
    into process

Also, maybe you could define recursive queries with it, maybe something like this (might need correction)

let transitive_closure nodes =
    from node in nodes, neighbors in node.neighbors
    yield neighbors
    into transitive_closure
julianhyde commented 2 years ago

I hadn't thought of this, but it's a nice idea. Allow me to drill into it a bit.

Meaning of multiple into clauses

In your

into clean
into process

example did you intend both into commands to receive the same input, or the output of into clean being the input to into process? I can see pros/cons for both. But the former is fairly similar to compute (see #69). We could phrase it as follows:

fun pipeline items =
    from item in items
    compute a = clean, b = process

(I switched from your OCaml-like let to Morel's fun. I'm open to let but let's talk about that in a different thread.)

Intermediate functions

I like the idea of dividing the pipeline into functions, as opposed to intermediate values. I haven't figured out whether Morel is eager or lazy, but functions are definitely lazy, and are aggressively inlined, so the effect is very much like defining views in SQL or macros in Lisp. This is important because it allows the optimizer to push down conditions, prune projects, and so forth.

If the pipeline were in a single computation unit, then intermediate values would be fine also.

Pipelines as steps

For some problems, it makes more sense to write a relational programs into a series of steps rather than a deeply nested query. Apache Pig Latin uses this idiom, for example:

lines = LOAD '/user/maria_dev/g.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;

If people want to write programs in this idiom, Morel would happily support it.

Postfix function application

It's possible that into is nothing more than syntactic sugar, where you apply a function by writing is after its argument. For example, for any value of foo,

from e in emps
  where e.sal > 50
  into foo

is equivalent to

foo (from e in emps
  where e.sal > 50)

There's nothing from with syntactic sugar. (Morel's from is syntactic sugar for map and filter pipelines.) It is valuable, especially if foo is several lines long, and this allows me to move those lines down the page, so the "consumer" code is written after the "producer" code.

Is into a clause of from like where and group?

After an into, do you think we should still be a from expression? For example, could we write a where or group right after an into? I think there's value to it, but the syntax would need some work. For example,

from e in emps
  where e.sal > 50
  into foo
  where e.deptno = 10

doesn't work because the e iteration variable isn't in scope after the into, and into doesn't give us a way to declare a new iteration variable.

Maybe into is a terminating step (as is yield in many situations).

Or maybe into is a general-purpose infix operator. If a into b is shorthand for b a, then I could write (x - y) into abs as shorthand for abs (x - y).

What languages have into?

Was into inspired by any other languages you are familiar with?

SQL has into, for example

SELECT COUNT(*)  INTO c
FROM Emp

(where c is a variable in the enclosing programming language) but this has a procedural feel to it that doesn't seem appropriate for Morel.

segeljakt commented 2 years ago

Thank you for the very interesting answer 🙂

Was into inspired by any other languages you are familiar with?

I should probably have mentioned it, the language I found into in is Tremor. Tremor uses it to create streaming pipelines.

select event
from input
where event.ok == true
into output;

did you intend both into commands to receive the same input, or the output of into clean being the input to into process?

Hmm, originally I thought about the second one. Maybe the first could make sense if clean/process were concurrent entities like actors instead of pure functions 🤔

For some problems, it makes more sense to write a relational programs into a series of steps rather than a deeply nested query. Apache Pig Latin uses this idiom, for example:

lines = LOAD '/user/maria_dev/g.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;

If people want to write programs in this idiom, Morel would happily support it.

This looks a lot like Tremor.

(I switched from your OCaml-like let to Morel's fun. I'm open to let but let's talk about that in a different thread.)

Haha, I think I confused them, I prefer fun also.

doesn't work because the e iteration variable isn't in scope after the into, and into doesn't give us a way to declare a new iteration variable.

I completely missed this point. The idea I had with into was to allow the query to keep chaining clauses. Maybe it would have made more sense to write it as:

fun pipeline items =
    from item0 in items,
         item1 in clean item,
         item2 in process item
    yield item2

In this case, clean and process take one item as input instead of a relation of items. I wonder if this changes what you can express 🤔 By the way, maybe a stupid question. Is there a reason why from can only occur at the start of the query?

Or maybe into is a general-purpose infix operator. If a into b is shorthand for b a, then I could write (x - y) into abs as shorthand for abs (x - y).

This is also a possibility. Some languages use |>. Not sure what's the best

One last thing, unrelated to this issue: https://github.com/hydromatic/morel/issues/172

julianhyde commented 2 years ago

Is there a reason why from can only occur at the start of the query?

By definition, from occurs at the start of a query (also known as a 'from expression', or, if you like, 'from comprehension') because from creates a query.

A query is just syntactic sugar. For instance,

List.filter (fn e => #deptno e = 30) emps

is equivalent to

from e in emps
  where e.deptno = 30

(The example is from slide 23 of my talk at Strange Loop 2021.) The from expression binds the variable e so that I can apply the 'filter' relational operator more concisely than writing a lambda. I can easily add another relational operator to the pipeline, like this:

from e in emps
  where e.deptno = 30
  yield {e.deptno, double_sal = e.sal * 2}

whereas the same operation using List.map is clumsy because it requires another lambda and nested application:

List.map
  (fn e => {deptno = #deptno e, double_sal = (#sal e) * 2})
  (List.filter (fn e => #deptno e = 30) emps)

Nothing prevents you from expressing your relational operations using function application (it makes sense for set operations like union) but for most relational operators the comprehension syntax is preferable. Because from is concise you can easily mix comprehensions with higher-order functions.

julianhyde commented 2 years ago

Your proposed syntax

fun pipeline items =
    from item in items,
         item in clean item,
         item in process item

doesn't have the desired effect, because the comma ',' operator creates a cartesian product. If I make the names unique, like this:

fun pipeline items =
    from item in items,
         item2 in clean item,
         item3 in process item2

then the pipeline would produce triples with fields (item, item2, item3), which is probably not what you want.

For your original example, you could write

from item in process
  (from item in clean items)

but the from syntax isn't gaining you anything, so you could write

process (clean items)

or

(process o clean) items

Still, the comprehension syntax is useful if you are doing one or two relational operations each step, as in this modified version of your original example:

let clean items =
    from item in items
    where item.data >= 0

let process items =
    from item in items
    group item.key
    compute sum of item.data

let pipeline items =
    from item in items
    where item.data2 < 0
    into clean
    where item.data3 > 100 
    into process

We could write

let pipeline items =
   from item in process (
     from item in clean (
       from item in items
       where item.data2 < 0)
     where item.data3 > 100) 

But better, I think, to use into:

let pipeline items =
  from item in items
    where item.data2 < 0
    into item in clean
    where item.data3 > 100
    into item in process

I have given into an in clause so that it can bind a variable:

from x in list
  (* clauses here... *)
  into y in f
  (* more clauses here... *)

is equivalent to

from y in f (
  from x in list
  (* clauses here... *) )
 (* more clauses here... *)
segeljakt commented 2 years ago

Thanks for the elaboration. It makes sense now 👍 also, the into <id> in <expr> syntax is interesting. I'm not sure if it reads as nice as the other clauses, but it definitely seems useful for eliminating nesting (similar to what you discussed in your talk).

julianhyde commented 2 years ago

I also don't know whether the syntax is perfect but let's give it a try. If you'd like to submit a PR I'll include it in Morel, and we'll see whether it's useful. We can evolve the syntax later.

segeljakt commented 2 years ago

Ok, I need to fix a thing in my work at the moment but I promise to look into it as soon as I have time 👍

julianhyde commented 2 years ago

Don't feel obligated. I'm just saying contributions are welcome, and this feature may not be perfect, but it will improve the language, so I would welcome it.

julianhyde commented 11 months ago

@snth, We were chatting about PRQL's | operator a couple of weeks ago and it reminded me of the into operator proposed for Morel. I know they're not directly comparable - newlines being special in PRQL, for instance - but I would like to hear your thoughts on into.

julianhyde commented 8 months ago

I think we will need two keywords: through to call a function in the middle of a pipeline, and into at the end of the pipeline.

First, into.

from <scans>
  <steps>
  into <expression>

is equivalent to

(<expression>) (from <scans> <steps>)

For example,

(*) sum adds a list of integers
sum [1, 2, 4];
> val it = 7 : int

(*) 'into' applies a function to its input, collected into a list
from e in [1, 2, 4]
  into sum;
> val i = 7

(*) rewrite 'into' to apply the function directly
sum (from e in [1, 2, 4]);
> val i = 7;

(*) 'into' is equivalent to existing keyword 'compute'
from e in [1, 2, 4]
  compute sum;
> val i = 7

(*) Actually, "a into b" is equivalent to "(b) a" for any types, not just lists
explode "abc";
> val it = [#"a",#"b",#"c"] : char list

"abc" into explode;
> val it = [#"a",#"b",#"c"] : char list

The original example becomes

fun clean items =
    from item in items
    where item.data >= 0

fun process items =
    from item in items
    group item.key
    compute sum of item.data

fun pipeline items =
    from item in items
    into clean
    into process

Next, through is similar to into, but has a pattern so that following steps can refer to the data items:

<expression1>
  through <pattern> in <expression2>
  <steps>

is equivalent to

from <pattern> in (<expression2>) (<expression1>)
  <steps>

For example, if we want to append a where step to the example, we need to convert into process into through item in process:

fun pipeline items =
    from item in items
    into clean
    through item in process
    where item.key > 100
snth commented 8 months ago

Hi @julianhyde ,

I'm sorry, I did look at this issue soon after your ping but didn't want to write an incomplete response and then life got in the way.

We do have a keyword for this in PRQL.

It's documented here PRQL/variables

The current PRQL syntax is based on the discussion in https://github.com/PRQL/prql/issues/2427.

I had some further ideas in https://github.com/PRQL/prql/issues/2668 but they haven't gone anywhere.

julianhyde commented 8 months ago

Really appreciate the feedbac, @snth! It's funny that Morel had variables from the start and add PRQL added them later, whereas PRQL had a pipe operator from the start and Morel is (with through) adding something like it.