Closed segeljakt closed 8 months ago
I hadn't thought of this, but it's a nice idea. Allow me to drill into it a bit.
into
clausesIn your
into clean
into process
example did you intend both into
commands to receive the same input, or the output of into clean
being the input to into process
? I can see pros/cons for both. But the former is fairly similar to compute
(see #69). We could phrase it as follows:
fun pipeline items =
from item in items
compute a = clean, b = process
(I switched from your OCaml-like let
to Morel's fun
. I'm open to let
but let's talk about that in a different thread.)
I like the idea of dividing the pipeline into functions, as opposed to intermediate values. I haven't figured out whether Morel is eager or lazy, but functions are definitely lazy, and are aggressively inlined, so the effect is very much like defining views in SQL or macros in Lisp. This is important because it allows the optimizer to push down conditions, prune projects, and so forth.
If the pipeline were in a single computation unit, then intermediate values would be fine also.
For some problems, it makes more sense to write a relational programs into a series of steps rather than a deeply nested query. Apache Pig Latin uses this idiom, for example:
lines = LOAD '/user/maria_dev/g.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
If people want to write programs in this idiom, Morel would happily support it.
It's possible that into
is nothing more than syntactic sugar, where you apply a function by writing is after its argument. For example, for any value of foo
,
from e in emps
where e.sal > 50
into foo
is equivalent to
foo (from e in emps
where e.sal > 50)
There's nothing from with syntactic sugar. (Morel's from
is syntactic sugar for map
and filter
pipelines.) It is valuable, especially if foo
is several lines long, and this allows me to move those lines down the page, so the "consumer" code is written after the "producer" code.
into
a clause of from
like where
and group
?After an into
, do you think we should still be a from
expression? For example, could we write a where
or group
right after an into
? I think there's value to it, but the syntax would need some work. For example,
from e in emps
where e.sal > 50
into foo
where e.deptno = 10
doesn't work because the e
iteration variable isn't in scope after the into
, and into
doesn't give us a way to declare a new iteration variable.
Maybe into
is a terminating step (as is yield
in many situations).
Or maybe into
is a general-purpose infix operator. If a into b
is shorthand for b a
, then I could write (x - y) into abs
as shorthand for abs (x - y)
.
into
?Was into
inspired by any other languages you are familiar with?
SQL has into
, for example
SELECT COUNT(*) INTO c
FROM Emp
(where c
is a variable in the enclosing programming language) but this has a procedural feel to it that doesn't seem appropriate for Morel.
Thank you for the very interesting answer 🙂
Was into inspired by any other languages you are familiar with?
I should probably have mentioned it, the language I found into
in is Tremor. Tremor uses it to create streaming pipelines.
select event
from input
where event.ok == true
into output;
did you intend both into commands to receive the same input, or the output of into clean being the input to into process?
Hmm, originally I thought about the second one. Maybe the first could make sense if clean
/process
were concurrent entities like actors instead of pure functions 🤔
For some problems, it makes more sense to write a relational programs into a series of steps rather than a deeply nested query. Apache Pig Latin uses this idiom, for example:
lines = LOAD '/user/maria_dev/g.txt' AS (line:chararray); words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word; grouped = GROUP words BY word; wordcount = FOREACH grouped GENERATE group, COUNT(words); DUMP wordcount;
If people want to write programs in this idiom, Morel would happily support it.
This looks a lot like Tremor.
(I switched from your OCaml-like let to Morel's fun. I'm open to let but let's talk about that in a different thread.)
Haha, I think I confused them, I prefer fun
also.
doesn't work because the
e
iteration variable isn't in scope after theinto
, andinto
doesn't give us a way to declare a new iteration variable.
I completely missed this point. The idea I had with into
was to allow the query to keep chaining clauses. Maybe it would have made more sense to write it as:
fun pipeline items =
from item0 in items,
item1 in clean item,
item2 in process item
yield item2
In this case, clean
and process
take one item as input instead of a relation of items. I wonder if this changes what you can express 🤔 By the way, maybe a stupid question. Is there a reason why from
can only occur at the start of the query?
Or maybe into is a general-purpose infix operator. If a into b is shorthand for b a, then I could write (x - y) into abs as shorthand for abs (x - y).
This is also a possibility. Some languages use |>
. Not sure what's the best
One last thing, unrelated to this issue: https://github.com/hydromatic/morel/issues/172
Is there a reason why
from
can only occur at the start of the query?
By definition, from
occurs at the start of a query (also known as a 'from expression', or, if you like, 'from comprehension') because from
creates a query.
A query is just syntactic sugar. For instance,
List.filter (fn e => #deptno e = 30) emps
is equivalent to
from e in emps
where e.deptno = 30
(The example is from slide 23 of my talk at Strange Loop 2021.) The from
expression binds the variable e
so that I can apply the 'filter' relational operator more concisely than writing a lambda. I can easily add another relational operator to the pipeline, like this:
from e in emps
where e.deptno = 30
yield {e.deptno, double_sal = e.sal * 2}
whereas the same operation using List.map
is clumsy because it requires another lambda and nested application:
List.map
(fn e => {deptno = #deptno e, double_sal = (#sal e) * 2})
(List.filter (fn e => #deptno e = 30) emps)
Nothing prevents you from expressing your relational operations using function application (it makes sense for set operations like union
) but for most relational operators the comprehension syntax is preferable. Because from
is concise you can easily mix comprehensions with higher-order functions.
Your proposed syntax
fun pipeline items =
from item in items,
item in clean item,
item in process item
doesn't have the desired effect, because the comma ',' operator creates a cartesian product. If I make the names unique, like this:
fun pipeline items =
from item in items,
item2 in clean item,
item3 in process item2
then the pipeline would produce triples with fields (item
, item2
, item3
), which is probably not what you want.
For your original example, you could write
from item in process
(from item in clean items)
but the from
syntax isn't gaining you anything, so you could write
process (clean items)
or
(process o clean) items
Still, the comprehension syntax is useful if you are doing one or two relational operations each step, as in this modified version of your original example:
let clean items =
from item in items
where item.data >= 0
let process items =
from item in items
group item.key
compute sum of item.data
let pipeline items =
from item in items
where item.data2 < 0
into clean
where item.data3 > 100
into process
We could write
let pipeline items =
from item in process (
from item in clean (
from item in items
where item.data2 < 0)
where item.data3 > 100)
But better, I think, to use into
:
let pipeline items =
from item in items
where item.data2 < 0
into item in clean
where item.data3 > 100
into item in process
I have given into
an in
clause so that it can bind a variable:
from x in list
(* clauses here... *)
into y in f
(* more clauses here... *)
is equivalent to
from y in f (
from x in list
(* clauses here... *) )
(* more clauses here... *)
Thanks for the elaboration. It makes sense now 👍 also, the into <id> in <expr>
syntax is interesting. I'm not sure if it reads as nice as the other clauses, but it definitely seems useful for eliminating nesting (similar to what you discussed in your talk).
I also don't know whether the syntax is perfect but let's give it a try. If you'd like to submit a PR I'll include it in Morel, and we'll see whether it's useful. We can evolve the syntax later.
Ok, I need to fix a thing in my work at the moment but I promise to look into it as soon as I have time 👍
Don't feel obligated. I'm just saying contributions are welcome, and this feature may not be perfect, but it will improve the language, so I would welcome it.
@snth, We were chatting about PRQL's |
operator a couple of weeks ago and it reminded me of the into
operator proposed for Morel. I know they're not directly comparable - newlines being special in PRQL, for instance - but I would like to hear your thoughts on into
.
I think we will need two keywords: through
to call a function in the middle of a pipeline, and into
at the end of the pipeline.
First, into
.
from <scans>
<steps>
into <expression>
is equivalent to
(<expression>) (from <scans> <steps>)
For example,
(*) sum adds a list of integers
sum [1, 2, 4];
> val it = 7 : int
(*) 'into' applies a function to its input, collected into a list
from e in [1, 2, 4]
into sum;
> val i = 7
(*) rewrite 'into' to apply the function directly
sum (from e in [1, 2, 4]);
> val i = 7;
(*) 'into' is equivalent to existing keyword 'compute'
from e in [1, 2, 4]
compute sum;
> val i = 7
(*) Actually, "a into b" is equivalent to "(b) a" for any types, not just lists
explode "abc";
> val it = [#"a",#"b",#"c"] : char list
"abc" into explode;
> val it = [#"a",#"b",#"c"] : char list
The original example becomes
fun clean items =
from item in items
where item.data >= 0
fun process items =
from item in items
group item.key
compute sum of item.data
fun pipeline items =
from item in items
into clean
into process
Next, through
is similar to into
, but has a pattern so that following steps can refer to the data items:
<expression1>
through <pattern> in <expression2>
<steps>
is equivalent to
from <pattern> in (<expression2>) (<expression1>)
<steps>
For example, if we want to append a where
step to the example, we need to convert into process
into through item in process
:
fun pipeline items =
from item in items
into clean
through item in process
where item.key > 100
Hi @julianhyde ,
I'm sorry, I did look at this issue soon after your ping but didn't want to write an incomplete response and then life got in the way.
We do have a keyword for this in PRQL.
It's documented here PRQL/variables
The current PRQL syntax is based on the discussion in https://github.com/PRQL/prql/issues/2427.
I had some further ideas in https://github.com/PRQL/prql/issues/2668 but they haven't gone anywhere.
Really appreciate the feedbac, @snth! It's funny that Morel had variables from the start and add PRQL added them later, whereas PRQL had a pipe operator from the start and Morel is (with through
) adding something like it.
Have you thought about adding something like an
into
clause for piping a relation into a function and getting the output? Maybe it could be nice to break up queries into multiple subqueries 🤔Also, maybe you could define recursive queries with it, maybe something like this (might need correction)