risingwavelabs / risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.
https://go.risingwave.com/slack
Apache License 2.0
6.78k stars 561 forks source link

Support SQL UDF #10151

Open xxchan opened 1 year ago

xxchan commented 1 year ago

Motivation

Reusability

e.g.,

Easier to migrate from other SQL dialects to RisingWave

Different SQL dialects (Flink/Hive/Clickhouse) have different syntax sugars. Usually there are equivalent functions, but rewriting SQL can be error-prone. If we allow them to define a SQL UDF as an alias, they don't need to modify their SQLs at all.

Users are very familiar with their original SQL, and the less changes made to the SQL logic, the lower the migration cost will be.

This is one example https://github.com/risingwavelabs/risingwave/pull/10145

Feature subset we can support

LANGUAGE SQL is also quite complex, as it allows multiple statements (DMLs are allowed). We won't want to support this now (and don't support multi-statement).

SQL functions execute an arbitrary list of SQL statements, returning the result of the last query in the list. In the simple (non-set) case, the first row of the last query's result will be returned. (Bear in mind that “the first row” of a multirow result is not well-defined unless you use ORDER BY.) If the last query happens to return no rows at all, the null value will be returned.

But a subset of SQL UDF won't be hard to implement. Specifically, we can support those can be inlined as a subquery. And we can implement the features one by one. Even we only support the simplest forms, it's still useful.

CREATE FUNCTION name ( args ) RETURNS type
LANGUAGE SQL
    | AS 'definition'
    | sql_body
  } ...

sql_body is like

RETURN expression

A note from PG doc about this form:

This form is parsed at function definition time, the string constant form is parsed at execution time; therefore this form cannot support polymorphic argument types and other constructs that are not resolvable at function definition time. This form tracks dependencies between the function and objects used in the function body, so DROP ... CASCADE will work correctly, whereas the form using string literals may leave dangling functions. Finally, this form is more compatible with the SQL standard and other SQL implementations.

definition is a SQL string

1 returns a value (1 row 1 col)

1.1 RETURN expression

This should be trivial.

1.2 SELECT an expression

SELECT $1+$2;

1.3 SELECT multiple expressions

e.g.,

create function f(out a int, out b int)  language sql as 'select 1,2';

This needs to change multiple cols to a struct. Cannot simply convert select f(), ... to select (select 1,2), ... (more than 1 cols).

1.4 polymorphic arguments

e.g.,

CREATE FUNCTION is_greater(anyelement, anyelement) RETURNS boolean AS $$
    SELECT $1 > $2;
$$ LANGUAGE SQL;

CREATE FUNCTION dup (f1 anyelement, OUT f2 anyelement, OUT f3 anyarray)
AS 'select $1, array[$1,$1]' LANGUAGE SQL;

1.5 SELECT with FROM

e.g., agg and point-get query. might consider together with 2.

2. returns a table (setof) (n rows 1 col)

This includes: table functions and more general SELECT subquery.

Postgres executes the SQL UDF in ProjectSet. It might be possible to convert it to a JOIN. Consider it later.

e.g.,

create function f() returns setof int language sql as 'select * from t';

Cannot simply convert to select (select * from t), ... (more than 1 rows).

Some points might need to take care of

Reference

xxchan commented 1 year ago

Some offline discussions:

neverchanje commented 1 year ago

Btw, I think we can probably encourage our users to use DBT's macro in the case where they need UDF. My major concern with this feature is the engineering complexity that it'll introduce to our system.

xxchan commented 1 year ago

Not very familiar with DBT. I'm wondering is it easy to adopt for users unfamiliar with it? Will it require users to change their workflow? Or can be used as a handy Swiss Army Knife? 🤔

Complexity is definitely the concern, and also Eric's reason for opposing. But maybe the simplist expression case is most useful and has very low complexity.

e.g.,

Yes -- say I need to parse jsonb to handle missing values and casting, I don't necessarily want extra piece of infrastructure (python server) when a sql expression could be enough

https://risingwave-community.slack.com/archives/C02T3F7UYM6/p1689610339419499?thread_ts=1689610181.827109&cid=C02T3F7UYM6

For such use case, maybe users don't want extra DBT neither.

github-actions[bot] commented 2 months ago

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean. Don't worry if you think the issue is still valuable to continue in the future. It's searchable and can be reopened when it's time. 😄

st1page commented 2 months ago

is it done? Can we close the issue?

xzhseh commented 2 months ago

is it done? Can we close the issue?

Not yet for some specific syntax, see details here.

I can pick this up when I have time.

st1page commented 2 months ago

is it done? Can we close the issue?

Not yet for some specific syntax, see details here.

I can pick this up when I have time.

Oh, I just thought we had already finished it. Thanks ❤️