Open mfatihaktas opened 4 months ago
Generally, to introduce new APIs, we like to do a survey of other systems to get an idea of whether the feature is supportable across backends.
I think other systems support `match_recognize`; can you look into whether some of our non-Flink backends support this functionality?
> I think other systems support `match_recognize`; can you look into whether some of our non-Flink backends support this functionality?
Sure, will do and share my findings here.
Edit: Updated the description with the "Other backends" section.
### Is your feature request related to a problem?
Pattern recognition in Flink enables searching for a set of event patterns in data streams. Flink comes with a complex event processing (CEP) library which allows for pattern detection in event streams, and it consolidates the CEP and SQL APIs via the `MATCH_RECOGNIZE` clause for complex event processing in SQL. More information is available in the Flink doc.
### Describe the solution you'd like
Ibis would support pattern recognition for the Flink backend with an API similar to the following.
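For illustration only, here is a sketch of the components such an API would need to collect. All names below are hypothetical and do not exist in Ibis; a real implementation would accept Ibis expressions (e.g., `ir.Column`) rather than plain strings.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the inputs a match_recognize() API would collect.
# None of these names are an existing Ibis API; they only mirror the
# MATCH_RECOGNIZE components described in this issue.
@dataclass
class MatchRecognizeSpec:
    partition_by: list[str]      # PARTITION BY columns
    order_by: list[str]          # ORDER BY columns (time attribute first)
    pattern: str                 # PATTERN, e.g. "A B+ C"
    define: dict[str, str]       # pattern variable -> condition (DEFINE)
    measures: dict[str, str] = field(default_factory=dict)  # output name -> expression
    after_match_strategy: str = "SKIP PAST LAST ROW"
    output_mode: str = "ONE ROW PER MATCH"  # the only mode Flink currently supports


spec = MatchRecognizeSpec(
    partition_by=["symbol"],
    order_by=["rowtime"],
    pattern="START_ROW PRICE_DOWN+ PRICE_UP",
    define={"PRICE_UP": "PRICE_UP.price > LAST(PRICE_DOWN.price, 1)"},
)
print(spec.output_mode)  # ONE ROW PER MATCH
```

A backend translation step could then render such a spec into the corresponding `MATCH_RECOGNIZE` SQL.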
Note: As can be seen in the doc, Flink provides a rich set of pattern-matching functionality through `MATCH_RECOGNIZE`. It is likely that this issue will need to be addressed in multiple iterations.

### Other backends
Supporting `MATCH_RECOGNIZE`:

Not supporting `MATCH_RECOGNIZE`:

### Acceptance criteria
- `match_recognize()` is implemented for Flink together with another backend.

### `MATCH_RECOGNIZE` components in Flink

In Flink, `MATCH_RECOGNIZE` consists of 6 components, which are listed below with brief explanations. Copying an example SQL below to be used as a reference while going through the list.

**Partitioning**: Logical partitioning of the rows with the `PARTITION BY` clause. Allows for dividing the rows into partitions and looking for patterns within each partition (similar to `GROUP BY` for aggregations). If partitioning is not used, the `MATCH_RECOGNIZE` clause will be translated into a non-parallel operator to ensure global ordering, which won't be performant. [Flink doc]

**Ordering**: Ordering of the rows with the `ORDER BY` clause. The first argument to `ORDER BY` must be a time attribute with ascending ordering. [Flink doc]

**Output mode**: Describes how many rows should be emitted for every found match. Flink currently only supports the `ONE ROW PER MATCH` mode, which always produces one output summary row for each found match. [Flink doc]

**Define**: The `DEFINE` clause specifies the conditions that rows have to fulfill in order to be classified to a corresponding pattern variable. If a condition is not defined for a pattern variable, a default condition is used which evaluates to true for every row.

Every match pattern is constructed from basic building blocks, called *pattern variables*, to which operators (quantifiers and other modifiers) can be applied. The whole pattern must be enclosed in brackets. An example pattern with pattern variables `A`, `B`, `C`, `D` could look like:

The pattern composition is complex, with multiple components:
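As an illustration (this particular pattern is assumed for this write-up, not taken from the issue), such a pattern combines the variables with quantifiers:

```sql
-- Hypothetical pattern over variables A, B, C, D:
--   A   matches exactly one A row
--   B+  matches one or more consecutive B rows
--   C*  matches zero or more C rows
PATTERN (A B+ C* D)
```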
**Measures**: The `MEASURES` clause defines what will be included in the output of a matching pattern. It can project columns and define expressions for evaluation.

**After match strategy**: The `AFTER MATCH SKIP` clause specifies where to start a new matching procedure after a complete match was found. There are four different strategies:

- `SKIP PAST LAST ROW`: resumes the pattern matching at the next row after the last row of the current match.
- `SKIP TO NEXT ROW`: continues searching for a new match starting at the next row after the starting row of the match.
- `SKIP TO LAST <pattern-variable>`: resumes the pattern matching at the last row that is mapped to the specified pattern variable.
- `SKIP TO FIRST <pattern-variable>`: resumes the pattern matching at the first row that is mapped to the specified pattern variable.

[Flink doc]
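For reference while going through the components above, here is a representative `MATCH_RECOGNIZE` query, adapted from the Ticker example in the Flink documentation (the `Ticker` table and its columns come from that doc, not from this issue):

```sql
SELECT *
FROM Ticker
    MATCH_RECOGNIZE (
        PARTITION BY symbol
        ORDER BY rowtime
        MEASURES
            START_ROW.rowtime AS start_tstamp,
            LAST(PRICE_DOWN.rowtime) AS bottom_tstamp,
            LAST(PRICE_UP.rowtime) AS end_tstamp
        ONE ROW PER MATCH
        AFTER MATCH SKIP TO LAST PRICE_UP
        PATTERN (START_ROW PRICE_DOWN+ PRICE_UP)
        DEFINE
            PRICE_DOWN AS
                (LAST(PRICE_DOWN.price, 1) IS NULL AND PRICE_DOWN.price < START_ROW.price) OR
                    PRICE_DOWN.price < LAST(PRICE_DOWN.price, 1),
            PRICE_UP AS
                PRICE_UP.price > LAST(PRICE_DOWN.price, 1)
    ) MR;
```

It partitions by `symbol`, orders by the `rowtime` time attribute, matches a V-shaped price movement (one or more falling rows followed by a rising row), and emits one summary row per match.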
Other concerns that we might want to consider:

**Time attributes**: In order to apply subsequent queries on top of `MATCH_RECOGNIZE`, it might be required to use time attributes. [Flink doc]

**Controlling memory consumption**: Memory consumption is an important consideration when writing `MATCH_RECOGNIZE` queries, as the space of potential matches is built in a breadth-first-like manner. With that in mind, one must make sure that the pattern can finish, preferably with a reasonable number of rows mapped to each match, as they have to fit into memory. [Flink doc]

### API design: `match_recognize()`
We would take all the required components of the `MATCH_RECOGNIZE` clause through function arguments. Some of the components, such as `partition_by`, `order_by`, `after_match_strategy`, and `output_mode`, can be expressed with `str` or the existing Ibis types (e.g., `ir.Column`). However, for `pattern` and `measures` we need to provide new expression types, because these are constructed with complex semantics: a pattern is defined in terms of *pattern variables* and the relation(s) over them and between them.

TODO: What alternative APIs can/should we consider? One alternative would be allowing users to build `match_recognize` through chained method calls, e.g., `expr = table.partition_by(...).order_by(...).pattern(...) ...`, but I think this would be confusing for users (which components are required and which are optional?) and would not fit the other functions in `ibis/expr/api.py`.

### Code of Conduct