mfatihaktas opened 8 months ago
It might be important to distinguish between system time (table history managed automatically by the DBMS system) and business/validity time managed by the application/user.
System time might be what you get from a time travel system as implemented in Apache Iceberg, Delta Lake, and SQL:2011 temporal tables.

Bi-temporal modeling can be useful in many contexts, and Ibis might need to know how system and valid time ranges are handled differently by the underlying backend.

Tri-temporal modeling (with an additional decision time) might also be useful, but it adds another layer of complexity, and I am not sure Ibis has to do anything about that extra modeling layer. It might be 100% managed by application code.
As far as I understand, it's already possible to do `table_a.asof_join(table_b, on="time", by="entity_id")` in Ibis (using the duckdb backend for instance), but only for a regular time column (business/validity).
It's also possible to load a table for a given system time for some combination of data source formats / backends, for instance using `ibis.read_delta(data_path, version=last_version_before_timestamp)`, where `last_version_before_timestamp` can be manually computed by the user from the contents of `deltalake.DeltaTable(data_path).history()`. In that respect, `last_version_before_timestamp` is an example of inference based on a system time managed by the deltalake runtime.
Things that are missing:

- automatic inference of the version from a system timestamp (e.g. from the contents of `deltalake.DeltaTable(data_path).history()`),
- `asof_join` on system time.

The latter might be a direct consequence of implementing the former.
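For illustration, the manual version-inference step could be sketched as follows. The history entries and field names (`version`, `timestamp` in milliseconds since the epoch) are assumptions about the shape of `DeltaTable.history()` output, which may differ across deltalake versions:

```python
# Hypothetical history entries; the real output of
# deltalake.DeltaTable(data_path).history() may use different field
# names, so treat `version` / `timestamp` (ms since epoch) as assumptions.
history = [
    {"version": 0, "timestamp": 1_700_000_000_000},
    {"version": 1, "timestamp": 1_700_100_000_000},
    {"version": 2, "timestamp": 1_700_200_000_000},
]

def last_version_before(history, cutoff_ms):
    """Largest version whose commit timestamp is <= cutoff_ms, else None."""
    eligible = [h["version"] for h in history if h["timestamp"] <= cutoff_ms]
    return max(eligible) if eligible else None

last_version_before_timestamp = last_version_before(history, 1_700_150_000_000)
# The resulting version can then be passed to
# ibis.read_delta(data_path, version=last_version_before_timestamp).
```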
Thanks @ogrisel for sharing your thoughts.
> As far as I understand, it's already possible to do `table_a.asof_join(table_b, on="time", by="entity_id")` in Ibis (using the duckdb backend for instance), but only for a regular time column (business/validity).
Yes, `asof_join` is based on a time column.
> It's also possible to load a table for a given system time for some combination of data source formats / backends, for instance using `ibis.read_delta(data_path, version=last_version_before_timestamp)`, where `last_version_before_timestamp` can be manually computed by the user from the contents of `deltalake.DeltaTable(data_path).history()`. In that respect, `last_version_before_timestamp` is an example of inference based on a system time managed by the deltalake runtime.
If I understand it correctly, this refers to "time travel" where a snapshot of a table is accessed. The specific table snapshot to load is determined by the backend for a given timestamp. We have an issue for time travel support: https://github.com/ibis-project/ibis/issues/8203
It is important to differentiate temporal join from time travel.
- Temporal join (with the `FOR SYSTEM_TIME AS OF <time-attribute-of-a-table>` join clause) enables joining the rows within the left-table against the "corresponding rows" in the versioned right-table (versions referring to the table snapshots). Corresponding rows in the right-table are found by matching the values they have in the time-attribute columns defined for the left and right table. The versioned table can be based on an external source, e.g. for Flink:

  > Versioned tables are defined implicitly for any tables whose underlying sources or formats directly define changelogs. Examples include the upsert Kafka source as well as database changelog formats such as debezium and canal.

  Overall, temporal join allows for enriching the row values (indexed with a primary key) with values from a changing source.

- Time travel (with the `FOR SYSTEM_TIME AS OF <timestamp>` clause) enables specifying a point in time (timestamp) and querying the corresponding data of a table. This is more like accessing a snapshot of an Iceberg table.
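The difference can be sketched in plain Python over a toy versioned table (everything below is illustrative, not an Ibis or Flink API): time travel fixes one timestamp for the whole table, while a temporal join resolves a version per left-row event time.

```python
from bisect import bisect_right

# Toy versioned table (all names and values are made up): for each key,
# a sorted list of (timestamp, value) versions. Timestamps are plain ints.
versions = {
    "EUR": [(0, 1.10), (20, 1.12)],
    "JPY": [(0, 0.0090), (25, 0.0092)],
}

def value_as_of(key, ts):
    """Latest value for `key` whose version timestamp is <= ts."""
    history = versions[key]
    i = bisect_right([t for t, _ in history], ts)
    return history[i - 1][1] if i else None

# Time travel: reconstruct the whole table as of one fixed timestamp.
snapshot_at_10 = {k: value_as_of(k, 10) for k in versions}

# Temporal join: each left row is resolved against the version valid at
# that row's own event time, so different rows can see different versions.
orders = [("EUR", 15), ("EUR", 30), ("JPY", 30)]
joined = [(key, ts, value_as_of(key, ts)) for key, ts in orders]
```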
### Is your feature request related to a problem?
Event-time temporal join in Flink enables joining a table against a versioned table. Tables in Flink are temporal/dynamic, i.e., row values can change over time, or rows can be added or deleted. A versioned table contains one or more versioned table snapshots. With event-time temporal join, a table can be enriched with values retrieved from a versioned table at a certain point in time.
More information is available in the Flink docs.
### Background
We previously implemented event-time temporal join support in https://github.com/ibis-project/ibis/pull/7921. We put it on hold during the reviews due to two reasons, one being that the implementation extended `asof_join()`, while temporal join does not really fit into the semantics of `asof-join`.

### Use case(s)
A generic example given in the Flink doc is as follows.
In this example, we have (1) an `orders` table with an `order_time` event-time attribute, and `currency` and `price` fields, and (2) a `currency_rates` table with an `update_time` event-time attribute, and `currency` and `conversion_rate` fields. Here `currency_rates` is a versioned table. The user would want to join `orders` with `currency_rates` on the `currency` fields by `order_time`.

As a more specific use case in the ML space, temporal join is desired for training dataset generation.
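Since Pandas `merge_asof` is mentioned as the API inspiration later in this thread, here is a small in-memory sketch of the `orders` / `currency_rates` enrichment; the timestamps, rates, and prices are made up:

```python
import pandas as pd

# Made-up data mirroring the Flink docs example.
orders = pd.DataFrame(
    {
        "order_time": pd.to_datetime(["2024-01-01 10:15", "2024-01-01 10:30"]),
        "currency": ["EUR", "EUR"],
        "price": [100.0, 200.0],
    }
)
currency_rates = pd.DataFrame(
    {
        "update_time": pd.to_datetime(["2024-01-01 10:00", "2024-01-01 10:20"]),
        "currency": ["EUR", "EUR"],
        "conversion_rate": [1.10, 1.12],
    }
)

# For each order, pick the latest rate whose update_time <= order_time --
# the same "value as of the row's event time" semantics as an event-time
# temporal join, restricted here to static in-memory frames.
enriched = pd.merge_asof(
    orders.sort_values("order_time"),
    currency_rates.sort_values("update_time"),
    left_on="order_time",
    right_on="update_time",
    by="currency",
)
enriched["converted_price"] = enriched["price"] * enriched["conversion_rate"]
```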
### Other backends
Supporting temporal join:
Not supporting temporal join:
`processing-time` temporal join

Note: Temporal join seems to also be used to refer to the `as of join` support in other backends. However, `FOR SYSTEM_TIME AS OF` is supported by other backends, though this fits into time travel, not temporal join:

This quote might explain why Flink seems to be the only backend supporting temporal join:

Other streaming engines that support temporal join:
### API
Option 1: `temporal_join`

Example:
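A purely illustrative sketch of what an Option 1 call could look like (the method name `temporal_join` comes from the option above; the `on`/`at` argument names are assumptions, not an existing or proposed Ibis signature):

```python
# Pseudocode only -- `temporal_join` is not an existing Ibis API.
enriched = orders.temporal_join(
    currency_rates,
    on="currency",          # equality key (primary key of the versioned table)
    at=orders.order_time,   # event-time attribute used to pick the version
)
```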
Option 2: `VersionedTable` + `at_time` + `join`

Temporal join in Flink is supported only against versioned tables:

Adding the abstraction `VersionedTable` would enable Ibis to enforce the temporal join requirements. When the user defines a versioned Ibis table on a source that does not support versioning, we can let the backend error bubble up. `VersionedTable` can be overridden to execute backend-specific requirement checks. For Flink, that would be checking if the versioned table has been defined with a `primary key` and an `event-time attribute`.

Example: `table_right = con.create_table(..., versioned=True)`
Option 3: Extend `asof_join`

The previous attempt implemented this option. The rationale behind this was that the only form of `asof` join supported by Flink SQL is `temporal` join with the `FOR SYSTEM_TIME AS OF` clause.

Used Pandas `merge_asof` as the inspiration for the API example above.

### What version of ibis are you running?
8.0.0
### What backend(s) are you using, if any?
Flink
### Code of Conduct