Open min-mwei opened 7 years ago
Definitely an analytical function on our short list. Thanks for the request.
(And if others would similarly like this, please upvote!)
Hi, is there an update on support for an As Of Join? I am looking to test out various time series solutions, and AsOf is something we would really like to see supported.
@mfreed
Hi,
any plans for this feature? It is a major point for the financial (tick) data use case, so it could help adoption of Timescale in this niche. Another reference, from the leading database used in this market: https://code.kx.com/wiki/Reference/aj
Thanks!
I agree. Having a performant global "AS OF" on a traditional RDBMS like Postgres would be a holy grail. Maintaining history today with Postgres is, hmm, painful. This alone has made me seriously consider using Datomic (https://docs.datomic.com/on-prem/clojure/index.html#datomic.api/as-of). A dream query for me would be writing:
SELECT foo.id, bar.baz, sup.sop FROM foo
JOIN bar ON ...
JOIN sup ON ...
WHERE ...
LIMIT By 42
AS OF 2017-11-12
and getting the state of the world from that datetime. I realise this is moving from time series and into another territory which might not be a good fit for timescaledb. But a man can hope :)
Any updates or best practices?
This seems to be a possibility? https://dba.stackexchange.com/posts/185372/revisions
This is the best I've managed to get so far:
SELECT * FROM table_a
CROSS JOIN LATERAL (
SELECT * FROM table_b
WHERE (
...
AND
table_a.time > table_b.time
)
ORDER BY table_b.time DESC LIMIT 1
) lookup
ORDER BY table_a.time;
Still not very performant though, takes >1s on two tables of ~200k rows (pandas can do it >50x faster).
Are there any plans to add this? Would be a really big deal for financial applications, among many others
+1
Any update here? Is there an ETA for as-of join support?
Very under the (my) radar, mariaDB suddenly has very cool looking support for AS OF time machining:
@runekaagaard thanks, so no ETA?
Just to be crystal clear: We are referring to as-of joins and not AS OF for data versioning.
A good example of as-of joins is here: https://pandas.pydata.org/docs/reference/api/pandas.merge_asof.html --> see the bottom of that page, which illustrates how to join market trades & quotes
The kdb+ equivalent command is "aj", see https://code.kx.com/q/ref/aj/
If I misunderstood the purpose of this issue here, let me know and I'm happy to open another one
The addition of as-of joins within Timescale would be a huge benefit to us. We have lots of high frequency time series datasets from environmental sensors that have slightly differing timestamps (at the ms scale) and occasional gaps. Our requirements are pretty much met by the functionality described in the pandas.merge_asof documentation but it would be great if we could do the same sort of thing at the database level. A possible addition would be to allow for aligning the sensor time series datasets against a strict timeline (e.g. every exact second, minute etc.).
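To make the requested behaviour concrete, here is a minimal pure-Python sketch of a backward as-of match used to align irregular sensor readings to a strict one-second timeline. The function name `asof_align`, its parameters, and the sample readings are hypothetical and only illustrate the semantics; this is not how pandas or TimescaleDB implement it.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def asof_align(readings, start, step, count, tolerance=None):
    """Align (timestamp, value) readings, sorted by timestamp, to a regular grid.

    For each grid tick, take the latest reading at or before the tick
    (a backward as-of match). If `tolerance` is given, a reading older
    than that is treated as a gap and the tick gets None.
    """
    times = [t for t, _ in readings]
    aligned = []
    for i in range(count):
        tick = start + i * step
        pos = bisect_right(times, tick) - 1  # index of last reading <= tick
        value = None
        if pos >= 0 and (tolerance is None or tick - times[pos] <= tolerance):
            value = readings[pos][1]
        aligned.append((tick, value))
    return aligned


# Hypothetical sensor data: timestamps drift by a few ms and one stretch is missing.
t0 = datetime(2021, 1, 1, 0, 0, 0)
readings = [
    (t0, 10.0),
    (t0 + timedelta(milliseconds=998), 11.0),
    # gap: nothing recorded near the 2 s and 3 s ticks
    (t0 + timedelta(milliseconds=3997), 14.0),
]
grid = asof_align(
    readings, t0, timedelta(seconds=1), 5, tolerance=timedelta(milliseconds=500)
)
for tick, value in grid:
    print(tick.time(), value)
```

With the tolerance set, the ticks that fall inside the gap come back as None instead of silently reusing a stale reading, which is roughly what the `tolerance` argument of pandas.merge_asof is for.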
The simple way to do this is to write the following sort of join:
CREATE TABLE foo (time timestamptz, id int, value double precision);
CREATE INDEX on foo(id, time DESC);
SELECT t1.time, t1.value as t1_val, t2.value as t2_val
FROM foo AS t1,
LATERAL (
SELECT value
FROM foo t2
WHERE t2.id = 2 AND t2.time <= t1.time
ORDER BY t2.time DESC
LIMIT 1
) t2
WHERE t1.id = 1
ORDER BY t1.time;
You will definitely want the index on (id, time DESC) there, as that will make it much more efficient.
In cases where the tables are separate it will be slightly different, simpler in some ways as it involves less aliasing:
CREATE TABLE foo1 (time timestamptz, value double precision);
CREATE TABLE foo2 (time timestamptz, value double precision);
CREATE INDEX on foo1(time DESC);
CREATE INDEX ON foo2 (time DESC) INCLUDE (value);
SELECT foo1.time, foo1.value as foo1_val, foo2.value as foo2_val
FROM foo1,
LATERAL (
SELECT foo2.value
FROM foo2
WHERE foo2.time <= foo1.time
ORDER BY foo2.time DESC
LIMIT 1
) foo2
ORDER BY foo1.time;
This may not always be the most efficient, but it should work reasonably well in smallish cases. We'll also work on some ways of doing this with the timeseries API as discussed: https://github.com/timescale/timescale-analytics/issues/162 that may be more efficient in some cases.
Note: the LATERAL query above benefits significantly from being able to perform an index-only scan on foo2, so you would want an index like:
CREATE INDEX ON foo2 (time DESC) INCLUDE (value);
Thank you @davidkohn88 this is a fantastic solution suggestion. 👍
I am also very interested in this "as_of" join, what would be the best solution for now?
The joins above (https://github.com/timescale/timescaledb/issues/271#issuecomment-865231568) are reasonable solutions for now and should be reasonably performant, depending on what you're doing. We're also thinking about adding some more functionality around this in the toolkit, but it's probably a ways off. For now, you can add comments in this issue, and maybe explain more about what exactly you're trying to achieve and what you think we should prioritize: https://github.com/timescale/timescaledb-toolkit/issues/162
With LATERAL join on timestamp inequality, I guess the right side is going to do index lookups, but it's still O(left_rows * log(right_rows)). A merge-join-like algorithm will be O(left_rows + right_rows).
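The difference between the two strategies can be sketched in plain Python, assuming both inputs are already sorted by time. The function names and sample data are hypothetical; the binary search stands in for the per-row index lookup, and the single forward pass stands in for a merge-join-like executor. This is an illustration of the complexity argument, not of the Postgres executor.

```python
from bisect import bisect_right


def asof_by_lookup(left_times, right_rows):
    """One binary search per left row: O(len(left) * log(len(right)))."""
    right_times = [t for t, _ in right_rows]
    out = []
    for t in left_times:
        pos = bisect_right(right_times, t) - 1  # last right row with time <= t
        out.append((t, right_rows[pos][1] if pos >= 0 else None))
    return out


def asof_by_merge(left_times, right_rows):
    """Single forward pass over both sorted inputs: O(len(left) + len(right))."""
    out = []
    pos = -1  # index of the latest right row with time <= current left time
    for t in left_times:
        # The cursor only ever moves forward, so each right row is visited once.
        while pos + 1 < len(right_rows) and right_rows[pos + 1][0] <= t:
            pos += 1
        out.append((t, right_rows[pos][1] if pos >= 0 else None))
    return out


left = [1, 5, 5, 9, 12]
right = [(2, "a"), (4, "b"), (5, "c"), (11, "d")]
print(asof_by_lookup(left, right))
print(asof_by_merge(left, right))
```

Both functions return the same rows; the merge variant simply never rewinds on the right side, which is where the O(left_rows + right_rows) bound comes from.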
An older attempt of implementing this in vanilla postgres: https://www.postgresql.org/message-id/flat/bc494762-26bd-b100-e1f9-a97901ddad57%40postgrespro.ru
For the reference, ClickHouse uses special grammar for this: https://clickhouse.com/docs/en/sql-reference/statements/select/join/#asof-join-usage
The different grammar is probably required because the semantics are different from a normal join on an inequality condition -- we only have to return the closest righthand row that matches. In an extension, we can't introduce new grammar (or can we?), so we can consider using a special dummy function for the join, e.g. JOIN ON left.series = right.series AND timescale.asof(left.timestamp <= right.ts).
I had an old patch that extended the merge join executor to support full join on inequality: https://www.postgresql.org/message-id/flat/b31e1a2d-5ed2-cbca-649e-136f1a7c4c31@postgrespro.ru I think I could simplify and reuse it for ASOF joins.
Wrote a design memo for internal use: https://docs.google.com/document/d/1YEX038V-gq-iLfM-KpLV0zpNn3yW88I9qPVa0_bs4aQ/edit
Is there an agreed way forward on this issue even if addressing it isn't imminent? A related timescaledb-toolkit issue was closed last year pending work on 'multi-value timeseries' but it is not clear what this functionality refers to. The latest posts in this issue suggest the asof-join functionality could be addressed within Postgres itself but it is difficult to know from the discussion whether this is likely to happen in the near-future.
We're planning to try to prototype this inside the TimescaleDB extension in Q4 of 2022, using the no. 3 merge-hash algorithm I posted upstream. That's just a research prototype; we can't promise when or if we will release something that is actually usable.
We have found the “as-of” join to be essential in our analyses involving high frequency financial data (market data, trade data etc.). The approach we present in the link below seems to be very fast and execution time appears to scale linearly with row count (n), unlike lateral join type approaches that typically scale as O(n^2). The SQL query we link to could perhaps be further optimized but we think that this is a promising approach in general.
https://gist.github.com/RMB-eQuant/758539f8914f2dd4461ec0ce144b048b
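The gist itself is not reproduced here. As a rough illustration of the general idea that the name "UNION ALL ALGO" suggests (this reading is an assumption), the Python sketch below tags rows from both inputs, combines them into one time-ordered stream, and carries the most recent right-hand value forward onto each left-hand row. The function name `asof_by_union` and the sample trades and quotes are hypothetical.

```python
def asof_by_union(left_rows, right_rows):
    """As-of join by concatenating both inputs and carrying values forward.

    Each input is a list of (time, value) pairs. Rows are tagged with their
    origin; at equal timestamps, right rows (tag 0) sort before left rows
    (tag 1), so a left row can match a right row with the same time.
    """
    tagged = [(t, 0, v) for t, v in right_rows] + [(t, 1, v) for t, v in left_rows]
    tagged.sort(key=lambda row: (row[0], row[1]))

    out = []
    last_right = None  # most recent right-hand value seen so far
    for t, tag, v in tagged:
        if tag == 0:
            last_right = v
        else:
            out.append((t, v, last_right))
    return out


trades = [(3, "T1"), (7, "T2"), (7, "T3"), (10, "T4")]
quotes = [(1, 100.0), (7, 101.5), (9, 102.0)]
for row in asof_by_union(trades, quotes):
    print(row)
```

After the sort, the work is a single linear scan. The sort itself costs O(n log n) here; a database that can read both inputs in time order from an index would not need to pay for it separately.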
The table below compares the execution time of our approach (called “UNION ALL ALGO”) to the lateral join approach presented in the post https://github.com/timescale/timescaledb/issues/271#issuecomment-865231568 (“LATERAL JOIN”). Execution time is in seconds and the benchmarks were run on a Timescale-pro-100-16gb-2cpu-compute-optimized (2 CPU, 16 GB RAM, 100 GB storage) instance. In the table below, NaN values correspond to runs that were too slow to complete.
For time series data, a powerful function is asof pioneered in kx/kdb, also being implemented by Pandas: http://pandas.pydata.org/pandas-docs/version/0.19.0/whatsnew.html#whatsnew-0190-enhancements-asof-merge
It would be really cool if TimescaleDB could support it natively instead of having to write 3 tricky queries to join two time series tables.