timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/
Other
17.84k stars 884 forks source link

Add support for JOINing time series via "AS OF" time #271

Open min-mwei opened 7 years ago

min-mwei commented 7 years ago

For time series data, a powerful function is asof pioneered in kx/kdb, also being implemented by Pandas: http://pandas.pydata.org/pandas-docs/version/0.19.0/whatsnew.html#whatsnew-0190-enhancements-asof-merge

It would be really cool if Timescaledb could support it natively instead of having write 3 tricky queries to join two time series tables.

mfreed commented 7 years ago

Definitely an analytical function on our short list. Thanks for the request.

(And if others would similar like this, please upvote!)

ydesai0830 commented 6 years ago

Hi, is there an update on support for an As Of Join. I am looking to test out various Timeseries solutions and AsOf is something we would really like to support.

@mfreed

oldrichsmejkal commented 5 years ago

Hi,

any plan for this feature? It is major point for financial (tick) data use case, so it can help to adopt timescale in this niche.. Another reference from leading db used in this market..: https://code.kx.com/wiki/Reference/aj

Thanks!

runekaagaard commented 5 years ago

I agree. Having a performant global "AS OF" on a traditional RDMS like postgres would be a holy grail. Maintaining history today with postgres is, hmm, painful. This alone has made me seriously consider using Datomic (https://docs.datomic.com/on-prem/clojure/index.html#datomic.api/as-of). A dream query for me would be writing:

SELECT foo.id, bar.baz, sup.sop FROM foo
JOIN bar ON ...
JOIN sup ON ...
WHERE ...
LIMIT By 42
AS OF 2017-11-12

and getting the state of the world from that datetime. I realise this is moving from time series and into another territory which might not be a good fit for timescaledb. But a man can hope :)

franz101 commented 5 years ago

Any updates or best practices?

franz101 commented 5 years ago

This seems to be a possibility? https://dba.stackexchange.com/posts/185372/revisions

mridsole commented 5 years ago

This is the best I've managed to get so far:

SELECT * FROM table_a
CROSS JOIN LATERAL (
  SELECT * FROM table_b
  WHERE (
    ...
    AND
    table_a.time > table_b.time
  )
  ORDER BY table_b.time DESC LIMIT 1
) lookup
ORDER BY table_a.time;

Still not very performant though, takes >1s on two tables of ~200k rows (pandas can do it >50x faster).

EgorKraevTransferwise commented 4 years ago

Are there any plans to add this? Would be a really big deal for financial applications, among many others

kzk2000 commented 3 years ago

+1

Any update here? Is there an ETA for as-of join support?

runekaagaard commented 3 years ago

Very under the (my) radar, mariaDB suddenly has very cool looking support for AS OF time machining:

https://mariadb.com/kb/en/system-versioned-tables/

kzk2000 commented 3 years ago

@runekaagaard thanks, so no ETA?

Just to be crystal clear: We are referring to as-of joins and not AS OF for data versioning.

Good example for AS OF JOINS is here: https://pandas.pydata.org/docs/reference/api/pandas.merge_asof.html --> see bottom of that page which illustrates how to join market trades & quotes

The kdb+ equivalent command is "aj", see https://code.kx.com/q/ref/aj/

If I misunderstood the purpose of this issue here, let me know and I'm happy to open another one

alex-tate commented 3 years ago

The addition of as-of joins within Timescale would be a huge benefit to us. We have lots of high frequency time series datasets from environmental sensors that have slightly differing timestamps (at the ms scale) and occasional gaps. Our requirements are pretty much met by the functionality described in the pandas.merge_asof documentation but it would be great if we could do the same sort of thing at the database level. A possible addition would be to allow for aligning the sensor time series datasets against a strict timeline (e.g. every exact second, minute etc.).

davidkohn88 commented 3 years ago

The simple way to do this is to write the following sort of join:

CREATE TABLE foo ( time timestamptz, id int, val double precision);

CREATE INDEX on foo(id, time DESC);

SELECT t1.time, t1.value as t1_val, t2.value as t2_val
FROM foo AS t1, 
LATERAL (
  SELECT value
  FROM foo t2
  WHERE t2.id = 2 AND t2.time <= t1.time
  ORDER BY t2.time DESC
  LIMIT 1
) t2
WHERE t1.id = 1
ORDER BY t1.time;

You will definitely want the index on id, time DESC there, as that will make it much more efficient.

In cases where the tables are separate it will be slightly different, simpler in some ways as it involves less aliasing:


CREATE TABLE foo1 ( time timestamptz, val double precision);
CREATE TABLE foo2( time timestamptz, val double precision)

CREATE INDEX on foo1(time DESC);
CREATE INDEX ON foo2 (time DESC) INCLUDE (value);

SELECT foo1.time, foo1.value as foo1_val, foo2.value as foo2_val
FROM foo1,  
LATERAL (
  SELECT foo2.value
  FROM foo2 
  WHERE foo2.time <= foo1.time
  ORDER BY foo2.time DESC
  LIMIT 1
) foo2
ORDER BY foo1.time;

This may not always be the most efficient, but it should work reasonably well in smallish cases. We'll also work on some ways of doing this with the timeseries API as discussed: https://github.com/timescale/timescale-analytics/issues/162 that may be more efficient in some cases.

JLockerman commented 3 years ago

note: the LATERAL query above benefits significantly from being able to perform an index-only scan on foo2, you would want an index like

CREATE INDEX ON foo2 (time DESC) INCLUDE (value);
NunoFilipeSantos commented 3 years ago

Thank you @davidkohn88 this is a fantastic solution suggestion. đź‘Ť

enthusiastics commented 3 years ago

I am also very interested in this "as_of" join, what would be the best solution for now?

davidkohn88 commented 3 years ago

I am also very interested in this "as_of" join, what would be the best solution for now?

The joins above ( https://github.com/timescale/timescaledb/issues/271#issuecomment-865231568) are reasonable solutions for now and should be reasonably performant, depending on what you're doing, we're also thinking about adding some more functionality around this in the toolkit, but it's probably a bit of a ways off, but you can add comments in this issue for now, and maybe explain more what exactly you're trying to achieve and what you think we should prioritize: https://github.com/timescale/timescaledb-toolkit/issues/162

akuzm commented 3 years ago

With LATERAL join on timestamp inequality, I guess the right side is going to do index lookups, but it's still O(left_rows * log(right_rows)). A merge-join-like algorithm will be O(left_rows + right_rows).

An older attempt of implementing this in vanilla postgres: https://www.postgresql.org/message-id/flat/bc494762-26bd-b100-e1f9-a97901ddad57%40postgrespro.ru

For the reference, ClickHouse uses special grammar for this: https://clickhouse.com/docs/en/sql-reference/statements/select/join/#asof-join-usage

akuzm commented 3 years ago

The different grammar is probably required because the semantics is different from the normal join on inequality condition -- we only have to return the closest righthand row that matches. In an extension, we can't introduce new grammar (or can we?), so we can consider using a special dummy function for the join, e.g. JOIN ON left.series = right.series AND timescale.asof(left.timestamp <= right.ts).

I had an old patch that extended the merge join executor to support full join on inequality: https://www.postgresql.org/message-id/flat/b31e1a2d-5ed2-cbca-649e-136f1a7c4c31@postgrespro.ru I think I could simplify and reuse it for ASOF joins.

akuzm commented 3 years ago

Wrote a design memo for internal use: https://docs.google.com/document/d/1YEX038V-gq-iLfM-KpLV0zpNn3yW88I9qPVa0_bs4aQ/edit

akuzm commented 2 years ago

The discussion upstream: https://www.postgresql.org/message-id/flat/CALzhyqwuVz0FJZ-oCYQ9d%2ByrPrbF5a9HDyAjxuSUdgq8n7nshQ%40mail.gmail.com

alex-tate commented 2 years ago

Is there an agreed way forward on this issue even if addressing it isn't imminent? A related timescaledb-toolkit issue was closed last year pending work on 'multi-value timeseries' but it is not clear what this functionality refers to. The latest posts in this issue suggest the asof-join functionality could be addressed within Postgres itself but it is difficult to know from the discussion whether this is likely to happen in the near-future.

akuzm commented 2 years ago

Is there an agreed way forward on this issue even if addressing it isn't imminent? A related timescaledb-toolkit issue was closed last year pending work on 'multi-value timeseries' but it is not clear what this functionality refers to. The latest posts in this issue suggest the asof-join functionality could be addressed within Postgres itself but it is difficult to know from the discussion whether this is likely to happen in the near-future.

We're planning to try and prototype this inside the TimescaleDB extension in the Q4 of 2022, using the no. 3 merge-hash algorithm I posted upstream. That's just a research prototype, can't promise when and if we will release something that is actually usable.

RMB-eQuant commented 1 year ago

We have found the “as-of” join to be essential in our analyses involving high frequency financial data (market data, trade data etc.). The approach we present in the link below seems to be very fast and execution time appears to scale linearly with row count (n), unlike lateral join type approaches that typically scale as O(n^2). The SQL query we link to could perhaps be further optimized but we think that this is a promising approach in general.

https://gist.github.com/RMB-eQuant/758539f8914f2dd4461ec0ce144b048b

The table below compares the execution time of our approach (called “UNION ALL ALGO”) to the lateral join approach presented in the post https://github.com/timescale/timescaledb/issues/271#issuecomment-865231568 (“LATERAL JOIN”). Execution time is in seconds and the benchmarks were run on a Timescale-pro-100-16gb-2cpu-compute-optimized (2 CPU, 16 GB RAM, 100 GB storage) instance. In the table below, NaN values correspond to runs that were too slow to complete.

image