timescale / timescaledb-toolkit

Extension for more hyperfunctions, fully compatible with TimescaleDB and PostgreSQL 📈
https://www.timescale.com

Create financial-focused OHLC aggregation #445

Closed ryanbooz closed 2 years ago

ryanbooz commented 2 years ago

Is your feature request related to a problem? Please describe. Many users of TimescaleDB work with financial time-series data. Many of our examples and tutorials show how to do basic OHLC bucketing, saving at least four columns using MAX/MIN/FIRST/LAST as the functions. It's a repetitive process, and some functions don't work well with re-aggregation - and even when they do (MAX/MIN), most of our documentation warns users against re-aggregating with functions unless they were designed to do so.

Describe the solution you'd like An aggregate that stores the necessary data so that the appropriate values for financial charts can be accessed at query time. At a minimum, users should be able to easily identify the Open/High/Low/Closing value within a bucket. Allowing re-aggregation over smaller buckets would be powerful: users could then satisfy multiple bucket widths (1-min, 5-min, 10-min, 60-min) from one CAGG, rather than having to create various granularities.

Describe alternatives you've considered Currently, the only solution we can really offer people is to follow our schema examples for OHLC data, using separate aggregate functions and creating multiple CAGGs. The big challenge is that developing an application to view multiple "zoom levels" isn't easy without complicated UNION queries that pull from different CAGGs depending on the "zoom level" (overall query length). Being able to use the same SQL query and simply re-aggregate to a different time bucket would be 🔥

Additional context @davidkohn88 suggested I jump-start the conversation by creating the issue 😉. We have so many users that do this kind of aggregation. If we could simplify their storage (one CAGG) and query (only have to change the time bucket interval value) - this could simplify how users query data significantly.

jerryxwu commented 2 years ago

@ryanbooz @davidkohn88 One obvious UX design that I can think of looks like this,

Create a 1-minute CAGG from a raw trading data table called “raw_trades”.

CREATE MATERIALIZED VIEW ohlcv_1m
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', ts) AS time,
       symbol, 
       ts_ohlcv_agg(ts, price, volume) AS ohlcv
FROM raw_trades
GROUP BY time, symbol

Roll up the 1-minute ohlcv aggregates to the desired time interval at query time, and get open, high, low, close, and volume data through accessor functions such as ts_open, ts_high, etc.

SELECT time_bucket('1 hour', time) AS hour,
       ts_open(rollup(ohlcv)) AS open,
       ts_high(rollup(ohlcv)) AS high,
       ts_low(rollup(ohlcv)) AS low,
       ts_close(rollup(ohlcv)) AS close,
       ts_volume(rollup(ohlcv)) AS volume
FROM ohlcv_1m
WHERE time > $time_from AND
      symbol = $symbol
GROUP BY hour

This design follows our established two-step aggregation design pattern. Do you feel that's sufficient UX improvement worth pursuing?
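For anyone new to the two-step pattern, the merge semantics behind rollup can be sketched roughly like this. This is an illustration only, not the toolkit's actual implementation; the state layout and names are invented:

```python
from dataclasses import dataclass

@dataclass
class OhlcvState:
    """Invented partial-aggregate state for one time bucket."""
    open_time: int    # timestamp of the first trade in the bucket
    open: float
    high: float
    low: float
    close_time: int   # timestamp of the last trade in the bucket
    close: float
    volume: float

def rollup(a: OhlcvState, b: OhlcvState) -> OhlcvState:
    """Merge two partial states into one wider bucket.

    Because the state keeps the first/last timestamps, the merge is
    commutative: argument order does not matter.
    """
    first = a if a.open_time <= b.open_time else b
    last = a if a.close_time >= b.close_time else b
    return OhlcvState(
        open_time=first.open_time,
        open=first.open,                # open of the earliest sub-bucket
        high=max(a.high, b.high),
        low=min(a.low, b.low),
        close_time=last.close_time,
        close=last.close,               # close of the latest sub-bucket
        volume=a.volume + b.volume,
    )
```

This is what makes re-aggregation safe: each accessor reads the merged state rather than re-applying MAX/MIN/FIRST/LAST to already-aggregated values.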

davidkohn88 commented 2 years ago

I think we can remove the ts_ part of these, but otherwise, that's exactly the sort of thing I think we were thinking about. I think the UX improvement is pretty significant in terms of the way it works with continuous aggregates.

I think also that making this specific to the financial use case and using their terminology would be great here as well and act as an entrance point for folks where we might get more feedback on other things we could be doing with something like this.

ryanbooz commented 2 years ago

Yep. Overall I think this could be a huge win. The one challenge will be volume as we're finding that different APIs provide it in different ways. The provider we chose (allowed us to use their data for free!) provides an incremental value with each trade, but only during trading hours.

Other APIs provide the volume of each trade and the user has to determine it (with a SUM or similar).

So, knowing the basic UX of how most APIs tend to provide data might be helpful here. If it's only two or three ways, maybe we have different accessors depending on what's being stored in volume?

davidkohn88 commented 2 years ago

Leaving out the volume part might also be easier here, though a volume-weighted average price could be a useful metric to add on top of the typical OHLC values if we had volume.

I think we'd need an argument to the agg about how the volume works, though we might be able to just store the results for both interpretations, i.e., store the sum and the min/max of volume, and depending on the accessor output either the sum or the max - min... hmmm... could end up overflowing in some cases, so we'd need to account for that... but not crazy...
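A rough sketch of that "store both" idea, with all names invented, just to show how the choice could be deferred to accessor time:

```python
from dataclasses import dataclass

@dataclass
class VolumeState:
    """Hypothetical partial state that keeps enough about the volume
    column to answer either interpretation later."""
    total: float   # running sum (per-trade volumes)
    vmin: float    # smallest value seen (cumulative counter)
    vmax: float    # largest value seen

    def add(self, v: float) -> None:
        self.total += v
        self.vmin = min(self.vmin, v)
        self.vmax = max(self.vmax, v)

def volume_as_sum(s: VolumeState) -> float:
    """Accessor for APIs that report per-trade volume."""
    return s.total

def volume_as_delta(s: VolumeState) -> float:
    """Accessor for APIs that report a cumulative session counter."""
    return s.vmax - s.vmin
```

The trade-off is exactly the one noted above: the running sum of an already-cumulative counter can grow very large, which is where the overflow concern comes in.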

ryanbooz commented 2 years ago

I like the weighted average price idea!

The argument agg/accessor is an interesting idea. I suppose one of the benefits of storing both could be that we future proof the data as an API might change how they'd present the data over time, but users wouldn't (necessarily) have to change the CAGG to accommodate that... just the query. 🤔

jerryxwu commented 2 years ago

I also like the idea of volume-weighted average. This is how Investopedia defines it:

The volume-weighted average price (VWAP) is a technical analysis indicator used on intraday charts that resets at the start of every new trading session.

It appears to me that we may want to think about this problem in a broader stock/security trading context. For example, once a CAGG is defined, do we need a way to deal with stock splits?
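For concreteness, the VWAP definition quoted above boils down to sum(price × volume) / sum(volume) over one trading session (it resets at the start of each session). A minimal sketch:

```python
def vwap(trades):
    """Volume-weighted average price for a single session.

    trades: iterable of (price, volume) pairs. Returns None if there
    is no volume (e.g. an empty session).
    """
    pv = sum(p * v for p, v in trades)
    vol = sum(v for _, v in trades)
    return pv / vol if vol else None
```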

zseta commented 2 years ago

I think this feature would greatly help users who deal with financial data. Additional values that we could calculate within the aggregation are technical indicators. AFAIK, most if not all technical indicators can be calculated from the raw tick data. On Twelve Data Docs you can find a list of technical indicators that we could consider. You can also see which are in high demand that could be prioritized to implement first. An example would look like this (extending the previous SQL example):

SELECT time_bucket('1 hour', time) AS hour,
       open(rollup(ohlcv)) AS open,
       high(rollup(ohlcv)) AS high,
       low(rollup(ohlcv)) AS low,
       close(rollup(ohlcv)) AS close,
       rsi(rollup(ohlcv)) AS rsi_indicator, --calculates the magnitude of a price change
       macd(rollup(ohlcv)) AS macd_indicator, --Moving Average Convergence Divergence
       ema(rollup(ohlcv)) AS ema --Exponential Moving Average, places greater importance on recent data points
FROM ohlcv_1m
WHERE time > $time_from AND symbol = $symbol
GROUP BY hour
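As a point of reference for the ema accessor above: a standard exponential moving average with the common smoothing factor alpha = 2 / (period + 1) can be sketched like this (an illustration, not a toolkit implementation):

```python
def ema(values, period):
    """Exponential moving average of a series; returns the final EMA.

    Seeds with the first value, then blends each new value in with
    weight alpha = 2 / (period + 1), so recent points count more.
    """
    alpha = 2.0 / (period + 1)
    result = None
    for v in values:
        result = v if result is None else alpha * v + (1 - alpha) * result
    return result
```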

zhenyakovalyov commented 2 years ago

as a related question to technical indicator functionality - is there a lower level API that one can use to define a custom function (such as a volume weighted average price calculation)?

rtwalker commented 2 years ago

@zhenyakovalyov we're planning to release an OHLC aggregate (without volume taken into account) in the very near future. This was a more straightforward implementation and gives us some more time to think about how we'd like to treat volume data and calculations.

We're aware of volume being reported either as a running total that increments with each trade during a session, or as a per-trade amount that the user has to sum.

Do you expect one of these to be the default case rather than the other? For example, we're considering a finance_agg (placeholder name) aggregate that might do OHLC, moving average, and other basic technical indicator calculations from the timestamp, price, and volume inputs, with an optional parameter (with a default) indicating the volume format. That could look like:

CREATE MATERIALIZED VIEW ohlcv
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', ts) AS time
     , symbol
     , finance_agg(ts, price, volume)
  FROM raw_trades
 GROUP BY time, symbol

or

CREATE MATERIALIZED VIEW ohlcv
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', ts) AS time
     , symbol
     , finance_agg(ts, price, volume, incremental_volume => true)
  FROM raw_trades
 GROUP BY time, symbol

We would love feedback on this kind of API while we're still in the early stages of designing it! For example, is what you had in mind more like another argument, e.g. finance_agg(..., weight_price_by_volume => true) or something else?

zseta commented 2 years ago

I like this API so far! 💯 Some feedback if it helps: regarding volume, in my experience there are three ways to calculate volume within the aggregate function, depending on the raw data:

SuperDaveOsbourne commented 2 years ago

If there were two columns, one for period volume and one for aggregated volume, as many data companies deliver it, this would be less of an issue. If there's only one, I would want just the period volume, as it's a lot easier (maybe less costly?) to sum a volume column than to run a window function that subtracts the previous cumulative volume from the current one. Could that be a switch: if there's only one column, deliver the difference between the previous and current values as an option? A kind of built-in window function for volume?
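To illustrate the conversion being described: with a cumulative column, each period's volume is just the difference between consecutive readings. A sketch (not a proposed API):

```python
def period_volumes(cumulative):
    """Convert a cumulative volume series into per-period volumes
    by subtracting each previous reading from the current one."""
    out = []
    prev = 0.0
    for v in cumulative:
        out.append(v - prev)
        prev = v
    return out
```

Summing a per-period column is a plain aggregate, while this subtraction needs ordered access to the previous row, which is why doing it inside the aggregate (the "switch" idea) would be convenient.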

rtwalker commented 2 years ago

closed by #596

rnbokade commented 1 year ago

Is there scope to add another function to calculate price change as (current close - previous close) using lag(), accounting for resets? The same could also be done for change in volume where volume is given as cumulative volume.

Also, there could be an extension to include another value called "open interest", which is generally a very important indicator in derivative markets.
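A sketch of what "accounting for resets" could mean for a cumulative volume series (hypothetical, not an existing toolkit function): a drop in a cumulative counter signals the start of a new session rather than a negative change.

```python
def changes_with_resets(series):
    """Per-row change in a cumulative series, treating any decrease
    as a session reset (change reported as 0 at the reset point)."""
    out = []
    prev = None
    for v in series:
        if prev is None or v < prev:  # first row, or counter reset
            out.append(0.0)
        else:
            out.append(v - prev)
        prev = v
    return out
```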