grantjenks / python-diskcache

Python disk-backed cache (Django-compatible). Faster than Redis and Memcached. Pure-Python.
http://www.grantjenks.com/docs/diskcache/

Is there a recommended recipe for OHLCV financial data? #199

Closed. crazy25000 closed this issue 3 years ago

crazy25000 commented 3 years ago

Hello,

Would you recommend this for financial data with OHLCV columns? The use case is local back-testing, and I'd need to be able to query by date/timestamp. I have it installed, and after some testing I'm not sure how to structure the keys and values so that I can query by date/time. Or did I miss something in the documentation?

I like the idea of having it on disk, and so far, from initial tests with millions of rows, the performance of adding/getting makes it worth pursuing for me :smiley:

In case someone is quick to reply with a 'recommended' time-series database: yes, I've built and deployed the following, and they did not perform well even with their recommended/performance-tuned settings: PostgreSQL, TimescaleDB, marketstore, InfluxDB, and one other whose name I can't remember. The only two worth keeping were SQLite and Arctic/MongoDB.

grantjenks commented 3 years ago

I have little experience with OHLCV datasets but ... sure, DiskCache can work for that. Make the key the timestamp and the value a tuple of OHLCV. Does that not work?

Since it's just for back-testing, it doesn't seem like the data will need updating. Can it all fit in memory? It might be fastest to use numpy arrays that are serialized to disk.
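
A minimal sketch of that timestamp-keyed layout (the cache path and the numbers are illustrative, not from a real feed):

import diskcache

cache = diskcache.Cache('/tmp/ohlcv-sketch')

# One OHLCV (+ adjusted close and volume) tuple per 'SYMBOL-date' key.
cache['SPY-2017-01-03'] = (225.04, 225.83, 223.88, 225.24, 208.21, 91366500)

# Reading back is a plain key lookup; there is no range query at this layer.
open_, high, low, close, adj_close, volume = cache['SPY-2017-01-03']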

crazy25000 commented 3 years ago

The data gets updated weekly and here is a simple scenario:

import yfinance as yf

df1 = yf.download("SPY", start="2017-01-01", end="2017-04-30")
df2 = yf.download("GOOG", start="2017-01-01", end="2017-04-30")

df1:

                  Open        High         Low       Close   Adj Close    Volume
Date                                                                            
2017-01-03  225.039993  225.830002  223.880005  225.240005  208.213501  91366500
2017-01-04  225.619995  226.750000  225.610001  226.580002  209.452240  78744400
2017-01-05  226.270004  226.580002  225.479996  226.399994  209.285812  78379000

Is there a way to query/get data for a symbol based on the date? The data won't all fit into memory, unfortunately.

I saved both DataFrames to the cache using the example from the tutorial:

import diskcache as dc

cache = dc.Cache('tmp')

# Store each DataFrame as a dict of numpy arrays, one per column (plus the date index), under the symbol as the cache key
cache.add(key='SPY', value={**{k:df1[k].values for k in df1.columns.to_list()}, 'date': df1.index.values})
cache.add(key='GOOG', value={**{k:df2[k].values for k in df2.columns.to_list()}, 'date': df2.index.values})

Querying each symbol by date is the part I'm not sure about.
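
One way to slice those stored arrays by date is a binary search on the sorted date column; query_by_date below is a hypothetical helper, assuming the column-wise layout stored above:

import numpy as np

def query_by_date(cache, symbol, start, end):
    # Hypothetical helper: look up the symbol's dict of numpy arrays,
    # binary-search the sorted 'date' array, and slice every column.
    data = cache[symbol]
    dates = data['date']
    lo = np.searchsorted(dates, np.datetime64(start), side='left')
    hi = np.searchsorted(dates, np.datetime64(end), side='right')
    return {name: values[lo:hi] for name, values in data.items()}

spy_jan = query_by_date(cache, 'SPY', '2017-01-03', '2017-01-05')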

grantjenks commented 3 years ago

How many different symbols and how many different dates?

crazy25000 commented 3 years ago

Right now there are 10 symbols and ~10 million rows each. Each row has a unique date.

crazy25000 commented 3 years ago

By the way, it doesn't have to be saved to the cache like in the example I gave; that was just a way for me to test and learn what the library can do. If you have a recommendation, I'd love to try it!

grantjenks commented 3 years ago

I may be off in the weeds, but ten million rows seems like it'll fit in memory. Each "row" is 7 floats (8 bytes each at double precision), or ~56 bytes per row. With ten million rows that's about 560e6 bytes. Then with 10 symbols it's about 6GB. You can rent an e2-highmem-2 on Google Cloud with 16GB of memory for $65/month. Seems like a relatively easy solution for back-testing.

If you really want to use a key-value store, then just make the key something like "SPY-2017-01-01" and the value a tuple of OHLCV. That'll work with a lot less memory (but will still be an order of magnitude slower than a Pandas DataFrame in memory).

The real trick is using a caching strategy that works well with a sequential scan of the data (as I assume occurs in back-testing). SQLite probably does well here by storing lots of data points in a page and loading a whole page at a time into memory.

Using a key like "SPY-2017-01-01" is probably too granular, though. Consider using "SPY-2017-01" and loading a whole month's worth of data at a time. You could also use Python's LRU cache to keep some of the data in memory.

grantjenks commented 3 years ago

Something kind of like:

import diskcache

from functools import lru_cache

cache = diskcache.Cache('/tmp/issue-199')

@lru_cache(maxsize=128)  # Keep recently used months in memory; adjust maxsize based on month size.
def get_month(symbol, year, month):
    # One cache entry holds a whole month of rows for a symbol.
    key = f'{symbol}-{year}-{month}'
    return cache[key]

def get(symbol, year, month, day):
    # Look up the month (memoized by lru_cache), then index the day within it.
    month_data = get_month(symbol, year, month)
    return month_data[day]
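
For the write side, chunking a DataFrame into those monthly keys could look something like this (a sketch, not a built-in API; it reuses the cache above and the df1 DataFrame from the earlier yfinance example, and stores each month as a dict keyed by day of month so get() can index it):

def store_months(cache, symbol, df):
    # Group rows by calendar month and store each chunk under a 'SYMBOL-YYYY-M' key.
    for period, chunk in df.groupby(df.index.to_period('M')):
        key = f'{symbol}-{period.year}-{period.month}'
        # Day of month -> tuple of that row's column values.
        cache[key] = {day: tuple(row) for day, row in zip(chunk.index.day, chunk.values)}

store_months(cache, 'SPY', df1)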
crazy25000 commented 3 years ago

Then with 10 symbols it's about 6GB. Seems like a relatively easy solution for back-testing.

Those calculations make me wonder if there are memory bottlenecks in the pipeline: the actual memory load is triple that amount, around 18GB, and there are ~20 columns per symbol rather than 7.

Consider using "SPY-2017-01" and loading a whole month's worth

This is similar to what I was thinking too, but I wanted to check whether there was a built-in indexer/sorting method. I was secretly hoping I'd just missed it while reading the documentation :stuck_out_tongue:

Will test this out, thanks for talking it through!