crazy25000 closed this issue 3 years ago
I have little experience with ohlcv datasets but ... sure. Disk Cache can work for that. Make the key the timestamp and the value a tuple of OHLCV. Does that not work?
since it’s just for backtesting, it doesn’t seem like the data will need updating. Can it all fit in memory? Might be fastest to use numpy arrays that are serialized to disk.
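For example, a minimal sketch of the numpy-to-disk idea (file name and values are made up here):

```python
import numpy as np

# Hypothetical: persist one symbol's bars as a plain float64 array;
# columns here are open, high, low, close, volume.
bars = np.array([
    [225.04, 225.83, 223.88, 225.24, 91366500.0],
    [225.62, 226.75, 225.61, 226.58, 78744400.0],
])
np.save('SPY.npy', bars)                    # serialize to disk
loaded = np.load('SPY.npy', mmap_mode='r')  # memory-map to avoid loading it all
```

With `mmap_mode='r'`, numpy only pages in the parts of the array you actually touch, which helps when the full dataset doesn't fit in memory.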
The data gets updated weekly and here is a simple scenario:
```python
import yfinance as yf

df1 = yf.download("SPY", start="2017-01-01", end="2017-04-30")
df2 = yf.download("GOOG", start="2017-01-01", end="2017-04-30")
```
df1:

```
                  Open        High         Low       Close   Adj Close    Volume
Date
2017-01-03  225.039993  225.830002  223.880005  225.240005  208.213501  91366500
2017-01-04  225.619995  226.750000  225.610001  226.580002  209.452240  78744400
2017-01-05  226.270004  226.580002  225.479996  226.399994  209.285812  78379000
```
For each instrument, is there a way to query/get data based on the date for a symbol? The data won't all fit into memory unfortunately.
I saved both DataFrames to the cache using the example from the tutorial:
```python
import diskcache as dc

cache = dc.Cache('tmp')

# Store each DataFrame as a dict of numpy arrays keyed by column name,
# plus the DatetimeIndex under 'date'.
cache.add(key='SPY', value={**{k: df1[k].values for k in df1.columns.to_list()}, 'date': df1.index.values})
cache.add(key='GOOG', value={**{k: df2[k].values for k in df2.columns.to_list()}, 'date': df2.index.values})
```
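For what it's worth, with that layout a date lookup could binary-search the stored `'date'` array with `np.searchsorted` (a sketch; `row_at` is a hypothetical helper, and it assumes the dates are sorted ascending, as yfinance returns them):

```python
import numpy as np

# `value` stands in for what cache['SPY'] would return with the layout above.
value = {
    'Close': np.array([225.24, 226.58, 226.40]),
    'date': np.array(['2017-01-03', '2017-01-04', '2017-01-05'],
                     dtype='datetime64[D]'),
}

def row_at(value, date):
    """Binary-search the sorted date index and return one row as a dict."""
    i = np.searchsorted(value['date'], np.datetime64(date))
    if i == len(value['date']) or value['date'][i] != np.datetime64(date):
        raise KeyError(date)
    return {k: v[i] for k, v in value.items()}

row = row_at(value, '2017-01-04')  # row['Close'] -> 226.58
```

The catch is that `cache['SPY']` still deserializes the whole value, so a per-symbol key only pays off if the arrays fit in memory once loaded.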
Querying each symbol by date is the part I'm not sure about.
How many different symbols and how many different dates?
Right now there are 10 symbols and ~10 million rows each. Each row has a unique date.
By the way, it doesn't have to be saved to the cache like in the example I gave. It was just a way for me to test and learn what the library can do. If you have a recommendation, would love to try it!
I may be off in the weeds, but ten million rows seems like it'll fit in memory. Each "row" is 7 floats at 8 bytes each (double precision), or ~56 bytes per row. With ten million rows that's about 560 MB per symbol, and with 10 symbols about 5.6 GB total. You can rent an e2-highmem-2 on Google Cloud with 16GB of memory for $65/month. Seems like a relatively easy solution for back-testing.
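Spelled out:

```python
# Back-of-the-envelope sizing for the estimate above.
bytes_per_row = 7 * 8                    # 7 float64 columns
per_symbol = 10_000_000 * bytes_per_row  # 560_000_000 bytes (~560 MB)
total = 10 * per_symbol                  # 5_600_000_000 bytes (~5.6 GB)
```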
If you really want to use a key-value store then just make the key something like "SPY-2017-01-01" and the value a tuple of OHLCV. That'll work for a lot less memory (but will still be an order of magnitude slower than a Pandas DataFrame in memory).
The real trick is using a caching strategy that works well for a sequential scan of the data (as I assume occurs in back-testing). SQLite probably does well here by storing many data points per page and loading a whole page at a time into memory.
Using a key like "SPY-2017-01-01" is probably too granular. Consider using "SPY-2017-01" and loading a whole month's worth of data at a time. You could also use Python's LRU cache to keep some of the data in memory.
Something kind of like:
```python
import diskcache
from functools import lru_cache

cache = diskcache.Cache('/tmp/issue-199')

# The LRU cache must live at module level; defining it inside get()
# would create a fresh, empty cache on every call.
@lru_cache(maxsize=128)  # Adjust maxsize based on month size.
def get_month(symbol, year, month):
    key = f'{symbol}-{year}-{month}'
    return cache[key]

def get(symbol, year, month, day):
    return get_month(symbol, year, month)[day]
```
Those calculations make me wonder if there are memory bottlenecks in the pipeline. Each symbol actually has ~20 columns, so the memory load is roughly triple that estimate, around 18GB.
> Consider using "SPY-2017-01" and loading a whole month's worth
This is similar to what I was thinking also, but wanted to check if there was a built-in indexer/sorting method. Was secretly hoping I just missed it while reading the documentation :stuck_out_tongue:
Will test this out, thanks for talking it through!
Hello,
Would you recommend this for financial data that have OHLCV columns? Use case is to use it locally for back-testing purposes. Would need to be able to query by date/timestamps. I have it installed and after some testing, not sure how to architect a key-value structure that would allow querying by date/time. Or did I miss something in the documentation?
I like the idea of having it on disk, and so far, from initial tests with millions of rows, the performance of adding/getting makes it worth pursuing for me :smiley:
In case someone is quick to reply with a 'recommended' time-series database: yes, I've built/deployed the following and they do not perform well even with their recommended/performant settings: PostgreSQL, TimeScaleDB, marketstore, Influx, and there was another, but I can't remember the name. The only 2 worth keeping/using were sqlite and arctic/mongodb.