Backtest data improvements

limx0 commented 3 years ago

We should take a look at improving the way data is loaded into the backtests. Theres' quite a few ways people could want to load data into nautilus, but I think we can make some improvements to the user experience by adding a module and a couple of helper classes for loading data.

I can think of the follow ways people might have gathered some data that they want to feed into nautilus (please add any I have missed!):

Data download / purchased from a vendor. Likely in a nice-ish format (csv or json), and should only require an adapter like interface to turn into nautilus objects.
Data recorded by the user, a script (or similar) that logs the raw format received from the broker/exchange. Probably need some massaging (removing log formatting like timestamps?) before being passed to an adapter.
Other not so nice data (PCAP etc?). Needs more processing, adapter might not exist; construct the objects on your own.

A couple of the things I think could be improved:

Theres no easy way to load json or other text/byte-like data into the backtests.
We should allow batching / streaming from some data source (files) that doesn't require loading the entire dataset into memory (but for caching / performance reasons we should also continue to allow it). This could either be a directory of files (backtesting over daily files for example) or one large file (> 100mb?) that we would like to stream in chunks.
There isn't any way to access data in a non-local storage class (s3 for example).
Operations involving parsing data into nautilus objects are likely going to be the most expensive thing that the backtest does, we could consider (optionally) caching the data into a friendlier format for faster repeated runs.

Proposal:

1) Improve discovery & access to backtest data.

Build on the excellent fsspec library to improve access to files stored in many places (gcs, s3 etc).
Add some sort of BacktestDataLoader (revive the backtest data container?) that can resolve a path in fsspec (path could be a file, or a directory of files. Could be locally or in s3), and discover file(s) for loading. 2) Add a BacktestDataTransformer which can take data from the Loader and parse nautilus objects (including any related munging of logs or various data formats)
Optionally, we could cache the output of this step for later 3) Add a BacktestTradingNode class with an event loop, and patch the backtest engine to be able to operate iteratively. This will allow us to handle streaming / batch data efficiently.

I'd be keen to hear other thoughts on the above, as well as knowing how everyone else is storing and accessing their data for backtests?

cjdsellers commented 3 years ago

All great points.

Adding my own thoughts, we could separate the planning and implementation stages on the data pipeline side into: Data storage/warehousing (many sources and formats both local and remote) -> data streaming/downloading into local environment DataStreamer? DataLoader ? fsspec -> data transformation into Nautilus objects DataTransformer? -> BacktestDataContainer rises from the grave. -> BacktestEngine data ingest and running (example/user scripts) and/or: -> BacktestBatchRunner? utilizing the above machinery.

Separating things in this ways allows users the flexibility to conduct research/exploration or other testing from an intermediate format (parquet etc), pandas DataFrames as required to interface with other Python libraries.

Then it may make sense to revive something like a BacktestDataContainer to hold data which has been converted into Nautilus objects for any reason including backtest runs. Following this approach would standardize the API of BacktestEngine to deal exclusively with built Nautilus objects, which would otherwise be a space complexity issue without the existence of some kind of batch runner.

The batch runners main task would be to orchestrate data ingest -> transformation -> backtest run (without a full reset) jobs.

It would probably also make sense to provide some standard parser base classes, which should be general enough to be used for either live or backtest use cases. Or some kind of multi-adapter fsspec -> Nautilus objects.

limx0 commented 3 years ago

Just an update - the first pass of this is done in #343. Still on the todo list are:

Iterative streaming (though this would be pretty simple for someone to add).
Backtest engine improvements to handle streaming data (event loop).
Allowing the catalog to be in a remote store (s3), rather than just locally.

I will likely address the above in the coming weeks.

limx0 commented 3 years ago

Closing this in favour of newer issues.

nautechsystems / nautilus_trader

Backtest data improvements #337

Proposal: