Catalog v2 - Githubissues

limx0 commented 1 year ago

Background

I'd like to propose a refactor of the data catalog, focusing on a couple of goals:

Increasing performance
Increasing usability
Reducing maintenance burden

I'm opening this issue to engage our community and encourage users to share their opinions or ideas on the current implementation of the Data Catalog. Fundamentally, the goal of the catalog is to allow users to store their data in a simple format that can be queried in a performant way to load data into the backtest engine for backtesting. It should be relatively easy for users to get their data into the catalog. A secondary goal is storing data for research purposes.

I should also note there is on-going work to "rustify" the catalog, but this is mostly related to parsing/deserializing the data retreived from the catalog and will continue / won't be affected by this issue.

Current State

Our (mostly my) initial implementation of the catalog has a couple of issues:

The code is a bit of a mess, it was written in quite an exploratory fashion, with a mixture of functional and OOP techniques.
I added quite a few "nice to have" features in, assuming users would get value from them.

All of this has made the maintenance burden quite high, and hacking on the catalog quite tough. We should consider if this is a core competency of the platform or if we can leverage other projects to achieve our desired outcome.

Proposal

The catalog has two main functions:

Allow users to write data to parquet files to some storage backend, ensuring the schema and values are valid.
Allow users to query this set of parquet files to retrieve and merge data for backtesting.

This seems like a common use case in the broader python community currently, and I am quite confident we can achieve our stated goals above by leveraging some existing packages in the python/rust ecosystem. I'll be splitting the writing/reading of the parquet files as existing packages seem to split in a similar way.

Writing data

Users need to be able to take raw data (tick or bar data typically) from an external source (recorded / purchased / downloaded) and write it into one or many parquet files somewhere (local or remote storage) in the correct format to ensure they can be read again in a performant way for backtesting or research.

At a minimum, I believe we need the following features:

The ability to partition data easily (users may be writing TB of data)
The ability to write to local or remote storage (s3/gcs/etc)

We currently leverage pyarrow datasets for writing, which offers partitioning and writing to remote storage, but leaves some of the management of the catalog to us.

I am considering using something like plateau (built on top of pyarrow) which offers a few more features for working with datasets and builds on real world expertise working with datasets similar to ours. There are a couple of appealing features we don't currently have (or have attempting to implement but not very well) (taken from https://plateau.readthedocs.io/en/latest/spec/format_specification.html):

Consistent dataset state at all times
Dataset state is only modified by atomic commits
Inverted indices for fast query planing
Query planing with O(1) calls to the remote store
Lifecycle management (garbage collection, retention, etc.)

Plateau currently only works with DataFrames, so we would need to do some work to integrate with Nautilus objects, but this may be worth doing to get the above "for free".

Reading Data

For reading data, we are looking for the most performant way to read from a series of parquet files from a storage backend. Because streaming is a strong use case for backtesting, we ideally need to be able to also stream data from our storage, deserialize it and pass it into the engine. Currently we implement this reading/querying in the data catalog, but there are several projects of interest doing similar things that we could leverage. We need the following features:

Ability to read from remote storage
Ability to stream larger than memory data
Efficiently query data from the actual files
(Ideally) Merge time series data before we pass to the engine

The projects that come to mind are:

DuckDB - an in-process C++ OLAP database, with support for parquet files
DataFusion - A rust based query execution framework, built on top of arrow
Polars - Rust based DataFrame library
Plateau - also offers the ability to query

Each of these libraries are trying to achieve similar-ish goals to us, and could potentially be leveraged on the querying side. Ideally we would maintain a simple API over the top of one of these libraries, benefiting from their continued development and improvements to their querying framework. The idea being, we would ship off a query off to one of the libraries, which would then handle querying/filtering and return some sort of iterator of raw arrow bytes that we could convert (via rust) into nautilus objects.

Other considerations

I think we should strip out all DataFrame related functionality from the catalog (for now). We currently have two code paths, one for nautilus objects and one for DataFrames which are hard to maintain. This is typically used for research purposes, but there is plenty of support for reading parquet files into DataFrames (or converting nautilus objects to dicts and then DataFrames). The performance requirements are much lower for this than backtesting, so I don't think we should be too concerned with it for now.
I haven't really touched on live persistence (streaming to feather files). This is tangentially related to the catalog and could definitely be improved, but is not part of this proposal.

I'm keen to hear any thoughts or ideas on the above, or anything I may have missed or misunderstood.

rsmb7z commented 1 year ago

Hi @limx0 This is indeed long awaited. While we need rich features for data analysis and easy read/write, the main chocking point is Historical data - mainly not because of data size but because of number of files involved. Problem with the DuckDB and DataFusion for this exists although we get SQL queries benefit, although there is some discussion on the fix for it.

My suggestion will be as follows.

Historical Data (Reading), we have fixed data types (Ticks and Bars). These are huge in data and the core requirement for Backtest to function. For this part instead of waiting and relying on third-party tools, what we have already and performs well, reading through Pandas we simply have to add the partition path in this case instrument_id and Pandas can directly read from there. The typical filtering required here is the bar_type, start_time, and end_time. I have uploaded the modified Catalog https://github.com/rsmb7z/nautilus_trader/tree/catalog_pandas which works like this. It is not the clean code but does the job and can give idea.
Historical Data (Writing), pandas can directly handle partitioning without any issue. So we simply convert the Nautilus objects to dataframe and let Pandas do the rest - with the simpler helper function to add the Partition columns i.e. instrument_id.
Any other data types can be read/written like above as well however filtering may be bit complex for user as compared to queries, so on top of it we leverage here the DuckDB or DataFusion. As we introduce this, the user can read/write any data including Bars/Ticks using queries but the default method to read/load historical data can be as in step 1.

Some feedback and example usage for the modified catalog using Pandas as follows. It has some extra helper functions which may be added as feature in new catalog.

bar_type = BarType.from_str('EUR/USD.IDEALPRO-1-MINUTE-MID-EXTERNAL')
start_time = pd.Timestamp("2005-01-01")
end_time = pd.Timestamp("2022-12-31")
as_nautilus = False

# Load Bars for given BarType
perf_start = time.perf_counter()
bars = catalog.bars(
    instrument_ids=[bar_type.instrument_id.value],
    filter_expr=("bar_type", "=", str(bar_type)),
    as_nautilus=as_nautilus,
    start=start_time,
    end=end_time,
)
print(f"{as_nautilus=}, loaded {len(bars)} Bars, in {time.perf_counter() - perf_start:0.4f} seconds")
# RESULT: as_nautilus=False, loaded 6564442 Bars, in 8.4057 seconds
# RESULT: as_nautilus=True, loaded 6564442 Bars, in 89.0130 seconds

# Loads Bars parquet as QuoteTicks (loads BID and ASK bars and then convert to ticks)
perf_start = time.perf_counter()
ticks_data = catalog.bars_to_ticks(
    instrument_ids=[bar_type.instrument_id.value],
    start=start_time,
    end=end_time,
)
print(f"as_nautilus=True, loaded {len(ticks_data)} Ticks, in {time.perf_counter() - perf_start:0.4f} seconds")
# RESULT: as_nautilus=True, loaded 26252848 Ticks, in 117.1738 seconds

# Load Bars to given BarType from smaller timeframe, for directly loading to Backtest engine or other analysis
perf_start = time.perf_counter()
bar_type = BarType.from_str('EUR/USD.IDEALPRO-5-MINUTE-MID-EXTERNAL')
bars = catalog.bars_resample(
    instrument_ids=[bar_type.instrument_id.value],
    price_type=bar_type.spec.price_type,
    bar_delta=bar_type.spec.timedelta,
    as_nautilus=as_nautilus,
    start=start_time,
    end=end_time,
    raise_on_empty=False,
    source_bar_delta=pd.Timedelta(minutes=1)
)
print(f"{as_nautilus=}, loaded {len(bars)} Bars, in {time.perf_counter() - perf_start:0.4f} seconds")
# RESULT: as_nautilus=False, loaded 1312949 Bars, in 9.0769 seconds
# RESULT: as_nautilus=True, loaded 1312949 Bars, in 28.1769 seconds

limx0 commented 1 year ago

Hey @rsmb7z, thanks for the feedback.

Ultimately need to pass Nautilus objects (QuoteTick, Bar) into the backtest engine for backtesting, so we need some way to create those objects, as fast as possibly. The libraries we are considering allow reading Arrow format (aka parquet) directly, which allow us a zero-copy and no-type-conversion way to create these objects.

Taking the pandas route looks more performant above when compared with our current implementation of as_nautilus=True, but will be significantly slower than the libraries mentioned above as more of the core types move to rust because we will have to go from arrow -> pandas (python) -> pyarrow -> nautilus (rust) where each hop will potentially incur data copying or type conversions/casting. But don't take my word for it - I will generate some example benchmarking code, and will be sure to include pandas.

What we will be taking a look at, which you correctly mention above, is the number of files we generate, which has a bit impact on performance. We opted for trying to choose a one-size-fits-all approach, but I think in v2 we will pass more of this responsibility (and flexibility) onto the user, guiding them to best practices (for example: efficient_querying).

limx0 commented 1 year ago

Some totally contrived and very initial timings can be found in a notebook here.

More research required though regarding our requirements and actual benchmarks.

rsmb7z commented 1 year ago

We don't need to iterate files, is there particular reason to reach file by file? Can you try with singal request just like others:

result = pd.read_parquet(path=FOLDER_PATH)

Also setting pre_buffer=True improves performance further. Other acceptable arguments for read_parquet are at following links: https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html

limx0 commented 1 year ago

Nice thanks @rsmb7z - updated notebook here.

Pandas now on par with Polars for a simple read, but I imagine most of this will be lost when we do a proper benchmark of bytes -> list[QuoteTick] because the other libraries will be able to stay down in rust-land (where the QuoteTick struct is defined)

rsmb7z commented 1 year ago

After couple of more tests I see the DataFusion is leading the way. Leaving us to the following challenges:

Avoid reading root of class directory - when reading we need to set the FOLDER_PATH so it is of the partition i.e. bar.parquet/instrument_id or quote_tick.parquet/instrument_id etc then the performance will be astonishing. Once built-in handling of this is supported by DataFusion itself we may not have to maintain this.
Conversion to the Bars/Ticks. At the moment for Bars, I see when we convert pa.table > pandas > to_dict("records") > deserialize is much faster than pa.table > to_pydict > dict_of_lists_to_list_of_dicts > deserialize. It looks dict_of_lists_to_list_of_dicts is the choking point.

Harnas commented 1 year ago

What about move queries and conversion into Nautilus objects to separate thread/s? Then the backtest engine thread would receive ready data. I have no idea if it is worth implement this, it is just idea. It will also require implement prefetching mechanism.

Troubladore commented 1 year ago

Thanks Brad for opening this thread. Below are some of my thoughts on the topic.

I agree Nautilus should strive to leverage other projects for data management for sustainability. Generally speaking, as long as you adopt a single standard that is well-documented and well supported, we'll be okay. I've become a fan of the underlying Parquet storage format, and would be glad to see the project continue with it.

When I think about the catalog, I am primarily concerned with having 1 set of data that I can reuse for multiple purposes (input to Nautilus, direct data analytics, input to non-Nautilus projects). I get most of my data from a few providers (Polygon, others), and when we dealing with 10-15 years of tick/trade level data across 1000s of equities, it is challenging to store that data more than once. I can't afford to store it more than twice (1 raw copy + 1 curated copy). So, what I want is a storage format that allows me to store all the attributes I get from a provider (even ones Nautilus doesn't natively recognize as being part of a tick or bar) so the stored data can be natively consumed by Nautilus as well as other tools (DuckDB, DVC, etc).

It would be ideal if Nautilus natively provided a way to extend the base Nautilus dataset with my custom attributes. If I had to store extra attributes myself, the Nautilus data storage layout should be transparent enough for me to easily align my non-Nautilus data to my Nautilus-native data in a single dataset, and Nautilus should be tolerant of such extensions to the data schema (e.g. Nautilus should not break just because I added data elements to the catalog which it doesn't make use of).

It would help me if I could represent the catalog as a DVC-style data-registry. This would let me build repositories using DVC to index into the master catalog of data, allowing me to tier my storage according to what data gets used the most. It seems like Plateau and DVC could work in a complimentary fashion, but I'm not 100% on that since Plateau is new to me.

I like the way Plateau abstracts some of the data storage implementation and lets you work with factories and data frames.

Troubladore commented 1 year ago

Another question that has been in my mind is the handling of data with different frequencies - some features only change on a daily basis, others change at open/close of a market, etc. For that data, it is inefficient to embed it with higher frequency data (trade/tick/minute), so it would be good to have frequency-aligned catalogs that get 'merged' with other frequencies during a backtest simulation.

Today we've been thinking about accomplishing this with separate catalogs based on frequency, each loaded as a separate data config to the node, and subscribed to separately within our strategies (e.g. strategy is aware of data frequency, and subscribes appropriately).

Troubladore commented 1 year ago

More features I'd look for in a catalog:

Thread-safe reads/writes - I want to be able to read or write data to the catalog in parallel threads/processes, ideally for a single instrument, but minimally across separate instruments, so I can assemble catalogs efficiently on multi-core machines.
"Upsert" data adds - I'd like to be able to add arbitrary chunks of data to the catalog without worrying about how much of that data is there already, and without duplicating data (e.g. an option to just overwrite data if it is there).
Default storage organization - I'd like to be able to add arbitrary chunks of data to the catalog without worrying about how the data gets stored into files, because there is either a default storage organization scheme, or a customizable scheme which predefines where new data will go.

Some of these features may already exist in the catalog, I'm just trying to capture as many as I can remember wanting

limx0 commented 1 year ago

What about move queries and conversion into Nautilus objects to separate thread/s? Then the backtest engine thread would receive ready data. I have no idea if it is worth implement this, it is just idea. It will also require implement prefetching mechanism.

Hey @Harnas - yes this is something we will definitely look into, but I would like to focus on single-threaded performance first before we try and tackle this. It's also worth noting that much of what we do in the catalog is just to get the performance to an appropriate level (targeting low single digit microseconds to parse and merge ticks), because the vast majority of the time spend in a backtest will be user defined code in python. In my experience thus far though, I don't think users would notice a 2-3x speed up using threading if we can hit these benchmarks, because >90% of the backtest time will be spent running user code. We will definitely keep an eye on this though.

limx0 commented 1 year ago

Hey @Troubladore thanks for your thoughts - I'm going to comment on the main points/themes I can see with my thoughts:

Storage using a well known source / consuming data outside of nautilus / Having a single copy of data (aside from raw format)

We will definitely be striving for this. I will say that the data will be stored for optimal reading by nautilus (ie we will mangle prices to ints etc) so some transforms will be required to get it back to a "usable" format such as a Dataframe, but in my experience, these workflows (research etc) have a much lower bar for performance, so we can simple write some wrapper functions that undo any mangling we do, which I am confident from a performance perspective will not be noticed by any users.

the ability to store custom attributes from any data source / Adding my custom attributes to the data catalog.

This should currently be possible via subclassing of existing types (see BetfairTicker) or creating your own data types. Let me know if you have any more specific examples of this?

It would help me if I could represent the catalog as a DVC-style data-registry.

DVC is a great project, I believe it would be relatively easy though to integrate the two, but it won't be something we build into nautilus.

handling of data with different frequencies / different catalogs

This should be able to be handled in a single data catalog already. The only caveat may be bars of different intervals (I'm not 100% sure as I don't use them, but we may have fixed this recently) - but we should fix this, you should be able to store different bar types in a single catalog. I can't think of any other types where this is applicable (non of the tick types make sense in my mind), but happy to hear if there's any other specific issues on this front.

Thread-safe reads/writes - I want to be able to read or write data to the catalog in parallel threads/processes, ideally for a single instrument, but minimally across separate instruments, so I can assemble catalogs efficiently on multi-core machines.

Thread-safe reads should be fine, writes may be difficult and likely out of scope (depending on what we settle on the writing the catalog). You will likely have to come up with your own solution here, its not something we would want to maintain (again there are plenty of libraries that could help, and our time better spent improving the single thread performance I think).

"Upsert" data adds - I'd like to be able to add arbitrary chunks of data to the catalog without worrying about how much of that data is there already, and without duplicating data (e.g. an option to just overwrite data if it is there).

This currently exists in the catalog but will likely be removed, due to the maintenance burden; it is difficult to foresee the many types of data users may add to the catalog, and given we are building on well known and documented formats, I think this is best handled by the user.

Default storage organization - I'd like to be able to add arbitrary chunks of data to the catalog without worrying about how the data gets stored into files, because there is either a default storage organization scheme, or a customizable scheme which predefines where new data will go.

This is currently how data is written to the Catalog, and will remain this way. We will endeavour to provide a "best-effort" organisation format without any prior knowledge of the data users may write, but as per my comments above, users may benefit from tailoring or reorganising the Catalog to better suit their individual needs.

I hope I've covered everything there - let me know if not or if you have any follow up questions / comments.

limx0 commented 1 year ago

After couple of more tests I see the DataFusion is leading the way. Leaving us to the following challenges:

Avoid reading root of class directory - when reading we need to set the FOLDER_PATH so it is of the partition i.e. bar.parquet/instrument_id or quote_tick.parquet/instrument_id etc then the performance will be astonishing. Once built-in handling of this is supported by DataFusion itself we may not have to maintain this.

Conversion to the Bars/Ticks. At the moment for Bars, I see when we convert pa.table > pandas > to_dict("records") > deserialize is much faster than pa.table > to_pydict > dict_of_lists_to_list_of_dicts > deserialize. It looks dict_of_lists_to_list_of_dicts is the choking point.

Hey @rsmb7z - I'm not sure exactly what you mean by 1., but we do go straight to the correct partitional (bar/instrument_x) to read only the parquet files under the directory.

For 2., yes you're totally spot on; this is the bottleneck, which will soon be replaced by rust and cython (little or no python).

rsmb7z commented 1 year ago

Hey @rsmb7z - I'm not sure exactly what you mean by 1., but we do go straight to the correct partitional (bar/instrument_x) to read only the parquet files under the directory.

If this is so then perfect. I was referring to the old Catalog where it would lookup in the root and takes ages to go through all partitions even if when only specified partition (instrument_id) was specified.

For 2., yes you're totally spot on; this is the bottleneck, which will soon be replaced by rust and cython (little or no python).

Yes that shall give us significant performance improvements. Lets test once it is released.

limx0 commented 1 year ago

Hey @rsmb7z - I just realised that I misled you with my comment:

I'm not sure exactly what you mean by 1., but we do go straight to the correct partitional (bar/instrument_x) to read only the parquet files under the directory.

We used to do this, but then it got changed in a PR - I am going to re-implement it in an upcoming PR.

My apologies!

TianQingX commented 1 year ago

Dear limx0

Do you know a new release Dataframe database call ArticDB? It may help for your data management.

Regards, TianQing

limx0 commented 1 year ago

Hi @TianQingX - yes I'm aware of ArcticDB (and its predecessor Arctic), it's a great product and could be worth investigating for some users. The setup is not dissimilar to parquet datasets, but potentially with a few more bells and whistles. The downside from what I can see would be the cost of deserializing the ArcticDB format into nautilus objects, otherwise it seems very close to what we're looking for.

At this stage though, we have a good understanding of the arrow/parquet formats, and I think they fit out use case best.

bramo0 commented 11 months ago

Hello,

i am new to github, programming and nautilus. Interest: Trading, Deep learning, Distributed systems, Python and rust.

Above are mentions about storing and unify formats if i am correct. The one thing what i was always missing in storing data was orderbook data.

In the past i saw this rust project to store orderbook DATA just wanted to mensions this Should be a more efficient format because otherwise orderbook data becomes big fast tectonicdb https://github.com/0b01/tectonicdb Is it possible to store orderbook data in nautilus ?

Also interesting TimescaleDB https://github.com/timescale/timescaledb

Or i long term an influxdb connector

Kind Regards Bram

OnlyC commented 11 months ago

I'm using QuestDB to injest data from crypto exchanges, using InfluxDB Line Protocol to insert at really high throughput. Support Write-Ahead Log. You can query using PostgresSQL, support Aggregate, Sample functions or export to CSV, JSON through HTTP. Data are stored in partition files, can be backup easily. Data retention by time range. Easy to setup using Docker. Working with Dataframe: https://questdb.io/blog/loading-pandas-dataframes/

I just put it here as a good option if you guys consider a full features database. Naked files + simple API is still a better choice if we don't actually need all that luxury features.

nautechsystems / nautilus_trader

Catalog v2 #991

Background

Current State

Proposal

Writing data

Reading Data

Other considerations