nautechsystems / nautilus_trader

A high-performance algorithmic trading platform and event-driven backtester
https://nautilustrader.io
GNU Lesser General Public License v3.0

Streaming Backtest with multiple datasets time synchronization issue #1515

Closed dimitar-petrov closed 5 months ago

dimitar-petrov commented 8 months ago

Bug Report

Expected Behavior

Data from multiple datasets should be streamed following their temporal dependencies.

Actual Behavior

I created a streaming backtest engine with batch_size_bytes=2**17 and a DummyStrategy that logs the following statement whenever it receives any data:

self.log.debug("[MARK] event: {}".format(event.ts_init))
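
For context, a minimal sketch of what such a strategy could look like (the handler wiring and log format shown here are assumptions, and subscriptions in on_start are omitted):

#+begin_src jupyter-python
from nautilus_trader.trading.strategy import Strategy


class DummyStrategy(Strategy):
    """Logs the ts_init of every piece of market data received."""

    def on_quote_tick(self, tick):
        # Format assumed to match the "[MARK] <type> <ts_init>" parsing below
        self.log.debug("[MARK] {} {}".format(type(tick).__name__, tick.ts_init))

    def on_trade_tick(self, tick):
        self.log.debug("[MARK] {} {}".format(type(tick).__name__, tick.ts_init))

    def on_order_book_deltas(self, deltas):
        self.log.debug("[MARK] {} {}".format(type(deltas).__name__, deltas.ts_init))
#+end_src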

I noticed that ts_init is not monotonically increasing, which means the events' timestamps are not received in chronological order.

Checking the log file confirms this:

#+begin_src jupyter-python
ldf = pd.read_json("/tmp/nautilus.json", lines=True)
df = ldf.query("message.str.contains('\[MARK\]')").message.str.split(
    expand=True).rename(columns={1: "type", 2: "init_ts"}).astype({"init_ts": "int"})
df = df.set_index(pd.to_datetime(df["init_ts"]))[["type"]]
df.index.is_monotonic_increasing
#+end_src

#+RESULTS:
: False

I also patched backtest/node.py as follows to debug:

modified   nautilus_trader/backtest/node.py
@@ -300,8 +300,19 @@ class BacktestNode:

         # Stream data
         for chunk in session.to_query_result():
+            data = capsule_to_list(chunk)
+            df = pd.DataFrame(x.to_dict(x) for x in data)
+            df = (
+                df.assign(timestamp=pd.to_datetime(df["ts_init"]))
+                .groupby(["type"])
+                .agg({"timestamp": ["min", "max"], "type": "count"})
+                .droplevel(0, axis=1)
+            )
+
+            engine.logger.info("[Data Chunk] Stats:\n{}".format(str(df)))
             engine.add_data(
-                data=capsule_to_list(chunk),
+                data=data,
+                # data=capsule_to_list(chunk),
                 validate=False,  # Cannot validate mixed type stream
                 sort=True,  # Temporarily sorting  # Already sorted from kmerge
             )

Logs

2024-02-26T13:59:41.298463103Z [INFO] BACKTESTER-001.BacktestEngine: [Data Chunk] Stats:
                                   min                     max   count
type
OrderBookDelta 2023-11-30 23:59:49.004 2023-11-30 23:59:56.543  106066
QuoteTick      2023-12-01 00:00:00.012 2023-12-01 00:02:29.099   21192
TradeTick      2023-12-01 00:00:00.012 2023-12-01 00:02:29.099    3814

2024-02-26T13:59:41.306427965Z [INFO] BACKTESTER-001.BacktestEngine: Added 131,072 data elements.
2023-11-30T23:59:49.004000000Z [INFO] BACKTESTER-001.BacktestEngine:  BACKTEST RUN
2023-11-30T23:59:49.004000000Z [INFO] BACKTESTER-001.BacktestEngine: =================================================================
2023-11-30T23:59:49.004000000Z [INFO] BACKTESTER-001.BacktestEngine: Run config ID:  f0d8ecbde85c3adf31907c9ca4a0882f8ca3f4bad336a2922d4e08b318f77a93
2023-11-30T23:59:49.004000000Z [INFO] BACKTESTER-001.BacktestEngine: Run ID:         a5362370-5daa-4d1e-a062-c3cd7beca05a
2023-11-30T23:59:49.004000000Z [INFO] BACKTESTER-001.BacktestEngine: Run started:    2024-02-26 13:59:41.306421+00:00
2023-11-30T23:59:49.004000000Z [INFO] BACKTESTER-001.BacktestEngine: Backtest start: 2023-11-30 23:59:49.004000+00:00
2023-11-30T23:59:49.004000000Z [INFO] BACKTESTER-001.BacktestEngine: Batch start:    2023-11-30 23:59:49.004000+00:00
2023-11-30T23:59:49.004000000Z [INFO] BACKTESTER-001.BacktestEngine: Batch end:      2023-12-01 00:02:29.099000+00:00
2023-11-30T23:59:49.004000000Z [INFO] BACKTESTER-001.BacktestEngine: -----------------------------------------------------------------

2023-12-01T00:02:29.099000000Z [INFO] BACKTESTER-001.BacktestEngine: [Data Chunk] Stats:
                                   min                     max   count
type
OrderBookDelta 2023-11-30 23:59:56.543 2023-12-01 00:03:31.470  123333
QuoteTick      2023-12-01 00:02:29.104 2023-12-01 00:03:31.469    6795
TradeTick      2023-12-01 00:02:29.121 2023-12-01 00:03:31.469     944

2023-12-01T00:02:29.099000000Z [INFO] BACKTESTER-001.BacktestEngine: Added 131,072 data elements.
2023-11-30T23:59:49.004000000Z [INFO] BACKTESTER-001.BacktestEngine: =================================================================
2023-11-30T23:59:49.004000000Z [INFO] BACKTESTER-001.BacktestEngine:  BACKTEST RUN
2023-11-30T23:59:49.004000000Z [INFO] BACKTESTER-001.BacktestEngine: =================================================================
2023-11-30T23:59:49.004000000Z [INFO] BACKTESTER-001.BacktestEngine: Run config ID:  f0d8ecbde85c3adf31907c9ca4a0882f8ca3f4bad336a2922d4e08b318f77a93
2023-11-30T23:59:49.004000000Z [INFO] BACKTESTER-001.BacktestEngine: Run ID:         a5362370-5daa-4d1e-a062-c3cd7beca05a
2023-11-30T23:59:49.004000000Z [INFO] BACKTESTER-001.BacktestEngine: Run started:    2024-02-26 13:59:41.306421+00:00
2023-11-30T23:59:49.004000000Z [INFO] BACKTESTER-001.BacktestEngine: Backtest start: 2023-11-30 23:59:49.004000+00:00
2023-11-30T23:59:49.004000000Z [INFO] BACKTESTER-001.BacktestEngine: Batch start:    2023-11-30 23:59:49.004000+00:00
2023-11-30T23:59:49.004000000Z [INFO] BACKTESTER-001.BacktestEngine: Batch end:      2023-12-01 00:03:31.470000+00:00
2023-11-30T23:59:49.004000000Z [INFO] BACKTESTER-001.BacktestEngine: -----------------------------------------------------------------

2023-12-01T00:03:31.470000000Z [INFO] BACKTESTER-001.BacktestEngine: [Data Chunk] Stats:
                                   min                     max   count
type
OrderBookDelta 2023-12-01 00:00:23.417 2023-12-01 00:04:41.575  124205
QuoteTick      2023-12-01 00:03:31.470 2023-12-01 00:04:41.573    5704
TradeTick      2023-12-01 00:03:31.485 2023-12-01 00:04:41.572    1163
2023-12-01T00:03:31.470000000Z [INFO] BACKTESTER-001.BacktestEngine: Added 131,072 data elements.
2023-12-01T00:03:31.470000000Z [INFO] BACKTESTER-001.BacktestEngine: =================================================================
2023-12-01T00:03:31.470000000Z [INFO] BACKTESTER-001.BacktestEngine:  BACKTEST RUN
2023-12-01T00:03:31.470000000Z [INFO] BACKTESTER-001.BacktestEngine: =================================================================
2023-12-01T00:03:31.470000000Z [INFO] BACKTESTER-001.BacktestEngine: Run config ID:  f0d8ecbde85c3adf31907c9ca4a0882f8ca3f4bad336a2922d4e08b318f77a93
2023-12-01T00:03:31.470000000Z [INFO] BACKTESTER-001.BacktestEngine: Run ID:         a5362370-5daa-4d1e-a062-c3cd7beca05a
2023-12-01T00:03:31.470000000Z [INFO] BACKTESTER-001.BacktestEngine: Run started:    2024-02-26 13:59:41.306421+00:00
2023-12-01T00:03:31.470000000Z [INFO] BACKTESTER-001.BacktestEngine: Backtest start: 2023-11-30 23:59:49.004000+00:00
2023-12-01T00:03:31.470000000Z [INFO] BACKTESTER-001.BacktestEngine: Batch start:    2023-11-30 23:59:49.004000+00:00
2023-12-01T00:03:31.470000000Z [INFO] BACKTESTER-001.BacktestEngine: Batch end:      2023-12-01 00:04:41.575000+00:00
2023-12-01T00:03:31.470000000Z [INFO] BACKTESTER-001.BacktestEngine: -----------------------------------------------------------------

2023-12-01T00:04:41.575000000Z [INFO] BACKTESTER-001.BacktestEngine: [Data Chunk] Stats:
                                   min                     max   count
type
OrderBookDelta 2023-12-01 00:00:56.189 2023-12-01 00:04:59.995  126505
QuoteTick      2023-12-01 00:04:41.575 2023-12-01 00:04:59.574    4101
TradeTick      2023-12-01 00:04:41.609 2023-12-01 00:04:59.192     466
2023-12-01T00:04:41.575000000Z [INFO] BACKTESTER-001.BacktestEngine: Added 131,072 data elements.
2023-12-01T00:04:41.575000000Z [INFO] BACKTESTER-001.BacktestEngine: =================================================================
2023-12-01T00:04:41.575000000Z [INFO] BACKTESTER-001.BacktestEngine:  BACKTEST RUN
2023-12-01T00:04:41.575000000Z [INFO] BACKTESTER-001.BacktestEngine: =================================================================
2023-12-01T00:04:41.575000000Z [INFO] BACKTESTER-001.BacktestEngine: Run config ID:  f0d8ecbde85c3adf31907c9ca4a0882f8ca3f4bad336a2922d4e08b318f77a93
2023-12-01T00:04:41.575000000Z [INFO] BACKTESTER-001.BacktestEngine: Run ID:         a5362370-5daa-4d1e-a062-c3cd7beca05a
2023-12-01T00:04:41.575000000Z [INFO] BACKTESTER-001.BacktestEngine: Run started:    2024-02-26 13:59:41.306421+00:00
2023-12-01T00:04:41.575000000Z [INFO] BACKTESTER-001.BacktestEngine: Backtest start: 2023-11-30 23:59:49.004000+00:00
2023-12-01T00:04:41.575000000Z [INFO] BACKTESTER-001.BacktestEngine: Batch start:    2023-11-30 23:59:49.004000+00:00
2023-12-01T00:04:41.575000000Z [INFO] BACKTESTER-001.BacktestEngine: Batch end:      2023-12-01 00:04:59.995000+00:00
2023-11-30T23:59:49.004000000Z [INFO] BACKTESTER-001.BacktestEngine: -----------------------------------------------------------------

2023-12-01T00:04:59.995000000Z [INFO] BACKTESTER-001.BacktestEngine: [Data Chunk] Stats:
                                   min                     max  count
type
OrderBookDelta 2023-12-01 00:02:05.216 2023-12-01 00:02:29.624  35816
2023-12-01T00:04:59.995000000Z [INFO] BACKTESTER-001.BacktestEngine: Added 35,816 data elements.
2023-12-01T00:04:59.995000000Z [INFO] BACKTESTER-001.BacktestEngine: =================================================================
2023-12-01T00:04:59.995000000Z [INFO] BACKTESTER-001.BacktestEngine:  BACKTEST RUN
2023-12-01T00:04:59.995000000Z [INFO] BACKTESTER-001.BacktestEngine: =================================================================
2023-12-01T00:04:59.995000000Z [INFO] BACKTESTER-001.BacktestEngine: Run config ID:  f0d8ecbde85c3adf31907c9ca4a0882f8ca3f4bad336a2922d4e08b318f77a93
2023-12-01T00:04:59.995000000Z [INFO] BACKTESTER-001.BacktestEngine: Run ID:         a5362370-5daa-4d1e-a062-c3cd7beca05a
2023-12-01T00:04:59.995000000Z [INFO] BACKTESTER-001.BacktestEngine: Run started:    2024-02-26 13:59:41.306421+00:00
2023-12-01T00:04:59.995000000Z [INFO] BACKTESTER-001.BacktestEngine: Backtest start: 2023-11-30 23:59:49.004000+00:00
2023-12-01T00:04:59.995000000Z [INFO] BACKTESTER-001.BacktestEngine: Batch start:    2023-11-30 23:59:49.004000+00:00
2023-11-30T23:59:49.004000000Z [INFO] BACKTESTER-001.BacktestEngine: Batch end:      2023-12-01 00:04:59.995000+00:00
2023-11-30T23:59:49.004000000Z [INFO] BACKTESTER-001.BacktestEngine: -----------------------------------------------------------------

2024-02-26T13:59:55.914602964Z [INFO] BACKTESTER-001.BacktestEngine: =================================================================
2024-02-26T13:59:55.914606769Z [INFO] BACKTESTER-001.BacktestEngine:  BACKTEST POST-RUN
2024-02-26T13:59:55.914608755Z [INFO] BACKTESTER-001.BacktestEngine: =================================================================
2024-02-26T13:59:55.914610711Z [INFO] BACKTESTER-001.BacktestEngine: Run config ID:  f0d8ecbde85c3adf31907c9ca4a0882f8ca3f4bad336a2922d4e08b318f77a93
2024-02-26T13:59:55.914612511Z [INFO] BACKTESTER-001.BacktestEngine: Run ID:         a5362370-5daa-4d1e-a062-c3cd7beca05a
2024-02-26T13:59:55.914621010Z [INFO] BACKTESTER-001.BacktestEngine: Run started:    2024-02-26 13:59:41.306421+00:00
2024-02-26T13:59:55.914624941Z [INFO] BACKTESTER-001.BacktestEngine: Run finished:   2024-02-26 13:59:55.914550+00:00
2024-02-26T13:59:55.914642243Z [INFO] BACKTESTER-001.BacktestEngine: Elapsed time:   0 days 00:00:14.608129
2024-02-26T13:59:55.914648343Z [INFO] BACKTESTER-001.BacktestEngine: Backtest start: 2023-11-30 23:59:49.004000+00:00
2024-02-26T13:59:55.914652710Z [INFO] BACKTESTER-001.BacktestEngine: Backtest end:   2023-12-01 00:04:59.995000+00:00
2024-02-26T13:59:55.914660087Z [INFO] BACKTESTER-001.BacktestEngine: Backtest range: 0 days 00:05:10.991000
2024-02-26T13:59:55.914662542Z [INFO] BACKTESTER-001.BacktestEngine: Iterations: 1,870,824
2024-02-26T13:59:55.914664736Z [INFO] BACKTESTER-001.BacktestEngine: Total events: 0
2024-02-26T13:59:55.914668564Z [INFO] BACKTESTER-001.BacktestEngine: Total orders: 0
2024-02-26T13:59:55.914675042Z [INFO] BACKTESTER-001.BacktestEngine: Total positions: 0

How are data sources with different volumes synchronized when passed to the engine? I can see that DataBackendSession is queried with a specific chunk size, but is the Rust side of that object taking care of synchronizing the multiple datasets?


cjdsellers commented 8 months ago

Hi @dimitar-petrov

Many thanks for the report.

We have a k-merge algorithm under the hood in the Rust backend session, which should ensure non-decreasing timestamps as it reads from multiple parquet sources.
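
For illustration only (the real implementation is in Rust; this is just the concept): a k-way merge over per-source sorted streams yields globally non-decreasing timestamps, analogous to Python's heapq.merge keyed on ts_init. A sketch with made-up records:

#+begin_src jupyter-python
import heapq

# Each source yields records already sorted by ts_init (as the parquet files should be)
stream_a = iter([{"type": "QuoteTick", "ts_init": 1}, {"type": "QuoteTick", "ts_init": 5}])
stream_b = iter([{"type": "TradeTick", "ts_init": 2}, {"type": "TradeTick", "ts_init": 4}])
stream_c = iter([{"type": "OrderBookDelta", "ts_init": 3}])

# heapq.merge pops the smallest head across all streams, so the merged output
# is globally non-decreasing in ts_init, provided every input stream is sorted
merged = heapq.merge(stream_a, stream_b, stream_c, key=lambda r: r["ts_init"])
print([r["ts_init"] for r in merged])  # [1, 2, 3, 4, 5]
#+end_src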

It's possible that this is a timestamping issue with the logging and not the data. Are you able to log the actual ts_init of the data, or add an assertion in your strategy to check that data isn't being received out of order?

Also, we generally only investigate bug reports on the latest stable version, which is currently 1.188.0, and I notice you're on what looks like 1.175.0 (and there have been a tonne of fixes, especially in the last release).

dimitar-petrov commented 8 months ago

Hi @cjdsellers,

It's possible that this is a timestamping issue with the logging and not the data. Are you able to log the actual ts_init of the data, or add an assertion in your strategy to check that data isn't being received out of order?

I am actually logging ts_init, but you are correct that logging can still print out of order. I will add the assertion in the strategy; that sounds like a great idea.

Also, we generally only investigate bug reports on the latest stable version, which is currently 1.188.0, and I notice you're on what looks like 1.175.0 (and there have been a tonne of fixes, especially in the last release).

Will test on the latest stable and update.

Thanks for the quick response.

dimitar-petrov commented 8 months ago

@cjdsellers,

I have pulled the latest stable but the issue persists in my env. I took the recommended route and asserted on ts_init in the strategy.
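
A minimal sketch of the kind of check used (the class name and wiring here are illustrative, not the actual strategy code):

#+begin_src jupyter-python
from collections import defaultdict

from nautilus_trader.trading.strategy import Strategy


class MonotonicCheckStrategy(Strategy):  # hypothetical name
    """Counts events whose ts_init goes backwards relative to the previous event."""

    def on_start(self):
        self._last_ts_init = 0
        self._out_of_order = defaultdict(int)

    def _check(self, data):
        if data.ts_init < self._last_ts_init:
            self._out_of_order[type(data).__name__] += 1
        self._last_ts_init = data.ts_init

    def on_quote_tick(self, tick):
        self._check(tick)

    def on_trade_tick(self, tick):
        self._check(tick)

    def on_order_book_deltas(self, deltas):
        self._check(deltas)

    def on_stop(self):
        self.log.info("Received Events Out Of Order: {}".format(self._out_of_order))
#+end_src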

Streaming Results

Received Events Out Of Order: 
defaultdict(<class 'int'>, {'OrderBookDeltas': 1420567, 'TradeTick': 18963, 'QuoteTick': 111175})

Results Without Streaming

Received Events Out Of Order: defaultdict(<class 'int'>, {})

To me it looks like the k-merge algo does not work as expected, but I am open to any recommendations of things to check in my env.

Please take a look at the patch of node.py I shared above and look at just the first two batches of data: I am logging the stats of the data coming from the k-merge in Rust with the [Data Chunk] Stats marker. Each chunk is grouped by data type, and the min and max timestamps are printed before the data is passed to the backtest engine.

=================================================================
 BACKTEST RUN
=================================================================
Run config ID:  f0d8ecbde85c3adf31907c9ca4a0882f8ca3f4bad336a2922d4e08b318f77a93
Run ID:         c3fbb5b8-ee03-453c-93a2-a09fc75690a8
Run started:    2024-02-27 12:37:32.706503+00:00
Backtest start: 2023-11-30 23:59:49.004000+00:00
Batch start:    2023-11-30 23:59:49.004000+00:00
Batch end:      2023-12-01 00:01:24.562000+00:00
-----------------------------------------------------------------
[Data Chunk] Stats:
                                   min                     max   count
type
OrderBookDelta 2023-11-30 23:59:58.028 2023-12-01 00:03:25.881  113756
QuoteTick      2023-12-01 00:01:24.563 2023-12-01 00:03:25.852   15160
TradeTick      2023-12-01 00:01:24.718 2023-12-01 00:03:25.698    2156
Added 131,072 data elements.

=================================================================
 BACKTEST RUN
=================================================================
Run config ID:  f0d8ecbde85c3adf31907c9ca4a0882f8ca3f4bad336a2922d4e08b318f77a93
Run ID:         c3fbb5b8-ee03-453c-93a2-a09fc75690a8
Run started:    2024-02-27 12:37:32.706503+00:00
Backtest start: 2023-11-30 23:59:49.004000+00:00
Batch start:    2023-11-30 23:59:49.004000+00:00
Batch end:      2023-12-01 00:03:25.881000+00:00
-----------------------------------------------------------------
[Data Chunk] Stats:
                                   min                     max   count
type
OrderBookDelta 2023-12-01 00:00:24.711 2023-12-01 00:04:38.362  124762
QuoteTick      2023-12-01 00:03:25.998 2023-12-01 00:04:38.354    5537
TradeTick      2023-12-01 00:03:25.998 2023-12-01 00:04:38.148     773

Added 131,072 data elements.

First, you can see from the max timestamp of the first batch and the min timestamp of the second batch of OrderBookDelta that overlapping (duplicate) data is being passed to the engine.

Second, I checked my catalog: between the min and max timestamps of the first batch of OrderBookDelta there are more records in the catalog than the 113756 indicated by the first batch:

#+begin_src jupyter-python
instrument_str = "BTCUSDT-PERP.BINANCE"

START = "2023-11-30 23:59:58.028"
END = "2023-12-01 00:03:25.881"

start = dt_to_unix_nanos(pd.Timestamp(START, tz="UTC"))
end = dt_to_unix_nanos(pd.Timestamp(END, tz="UTC"))
inst_id = InstrumentId.from_str(instrument_str)

len(catalog.order_book_deltas(instrument_ids=[inst_id], start=start, end=end))
#+end_src

#+RESULTS:
: 300027
dimitar-petrov commented 8 months ago

I made sure that the data is sorted in the underlying parquet files, and it looks like all data is monotonically increasing by ts_init.

Quote and trade tick data are in single parquet files partitioned by date:

|   | monotonic | dtype      | count    | start                   | end                     |
|---+-----------+------------+----------+-------------------------+-------------------------|
| 0 | True      | quote_tick | 16942636 | 2023-12-01 00:00:00.012 | 2023-12-01 23:59:59.983 |
| 1 | True      | trade_tick | 3053861  | 2023-12-01 00:00:00.012 | 2023-12-01 23:59:59.863 |

Order book deltas are partitioned daily and into chunks of 5 million rows; a sketch of how this summary can be produced follows the table.

|           | monotonic | dtype            | count   | start                   | end                     |
|-----------+-----------+------------------+---------+-------------------------+-------------------------|
| partition |           |                  |         |                         |                         |
| 0         | True      | order_book_delta | 96066   | 2023-11-30 23:59:49.004 | 2023-11-30 23:59:49.004 |
| 1         | True      | order_book_delta | 5000000 | 2023-11-30 23:59:49.004 | 2023-12-01 01:10:36.794 |
| 2         | True      | order_book_delta | 5000000 | 2023-12-01 01:10:36.794 | 2023-12-01 01:41:37.191 |
| 3         | True      | order_book_delta | 5000000 | 2023-12-01 01:41:37.191 | 2023-12-01 02:25:59.422 |
| 4         | True      | order_book_delta | 5000000 | 2023-12-01 02:25:59.423 | 2023-12-01 03:16:09.407 |
| 5         | True      | order_book_delta | 5000000 | 2023-12-01 03:16:09.407 | 2023-12-01 04:04:04.460 |
| 6         | True      | order_book_delta | 5000000 | 2023-12-01 04:04:04.460 | 2023-12-01 04:36:56.665 |
| 7         | True      | order_book_delta | 5000000 | 2023-12-01 04:36:56.666 | 2023-12-01 05:17:05.688 |
| 8         | True      | order_book_delta | 5000000 | 2023-12-01 05:17:05.688 | 2023-12-01 06:13:43.787 |
| 9         | True      | order_book_delta | 5000000 | 2023-12-01 06:13:43.789 | 2023-12-01 07:08:17.554 |
| 10        | True      | order_book_delta | 5000000 | 2023-12-01 07:08:17.554 | 2023-12-01 07:40:42.107 |
| 11        | True      | order_book_delta | 5000000 | 2023-12-01 07:40:42.107 | 2023-12-01 08:23:30.215 |
| 12        | True      | order_book_delta | 5000000 | 2023-12-01 08:23:30.215 | 2023-12-01 09:11:06.705 |
| 13        | True      | order_book_delta | 5000000 | 2023-12-01 09:11:06.705 | 2023-12-01 09:26:47.787 |
| 14        | True      | order_book_delta | 5000000 | 2023-12-01 09:26:47.787 | 2023-12-01 09:52:58.326 |
| 15        | True      | order_book_delta | 5000000 | 2023-12-01 09:52:58.326 | 2023-12-01 10:15:06.591 |
| 16        | True      | order_book_delta | 5000000 | 2023-12-01 10:15:06.591 | 2023-12-01 10:51:18.910 |
| 17        | True      | order_book_delta | 5000000 | 2023-12-01 10:51:18.911 | 2023-12-01 11:35:11.101 |
| 18        | True      | order_book_delta | 5000000 | 2023-12-01 11:35:11.101 | 2023-12-01 12:25:41.563 |
| 19        | True      | order_book_delta | 5000000 | 2023-12-01 12:25:41.563 | 2023-12-01 12:54:22.196 |
| 20        | True      | order_book_delta | 5000000 | 2023-12-01 12:54:22.196 | 2023-12-01 13:34:30.810 |
| 21        | True      | order_book_delta | 5000000 | 2023-12-01 13:34:30.812 | 2023-12-01 14:12:09.654 |
| 22        | True      | order_book_delta | 5000000 | 2023-12-01 14:12:09.654 | 2023-12-01 14:45:46.379 |
| 23        | True      | order_book_delta | 5000000 | 2023-12-01 14:45:46.379 | 2023-12-01 15:21:04.593 |
| 24        | True      | order_book_delta | 5000000 | 2023-12-01 15:21:04.594 | 2023-12-01 16:01:32.754 |
| 25        | True      | order_book_delta | 5000000 | 2023-12-01 16:01:32.756 | 2023-12-01 16:28:20.177 |
| 26        | True      | order_book_delta | 5000000 | 2023-12-01 16:28:20.177 | 2023-12-01 16:43:24.040 |
| 27        | True      | order_book_delta | 5000000 | 2023-12-01 16:43:24.040 | 2023-12-01 17:05:32.837 |
| 28        | True      | order_book_delta | 5000000 | 2023-12-01 17:05:32.838 | 2023-12-01 17:33:47.965 |
| 29        | True      | order_book_delta | 5000000 | 2023-12-01 17:33:47.970 | 2023-12-01 18:10:57.296 |
| 30        | True      | order_book_delta | 5000000 | 2023-12-01 18:10:57.296 | 2023-12-01 18:46:53.795 |
| 31        | True      | order_book_delta | 5000000 | 2023-12-01 18:46:53.795 | 2023-12-01 19:24:08.298 |
| 32        | True      | order_book_delta | 5000000 | 2023-12-01 19:24:08.298 | 2023-12-01 20:10:26.276 |
| 33        | True      | order_book_delta | 5000000 | 2023-12-01 20:10:26.279 | 2023-12-01 20:59:14.398 |
| 34        | True      | order_book_delta | 5000000 | 2023-12-01 20:59:14.398 | 2023-12-01 21:55:34.699 |
| 35        | True      | order_book_delta | 5000000 | 2023-12-01 21:55:34.699 | 2023-12-01 23:02:46.255 |
| 36        | True      | order_book_delta | 3894819 | 2023-12-01 23:02:46.256 | 2023-12-01 23:59:49.003 |
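
Such a per-partition summary can be produced with plain pandas; the catalog path and file layout below are assumptions:

#+begin_src jupyter-python
from pathlib import Path

import pandas as pd

# Hypothetical catalog location; adjust to your environment
part_dir = Path("~/tmp/catalog/data/order_book_delta/BTCUSDT-PERP.BINANCE").expanduser()

rows = []
for part in sorted(part_dir.glob("part-*.parquet")):
    ts = pd.read_parquet(part, columns=["ts_init"])["ts_init"]
    rows.append({
        "monotonic": ts.is_monotonic_increasing,
        "dtype": "order_book_delta",
        "count": len(ts),
        "start": pd.to_datetime(ts.min()),  # ts_init is UNIX nanoseconds
        "end": pd.to_datetime(ts.max()),
    })
pd.DataFrame(rows)
#+end_src
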
cjdsellers commented 8 months ago

@dimitar-petrov

Thank you for the detailed report, we'll look into this.

dimitar-petrov commented 8 months ago

Update: @cjdsellers, I noticed that reading order_book_deltas from the catalog does not always come back sorted.

#+begin_src jupyter-python
%%time
def get_data(catalog, instrument_str, start, end):
    start = dt_to_unix_nanos(pd.Timestamp(start, tz="UTC"))
    end = dt_to_unix_nanos(pd.Timestamp(end, tz="UTC"))
    inst_id = InstrumentId.from_str(instrument_str)

    quotes = catalog.quote_ticks(instrument_ids=[inst_id], start=start, end=end)
    trades = catalog.trade_ticks(instrument_ids=[inst_id], start=start, end=end)
    deltas = catalog.order_book_deltas(instrument_ids=[inst_id], start=start, end=end)

    return quotes, trades, deltas

stats = []
for i in range(10):
    quotes, trades, deltas = get_data(catalog, instrument_str, START, END)

    qdf = pd.DataFrame(q.to_dict(q) for q in quotes)
    tdf = pd.DataFrame(t.to_dict(t) for t in trades)
    odf = pd.DataFrame(o.to_dict(o) for o in deltas)

    stats.append({
        "trades_size": len(tdf),
        "trades_sorted": tdf.ts_init.is_monotonic_increasing,
        "quode_size": len(qdf),
        "quote_sorted": qdf.ts_init.is_monotonic_increasing,
        "deltas_size": len(odf),
        "deltas_sorted": odf.ts_init.is_monotonic_increasing,
         }
    )
pd.DataFrame(stats)
#+end_src

#+RESULTS:
:RESULTS:
: CPU times: user 1.46 s, sys: 21.8 ms, total: 1.48 s
: Wall time: 1.43 s
|   | trades_size | trades_sorted | quote_size | quote_sorted | deltas_size | deltas_sorted |
|---+-------------+---------------+------------+--------------+-------------+---------------|
| 0 | 170         | True          | 3180       | True         | 16897       | False         |
| 1 | 170         | True          | 3180       | True         | 16897       | True          |
| 2 | 170         | True          | 3180       | True         | 16897       | False         |
| 3 | 170         | True          | 3180       | True         | 16897       | False         |
| 4 | 170         | True          | 3180       | True         | 16897       | True          |
| 5 | 170         | True          | 3180       | True         | 16897       | False         |
| 6 | 170         | True          | 3180       | True         | 16897       | False         |
| 7 | 170         | True          | 3180       | True         | 16897       | False         |
| 8 | 170         | True          | 3180       | True         | 16897       | False         |
| 9 | 170         | True          | 3180       | True         | 16897       | False         |
:END:

When reading the file directly with either polars or pandas, the ts_init column is sorted.

#+begin_src jupyter-python
from collections import Counter

import pandas as pd
import polars as pl

path = "~/tmp/catalog/data/order_book_delta/ARDRUSDT.BINANCE/part-0.parquet"

monotonic = []
for i in range(50):
    monotonic.append(
        pl.read_parquet(path, columns=list(pl.read_parquet_schema(path).keys()))
        .to_pandas()
        .ts_init.is_monotonic_increasing
    )

    monotonic.append(
        pd.read_parquet(path).ts_init.is_monotonic_increasing
    )
Counter(monotonic)
#+end_src

#+RESULTS:
: Counter({True: 100})

Hope that helps. Let me know if you manage to reproduce it.

dimitar-petrov commented 8 months ago

@twitu Let me know if I can assist with anything regarding the investigation and resolution of this issue.

twitu commented 8 months ago

Hi @dimitar-petrov, thanks for the detailed report. What dataset are you using? Is it possible for you to share the data that's causing this issue? Since the repo datasets don't cause this issue in the automated tests, it's difficult to reproduce.

You can also reach out on Discord and share the dataset there if that's more comfortable.

dimitar-petrov commented 8 months ago

@twitu, I prepared a smaller dataset and am ready to ship it to you on Discord.

twitu commented 5 months ago

Thank you for being patient, @dimitar-petrov. There was a tricky configuration issue with how DataFusion was reading large files, which was leading to this weird behavior.
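
In other words (a conceptual sketch, not the actual fix): a k-merge can only guarantee globally sorted output if every input stream is itself sorted, so a reader that surfaces rows from a large file out of order breaks the merged stream:

#+begin_src jupyter-python
import heapq

sorted_stream = iter([1, 3, 5, 7])
# A large file scanned in out-of-order batches, even though it is sorted on disk:
scrambled_stream = iter([4, 2, 6])

merged = list(heapq.merge(sorted_stream, scrambled_stream))
print(merged)                    # [1, 3, 4, 2, 5, 6, 7]
print(merged == sorted(merged))  # False
#+end_src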

https://github.com/nautechsystems/nautilus_trader/pull/1656

This PR has the fix. Please give it a try and let us know if things are working properly.