Hi @dimitar-petrov
Many thanks for the report.
We have a k-merge algorithm under the hood in the Rust backend session which should ensure non-decreasing timestamps as it reads from multiple parquet sources.
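For illustration only, the guarantee such a k-way merge is expected to provide can be sketched in Python with heapq.merge; this is not the Rust implementation, just the invariant it should uphold:
#+begin_src jupyter-python
import heapq

# Each source (one per parquet file) yields records already sorted by ts_init.
deltas = [(1, "OrderBookDelta"), (4, "OrderBookDelta"), (9, "OrderBookDelta")]
quotes = [(2, "QuoteTick"), (5, "QuoteTick"), (8, "QuoteTick")]
trades = [(3, "TradeTick"), (7, "TradeTick")]

# A k-way merge preserves global ordering as long as every input is sorted.
merged = list(heapq.merge(deltas, quotes, trades, key=lambda r: r[0]))

# The merged stream must be non-decreasing in ts_init.
assert all(a[0] <= b[0] for a, b in zip(merged, merged[1:]))
#+end_src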
It's possible that this is a timestamping issue with the logging and not the data. Are you able to log the actual ts_init of the data, or add an assertion in your strategy to check that data isn't being received out of order?
Also, we generally only investigate bug reports on the latest stable version which is currently 1.188.0, and I notice you're back on what looks like 1.175.0 (and there have been a tonne of fixes, especially in the last release).
Hi @cjdsellers,
It's possible that this is a timestamping issue with the logging and not the data. Are you able to log the actual ts_init of the data, or add an assertion in your strategy to check that data isn't being received out of order?
I am actually logging ts_init, but you are correct that the logging can still print out of order. I will add the assertion in the strategy; that sounds like a great idea.
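Something along these lines, as a minimal sketch (the class and handler wiring here are illustrative, not the actual strategy from this report):
#+begin_src jupyter-python
from collections import defaultdict

from nautilus_trader.trading.strategy import Strategy


class OrderingCheckStrategy(Strategy):
    """Counts any event whose ts_init goes backwards relative to the previous one."""

    def __init__(self, config=None):
        super().__init__(config)
        self._last_ts_init = 0
        self.out_of_order = defaultdict(int)

    def _check(self, data) -> None:
        if data.ts_init < self._last_ts_init:
            self.out_of_order[type(data).__name__] += 1
        self._last_ts_init = data.ts_init

    # Subscriptions are omitted here; each handler just feeds the same check.
    def on_order_book_deltas(self, deltas) -> None:
        self._check(deltas)

    def on_quote_tick(self, tick) -> None:
        self._check(tick)

    def on_trade_tick(self, tick) -> None:
        self._check(tick)
#+end_src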
Also, we generally only investigate bug reports on the latest stable version which is currently 1.188.0, and I notice you're back on what looks like 1.175.0 (and there have been a tonne of fixes, especially in the last release).
Will test on the latest stable and update.
Thanks for the quick response.
@cjdsellers,
I have pulled the latest stable but the issue persists in my environment.
I took the recommended route and asserted on ts_init ordering in the strategy.
Received Events Out Of Order:
defaultdict(<class 'int'>, {'OrderBookDeltas': 1420567, 'TradeTick': 18963, 'QuoteTick': 111175})
Received Events Out Of Order: defaultdict(<class 'int'>, {})
To me it looks like the k-merge algorithm does not work as expected, but I am open to any recommendations for things to check in my environment.
Please take a look at the patch of node.py I have shared above, and look at just the first two batches of data:
I am logging the stats of the data coming from the Rust k-merge with a [Data Chunk] Stats marker.
Each chunk is grouped by data type, and the min and max timestamps are printed before the data is passed to the backtest engine.
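For context, that stats logging amounts to something like the following (a rough sketch; the actual node.py patch is not reproduced in this thread):
#+begin_src jupyter-python
import pandas as pd


def log_chunk_stats(chunk) -> None:
    """Group a data chunk by type and print min/max ts_init and count per type."""
    df = pd.DataFrame(
        {"type": type(obj).__name__, "ts_init": obj.ts_init} for obj in chunk
    )
    df["ts"] = pd.to_datetime(df["ts_init"], unit="ns", utc=True)
    stats = df.groupby("type")["ts"].agg(["min", "max", "count"])
    print(f"[Data Chunk] Stats:\n{stats}")
#+end_src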
=================================================================
BACKTEST RUN
=================================================================
Run config ID: f0d8ecbde85c3adf31907c9ca4a0882f8ca3f4bad336a2922d4e08b318f77a93
Run ID: c3fbb5b8-ee03-453c-93a2-a09fc75690a8
Run started: 2024-02-27 12:37:32.706503+00:00
Backtest start: 2023-11-30 23:59:49.004000+00:00
Batch start: 2023-11-30 23:59:49.004000+00:00
Batch end: 2023-12-01 00:01:24.562000+00:00
-----------------------------------------------------------------
[Data Chunk] Stats:
min max count
type
OrderBookDelta 2023-11-30 23:59:58.028 2023-12-01 00:03:25.881 113756
QuoteTick 2023-12-01 00:01:24.563 2023-12-01 00:03:25.852 15160
TradeTick 2023-12-01 00:01:24.718 2023-12-01 00:03:25.698 2156
Added 131,072 data elements.
=================================================================
BACKTEST RUN
=================================================================
Run config ID: f0d8ecbde85c3adf31907c9ca4a0882f8ca3f4bad336a2922d4e08b318f77a93
Run ID: c3fbb5b8-ee03-453c-93a2-a09fc75690a8
Run started: 2024-02-27 12:37:32.706503+00:00
Backtest start: 2023-11-30 23:59:49.004000+00:00
Batch start: 2023-11-30 23:59:49.004000+00:00
Batch end: 2023-12-01 00:03:25.881000+00:00
-----------------------------------------------------------------
[Data Chunk] Stats:
min max count
type
OrderBookDelta 2023-12-01 00:00:24.711 2023-12-01 00:04:38.362 124762
QuoteTick 2023-12-01 00:03:25.998 2023-12-01 00:04:38.354 5537
TradeTick 2023-12-01 00:03:25.998 2023-12-01 00:04:38.148 773
Added 131,072 data elements.
First, you can notice from the max timestamp of the first batch and the min timestamp of the second batch of OrderBookDeltas that duplicate data is being passed to the engine.
Second, I checked my catalog: between the min and max timestamps of the first OrderBookDelta batch, the catalog holds 300027 records, more than the 113756 reported for that batch:
#+begin_src jupyter-python
import pandas as pd

from nautilus_trader.core.datetime import dt_to_unix_nanos
from nautilus_trader.model.identifiers import InstrumentId

# `catalog` is the ParquetDataCatalog instance used throughout this session.
instrument_str = "BTCUSDT-PERP.BINANCE"
START = "2023-11-30 23:59:58.028"
END = "2023-12-01 00:03:25.881"

start = dt_to_unix_nanos(pd.Timestamp(START, tz="UTC"))
end = dt_to_unix_nanos(pd.Timestamp(END, tz="UTC"))
inst_id = InstrumentId.from_str(instrument_str)
len(catalog.order_book_deltas(instrument_ids=[inst_id], start=start, end=end))
#+end_src
#+RESULTS:
: 300027
I made sure that the data is sorted in the underlying parquet files, and all of it appears to be monotonically increasing by ts_init.
Quote and trade tick data is stored in single parquet files, partitioned by date:
| | monotonic | dtype | count | start | end |
|---+-----------+------------+----------+-------------------------+-------------------------|
| 0 | True | quote_tick | 16942636 | 2023-12-01 00:00:00.012 | 2023-12-01 23:59:59.983 |
| 1 | True | trade_tick | 3053861 | 2023-12-01 00:00:00.012 | 2023-12-01 23:59:59.863 |
Order book deltas are partitioned daily and in chunks of 5 million rows (a sketch of the per-partition check follows the table below):
| | monotonic | dtype | count | start | end |
|-----------+-----------+------------------+---------+-------------------------+-------------------------|
| partition | | | | | |
| 0 | True | order_book_delta | 96066 | 2023-11-30 23:59:49.004 | 2023-11-30 23:59:49.004 |
| 1 | True | order_book_delta | 5000000 | 2023-11-30 23:59:49.004 | 2023-12-01 01:10:36.794 |
| 2 | True | order_book_delta | 5000000 | 2023-12-01 01:10:36.794 | 2023-12-01 01:41:37.191 |
| 3 | True | order_book_delta | 5000000 | 2023-12-01 01:41:37.191 | 2023-12-01 02:25:59.422 |
| 4 | True | order_book_delta | 5000000 | 2023-12-01 02:25:59.423 | 2023-12-01 03:16:09.407 |
| 5 | True | order_book_delta | 5000000 | 2023-12-01 03:16:09.407 | 2023-12-01 04:04:04.460 |
| 6 | True | order_book_delta | 5000000 | 2023-12-01 04:04:04.460 | 2023-12-01 04:36:56.665 |
| 7 | True | order_book_delta | 5000000 | 2023-12-01 04:36:56.666 | 2023-12-01 05:17:05.688 |
| 8 | True | order_book_delta | 5000000 | 2023-12-01 05:17:05.688 | 2023-12-01 06:13:43.787 |
| 9 | True | order_book_delta | 5000000 | 2023-12-01 06:13:43.789 | 2023-12-01 07:08:17.554 |
| 10 | True | order_book_delta | 5000000 | 2023-12-01 07:08:17.554 | 2023-12-01 07:40:42.107 |
| 11 | True | order_book_delta | 5000000 | 2023-12-01 07:40:42.107 | 2023-12-01 08:23:30.215 |
| 12 | True | order_book_delta | 5000000 | 2023-12-01 08:23:30.215 | 2023-12-01 09:11:06.705 |
| 13 | True | order_book_delta | 5000000 | 2023-12-01 09:11:06.705 | 2023-12-01 09:26:47.787 |
| 14 | True | order_book_delta | 5000000 | 2023-12-01 09:26:47.787 | 2023-12-01 09:52:58.326 |
| 15 | True | order_book_delta | 5000000 | 2023-12-01 09:52:58.326 | 2023-12-01 10:15:06.591 |
| 16 | True | order_book_delta | 5000000 | 2023-12-01 10:15:06.591 | 2023-12-01 10:51:18.910 |
| 17 | True | order_book_delta | 5000000 | 2023-12-01 10:51:18.911 | 2023-12-01 11:35:11.101 |
| 18 | True | order_book_delta | 5000000 | 2023-12-01 11:35:11.101 | 2023-12-01 12:25:41.563 |
| 19 | True | order_book_delta | 5000000 | 2023-12-01 12:25:41.563 | 2023-12-01 12:54:22.196 |
| 20 | True | order_book_delta | 5000000 | 2023-12-01 12:54:22.196 | 2023-12-01 13:34:30.810 |
| 21 | True | order_book_delta | 5000000 | 2023-12-01 13:34:30.812 | 2023-12-01 14:12:09.654 |
| 22 | True | order_book_delta | 5000000 | 2023-12-01 14:12:09.654 | 2023-12-01 14:45:46.379 |
| 23 | True | order_book_delta | 5000000 | 2023-12-01 14:45:46.379 | 2023-12-01 15:21:04.593 |
| 24 | True | order_book_delta | 5000000 | 2023-12-01 15:21:04.594 | 2023-12-01 16:01:32.754 |
| 25 | True | order_book_delta | 5000000 | 2023-12-01 16:01:32.756 | 2023-12-01 16:28:20.177 |
| 26 | True | order_book_delta | 5000000 | 2023-12-01 16:28:20.177 | 2023-12-01 16:43:24.040 |
| 27 | True | order_book_delta | 5000000 | 2023-12-01 16:43:24.040 | 2023-12-01 17:05:32.837 |
| 28 | True | order_book_delta | 5000000 | 2023-12-01 17:05:32.838 | 2023-12-01 17:33:47.965 |
| 29 | True | order_book_delta | 5000000 | 2023-12-01 17:33:47.970 | 2023-12-01 18:10:57.296 |
| 30 | True | order_book_delta | 5000000 | 2023-12-01 18:10:57.296 | 2023-12-01 18:46:53.795 |
| 31 | True | order_book_delta | 5000000 | 2023-12-01 18:46:53.795 | 2023-12-01 19:24:08.298 |
| 32 | True | order_book_delta | 5000000 | 2023-12-01 19:24:08.298 | 2023-12-01 20:10:26.276 |
| 33 | True | order_book_delta | 5000000 | 2023-12-01 20:10:26.279 | 2023-12-01 20:59:14.398 |
| 34 | True | order_book_delta | 5000000 | 2023-12-01 20:59:14.398 | 2023-12-01 21:55:34.699 |
| 35 | True | order_book_delta | 5000000 | 2023-12-01 21:55:34.699 | 2023-12-01 23:02:46.255 |
| 36 | True | order_book_delta | 3894819 | 2023-12-01 23:02:46.256 | 2023-12-01 23:59:49.003 |
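The per-partition figures above can be reproduced with a check along these lines (a sketch; the catalog path and part-N.parquet glob are assumptions based on the default layout):
#+begin_src jupyter-python
from pathlib import Path

import pandas as pd

# Assumed location of the per-partition parquet files for this instrument.
base = Path("~/tmp/catalog/data/order_book_delta/BTCUSDT-PERP.BINANCE").expanduser()

rows = []
parts = sorted(base.glob("part-*.parquet"), key=lambda p: int(p.stem.split("-")[1]))
for i, part in enumerate(parts):
    ts = pd.read_parquet(part, columns=["ts_init"])["ts_init"]
    rows.append(
        {
            "partition": i,
            "monotonic": ts.is_monotonic_increasing,
            "dtype": "order_book_delta",
            "count": len(ts),
            "start": pd.to_datetime(ts.min(), unit="ns"),
            "end": pd.to_datetime(ts.max(), unit="ns"),
        }
    )
pd.DataFrame(rows).set_index("partition")
#+end_src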
@dimitar-petrov
Thank you for the detailed report, we'll look into this.
Update: @cjdsellers, I noticed that reading order_book_deltas from the catalog does not always return the data sorted.
#+begin_src jupyter-python
%%time
import pandas as pd

from nautilus_trader.core.datetime import dt_to_unix_nanos
from nautilus_trader.model.identifiers import InstrumentId


def get_data(catalog, instrument_str, start, end):
    start = dt_to_unix_nanos(pd.Timestamp(start, tz="UTC"))
    end = dt_to_unix_nanos(pd.Timestamp(end, tz="UTC"))
    inst_id = InstrumentId.from_str(instrument_str)
    quotes = catalog.quote_ticks(instrument_ids=[inst_id], start=start, end=end)
    trades = catalog.trade_ticks(instrument_ids=[inst_id], start=start, end=end)
    deltas = catalog.order_book_deltas(instrument_ids=[inst_id], start=start, end=end)
    return quotes, trades, deltas


# Query the same range repeatedly and check whether each result comes back sorted.
stats = []
for i in range(10):
    quotes, trades, deltas = get_data(catalog, instrument_str, START, END)
    qdf = pd.DataFrame(q.to_dict(q) for q in quotes)
    tdf = pd.DataFrame(t.to_dict(t) for t in trades)
    odf = pd.DataFrame(o.to_dict(o) for o in deltas)
    stats.append(
        {
            "trades_size": len(tdf),
            "trades_sorted": tdf.ts_init.is_monotonic_increasing,
            "quote_size": len(qdf),
            "quote_sorted": qdf.ts_init.is_monotonic_increasing,
            "deltas_size": len(odf),
            "deltas_sorted": odf.ts_init.is_monotonic_increasing,
        }
    )
pd.DataFrame(stats)
#+end_src
#+RESULTS:
:RESULTS:
: CPU times: user 1.46 s, sys: 21.8 ms, total: 1.48 s
: Wall time: 1.43 s
| | trades_size | trades_sorted | quote_size | quote_sorted | deltas_size | deltas_sorted |
|---+-------------+---------------+------------+--------------+-------------+---------------|
| 0 | 170 | True | 3180 | True | 16897 | False |
| 1 | 170 | True | 3180 | True | 16897 | True |
| 2 | 170 | True | 3180 | True | 16897 | False |
| 3 | 170 | True | 3180 | True | 16897 | False |
| 4 | 170 | True | 3180 | True | 16897 | True |
| 5 | 170 | True | 3180 | True | 16897 | False |
| 6 | 170 | True | 3180 | True | 16897 | False |
| 7 | 170 | True | 3180 | True | 16897 | False |
| 8 | 170 | True | 3180 | True | 16897 | False |
| 9 | 170 | True | 3180 | True | 16897 | False |
:END:
If I read the file directly with either polars or pandas, the ts_init column is sorted:
#+begin_src jupyter-python
from collections import Counter

import pandas as pd
import polars as pl

path = "~/tmp/catalog/data/order_book_delta/ARDRUSDT.BINANCE/part-0.parquet"
monotonic = []
for i in range(50):
    # Read with polars (projecting all columns explicitly) and check ordering.
    monotonic.append(
        pl.read_parquet(path, columns=list(pl.read_parquet_schema(path).keys()))
        .to_pandas()
        .ts_init.is_monotonic_increasing
    )
    # Read the same file with pandas and check ordering.
    monotonic.append(
        pd.read_parquet(path).ts_init.is_monotonic_increasing
    )
Counter(monotonic)
#+end_src
#+RESULTS:
: Counter({True: 100})
Hope that helps. Let me know if you manage to reproduce it.
@twitu Let me know if I can assist with anything regarding the investigation and resolution of this issue.
Hi @dimitar-petrov, thanks for the detailed report. What is the dataset you are using? Is it possible for you to share the data that's causing this issue? Since the repo datasets don't cause this issue in the automated tests, it's difficult to reproduce.
You can also reach out on Discord and share the dataset there if that's more comfortable.
@twitu, I prepared a smaller dataset and am ready to ship it to you on Discord.
Thank you for being patient, @dimitar-petrov. There was a tricky configuration issue with how DataFusion was reading large files, which was leading to this weird issue.
https://github.com/nautechsystems/nautilus_trader/pull/1656
This PR has the fix. Please give it a try and share if things are working properly.
Bug Report
Expected Behavior
Data from multiple datasets should be streamed following the temporal dependencies.
Actual Behavior
I created a streaming backtesting engine with batch_size_bytes=2**17 and a DummyStrategy that logs whenever it receives any data, using the following statement. I noticed that ts_init is not monotonically increasing, which means that the timestamps of events are not received chronologically. I also patched backtest/node.py as follows to debug.
Logs
How are data sources with different volumes synchronized when passed to the engine? I can see that DataBackendSession is queried with a specific chunk size, but is the Rust side of that object taking care of synchronizing the multiple data sets?
Specifications
nautilus_trader version: 1.75
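For reference, the streaming setup described above is configured roughly like this. This is only a sketch: batch_size_bytes=2**17 and the catalog/instrument come from the report, while the venue settings, strategy wiring, and exact import paths are assumptions that may differ between versions.
#+begin_src jupyter-python
from nautilus_trader.backtest.node import BacktestNode
from nautilus_trader.config import (
    BacktestDataConfig,
    BacktestEngineConfig,
    BacktestRunConfig,
    BacktestVenueConfig,
)
from nautilus_trader.model.data import OrderBookDelta

config = BacktestRunConfig(
    engine=BacktestEngineConfig(),  # the DummyStrategy would be added here
    venues=[
        BacktestVenueConfig(
            name="BINANCE",
            oms_type="NETTING",
            account_type="MARGIN",
            starting_balances=["100000 USDT"],
        ),
    ],
    data=[
        # Similar BacktestDataConfig entries would be added for QuoteTick and TradeTick.
        BacktestDataConfig(
            catalog_path="~/tmp/catalog",
            data_cls=OrderBookDelta,
            instrument_id="BTCUSDT-PERP.BINANCE",
        ),
    ],
    # Stream the catalog data to the engine in ~128 KiB batches instead of
    # loading everything up front; this is the streaming mode in question.
    batch_size_bytes=2**17,
)

node = BacktestNode(configs=[config])
results = node.run()
#+end_src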