Open cuberone opened 3 weeks ago
Adding `query += f' ORDER BY ts_init ASC'` to `parquet.py` solves my problem.
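For context, the fix amounts to appending an `ORDER BY` clause to the SQL that the catalog issues against the parquet data. A minimal sketch of the idea (the helper name, base query, and table name here are illustrative, not the actual `parquet.py` code):

```python
def build_query(table_name, where=None):
    """Build a catalog query string; illustrative only, not the real parquet.py logic."""
    query = f"SELECT * FROM {table_name}"
    if where:
        query += f" WHERE {where}"
    # The fix from the comment above: force deterministic ordering by ts_init.
    query += " ORDER BY ts_init ASC"
    return query

print(build_query("quote_tick", "instrument_id = 'BTCUSDT.P'"))
```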
The query and configuration have been updated in a very recent commit. Can you try running the same with the latest `develop`?
Also, to verify the row groups you can use the scripts in the experiments repo:

```shell
python extract_ts_init.py <parquet.file> <output.csv>
python check_invariant.py <output.csv>
```
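For reference, the invariant being checked is that `ts_init` never decreases across rows. A minimal sketch of such a check (`check_invariant.py` in the experiments repo is the authoritative version; the single-column, headerless CSV layout assumed here is a guess):

```python
import csv
import sys

def check_invariant(ts_values):
    """Return (index, previous, current) for every row where ts_init decreases.

    An empty result means ts_init is monotonically non-decreasing.
    """
    violations = []
    prev = None
    for i, ts in enumerate(ts_values):
        if prev is not None and ts < prev:
            violations.append((i, prev, ts))
        prev = ts
    return violations

if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumed layout: one ts_init value per row, no header.
    with open(sys.argv[1], newline="") as f:
        values = [int(row[0]) for row in csv.reader(f)]
    for i, prev, cur in check_invariant(values):
        print(f"row {i}: ts_init {cur} < previous {prev}")
```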
Better, but I still have overlaps (overlapping ranges in bold):
left=2022-08-04 23:59:59.064000, right=2022-08-05 14:45:23.647000
**left=2022-08-05 14:18:54.041000**, right=2022-08-06 15:33:02.780000
left=2022-08-06 15:33:02.781000, right=2022-08-07 23:13:44.271000
**left=2022-08-07 22:07:58.330000**, right=2022-08-08 15:23:17.163000
left=2022-08-08 15:23:17.204000, right=2022-08-09 10:57:06.002000
**left=2022-08-09 10:30:18.964000**, right=2022-08-10 01:54:01.909000
left=2022-08-10 01:54:01.920000, right=2022-08-10 14:19:11.231000
**left=2022-08-10 13:54:42.202000**, right=2022-08-11 02:35:14.550000
left=2022-08-11 02:35:14.780000, right=2022-08-11 14:38:31.114000
**left=2022-08-11 14:14:16.760000**, right=2022-08-12 05:22:41.413000
left=2022-08-04 23:59:59.064000, right=2022-08-12 05:22:41.413000
left=2022-08-12 05:22:41.423000, right=2022-08-19 08:48:40.846000
left=2022-08-19 08:48:40.846000, right=2022-08-26 13:02:26.749000
left=2022-08-26 13:02:26.749000, right=2022-09-07 18:47:38.892000
left=2022-09-07 18:47:38.930000, right=2022-09-13 12:53:38.882000
left=2022-09-13 12:53:38.882000, right=2022-09-19 22:04:35.665000
left=2022-09-19 22:04:35.665000, right=2022-09-26 14:02:38.785000
left=2022-09-26 14:02:38.785000, right=2022-10-03 06:51:29.801000
**left=2022-10-03 06:07:46.222000**, right=2022-10-13 05:13:00.931000
**left=2022-10-13 03:03:31.569000**, right=2022-10-25 01:27:48.375000
**left=2022-10-24 23:44:30.731000**, right=2022-11-05 11:03:21.601000
**left=2022-11-05 09:29:46.009000**, right=2022-11-10 02:41:44.591000
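The overlap condition on these chunk ranges can be checked mechanically: consecutive chunks overlap whenever the next `left` falls before the previous `right`. A short illustrative helper (not from the nautilus or experiments code):

```python
def find_overlaps(ranges):
    """Given a list of (left, right) chunk boundaries in the order they
    were streamed, return the consecutive pairs where the next chunk
    starts before the previous one ends (left[i+1] < right[i])."""
    overlaps = []
    for (l1, r1), (l2, r2) in zip(ranges, ranges[1:]):
        if l2 < r1:
            overlaps.append(((l1, r1), (l2, r2)))
    return overlaps

# With integer timestamps for brevity: the second chunk starts at 5,
# before the first chunk ends at 10, so it is reported as an overlap.
print(find_overlaps([(0, 10), (5, 20), (20, 30)]))
```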
When I read the parquet file sequentially, I see unordered data. I suspect it was written in an unordered way despite the initial list being ordered. Have you tried `sorting_columns` in `write_table`?
Based on the documentation for `sorting_columns`, it seems to be purely for metadata purposes and is not used for actually sorting or verifying sorted data when writing. DataFusion has a couple of ongoing features that might use this data, but currently that doesn't seem to be the case:

> Specify the sort order of the data being written. The writer does not sort the data nor does it verify that the data is sorted. The sort order is written to the row group metadata, which can then be used by readers.
A small example that reproduces the unordered writing would help find and debug whether there's an issue with the nautilus data writers.
I used Binance BTCUSDT.P aggregated ticks for August 2022; please sort them and write them to a catalog.
Bug Report
Chunk ranges overlap when streaming a backtest.
Steps to Reproduce the Problem
Run a streaming backtest with `BacktestNode`.
Specifications
nautilus_trader version: 194