Open cuberone opened 3 weeks ago
Adding `query += f' ORDER BY ts_init ASC'` to `parquet.py` solves my problem.
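For context, the fix amounts to appending an `ORDER BY` clause to the SQL that the catalog issues against the parquet data. A minimal sketch of the idea (the helper name, base query, and table name here are illustrative, not the actual `parquet.py` code):

```python
def build_query(table_name, where=None):
    """Build a catalog query string; illustrative only, not the real parquet.py logic."""
    query = f"SELECT * FROM {table_name}"
    if where:
        query += f" WHERE {where}"
    # The fix from the comment above: force deterministic ordering by ts_init.
    query += " ORDER BY ts_init ASC"
    return query

print(build_query("quote_tick", "instrument_id = 'BTCUSDT.P'"))
```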
The query and configuration have been updated in a very recent commit. Can you try running the same with the latest `develop`?
Also, to verify the row groups you can use the scripts in the experiments repo:

```shell
python extract_ts_init.py <parquet.file> <output.csv>
python check_invariant.py <output.csv>
```
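For reference, the invariant being checked is that `ts_init` never decreases across rows. A minimal sketch of such a check (`check_invariant.py` in the experiments repo is the authoritative version; the single-column, headerless CSV layout assumed here is a guess):

```python
import csv
import sys

def check_invariant(ts_values):
    """Return (index, previous, current) for every row where ts_init decreases.

    An empty result means ts_init is monotonically non-decreasing.
    """
    violations = []
    prev = None
    for i, ts in enumerate(ts_values):
        if prev is not None and ts < prev:
            violations.append((i, prev, ts))
        prev = ts
    return violations

if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumed layout: one ts_init value per row, no header.
    with open(sys.argv[1], newline="") as f:
        values = [int(row[0]) for row in csv.reader(f)]
    for i, prev, cur in check_invariant(values):
        print(f"row {i}: ts_init {cur} < previous {prev}")
```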
Better, but I still have overlaps (overlapping ranges in bold):
left=2022-08-04 23:59:59.064000, right=2022-08-05 14:45:23.647000
**left=2022-08-05 14:18:54.041000**, right=2022-08-06 15:33:02.780000
left=2022-08-06 15:33:02.781000, right=2022-08-07 23:13:44.271000
**left=2022-08-07 22:07:58.330000**, right=2022-08-08 15:23:17.163000
left=2022-08-08 15:23:17.204000, right=2022-08-09 10:57:06.002000
**left=2022-08-09 10:30:18.964000**, right=2022-08-10 01:54:01.909000
left=2022-08-10 01:54:01.920000, right=2022-08-10 14:19:11.231000
**left=2022-08-10 13:54:42.202000**, right=2022-08-11 02:35:14.550000
left=2022-08-11 02:35:14.780000, right=2022-08-11 14:38:31.114000
**left=2022-08-11 14:14:16.760000**, right=2022-08-12 05:22:41.413000
left=2022-08-04 23:59:59.064000, right=2022-08-12 05:22:41.413000
left=2022-08-12 05:22:41.423000, right=2022-08-19 08:48:40.846000
left=2022-08-19 08:48:40.846000, right=2022-08-26 13:02:26.749000
left=2022-08-26 13:02:26.749000, right=2022-09-07 18:47:38.892000
left=2022-09-07 18:47:38.930000, right=2022-09-13 12:53:38.882000
left=2022-09-13 12:53:38.882000, right=2022-09-19 22:04:35.665000
left=2022-09-19 22:04:35.665000, right=2022-09-26 14:02:38.785000
left=2022-09-26 14:02:38.785000, right=2022-10-03 06:51:29.801000
**left=2022-10-03 06:07:46.222000**, right=2022-10-13 05:13:00.931000
**left=2022-10-13 03:03:31.569000**, right=2022-10-25 01:27:48.375000
**left=2022-10-24 23:44:30.731000**, right=2022-11-05 11:03:21.601000
**left=2022-11-05 09:29:46.009000**, right=2022-11-10 02:41:44.591000
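The overlap condition on these chunk ranges can be checked mechanically: consecutive chunks overlap whenever the next `left` falls before the previous `right`. A short illustrative helper (not from the nautilus or experiments code):

```python
def find_overlaps(ranges):
    """Given a list of (left, right) chunk boundaries in the order they
    were streamed, return the consecutive pairs where the next chunk
    starts before the previous one ends (left[i+1] < right[i])."""
    overlaps = []
    for (l1, r1), (l2, r2) in zip(ranges, ranges[1:]):
        if l2 < r1:
            overlaps.append(((l1, r1), (l2, r2)))
    return overlaps

# With integer timestamps for brevity: the second chunk starts at 5,
# before the first chunk ends at 10, so it is reported as an overlap.
print(find_overlaps([(0, 10), (5, 20), (20, 30)]))
```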
When I read the parquet file sequentially, I see unordered data. I suspect it was written in an unordered way despite the initial list being ordered. Have you tried `sorting_columns` in `write_table`?
Based on the documentation for `sorting_columns`, it seems to be purely for metadata purposes and is not used for actually sorting or verifying sorted data when writing. DataFusion has a couple of ongoing features that might use this data, but currently that doesn't seem to be the case:

> Specify the sort order of the data being written. The writer does not sort the data nor does it verify that the data is sorted. The sort order is written to the row group metadata, which can then be used by readers.
A small example that reproduces the unordered writing would help find and debug whether there's an issue with the nautilus data writers.
I used Binance BTCUSDT.P aggregated ticks for August 2022; please sort them and write them to a catalog.
Bug Report
Chunk ranges overlap when streaming a backtest.
Steps to Reproduce the Problem
Run a streaming backtest with `BacktestNode`.
Specifications
nautilus_trader version: 194