nkaz001 / hftbacktest

A high-frequency trading and market-making backtesting and trading bot in Python and Rust, which accounts for limit orders, queue positions, and latencies, utilizing full tick data for trades and order books, with real-world crypto market-making examples for Binance Futures
MIT License

Abnormal differences between reading a single file vs. reading chunked files #16

Closed. spacegoing closed this issue 1 year ago.

spacegoing commented 1 year ago

Hi nkaz001,

Many thanks for this awesome project!

Quick question: my orderbook data is too large to fit in memory, so I need to split a single file into multiple chunk files.

The problem is that the backtest result differs between reading a single file and reading multiple chunk files.

To test this, I first created a single file, btcusdt_20230405.npz, from btcusdt_20230405.dat.gz using convert() as documented. I then split that file into 7 subfiles and ran the two experiments by changing only hbt's data input, like this:

[screenshot: the two hbt setups]

Left: reading the chunked files. Right: reading the single file.
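
For reference, here is a minimal sketch of the two setups, following the documented Python API; the tick size, fees, chunk file names, and snapshot below are illustrative rather than the exact values from my script:

```python
from hftbacktest import HftBacktest, FeedLatency, SquareProbQueueModel, Linear

def make_hbt(data_files):
    # Everything except the data list is identical between the two runs.
    return HftBacktest(
        data_files,
        tick_size=0.1,
        lot_size=0.001,
        maker_fee=-0.00005,
        taker_fee=0.0007,
        order_latency=FeedLatency(),
        queue_model=SquareProbQueueModel(),
        asset_type=Linear,
        snapshot='btcusdt_20230404_eod.npz',
    )

# Right: the whole day as a single file.
hbt_single = make_hbt(['btcusdt_20230405.npz'])

# Left: the same day split into 7 chunk files, passed in order.
hbt_chunked = make_hbt([f'test_chunks/btcusdt_20230405_{i}.npz' for i in range(7)])
```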

However, the backtest results are different.

Single-file result:

[screenshot: single-file backtest result]

Chunked-file result:

[screenshot: chunked-file backtest result]

As a possible clue: I did some digging and found it very strange that from the 1324th environment step (elapse step) onward, the current_timestamp of the chunked hbt and the single-file hbt diverge:

[screenshot: current_timestamp comparison from step 1324 onward]

I double-checked the underlying data; it is identical in both cases.

I have two questions:

  1. Could you please look into the reason for this difference?
  2. What is the best practice for loading a huge orderbook dataset?

My email is spacegoing@gmail.com. I am more than happy to have a quick chat with you to help you reproduce this issue.

Many thanks!

nkaz001 commented 1 year ago

Can you please tell me the specific indexes used to divide the data so I can try to replicate the issue? Also, minor differences may occur when the data is split and the feed latency model is used, due to a limitation of the feed latency model: it cannot search the subsequent or preceding data chunks when trying to find an available feed latency. But I'm not sure whether this is the case here.
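
To illustrate what I mean by that limitation, here is a conceptual sketch, not the library's actual implementation; the column layout and the "non-positive local timestamp means invalid" convention are assumptions for the sake of the example. When a row has no valid local timestamp, the nearest valid latency must be found within the current chunk, so a value that sits just across a chunk boundary is invisible to the search:

```python
import numpy as np
from typing import Optional

# Assumed row layout: event, exch_timestamp, local_timestamp, side, price, qty.
# A non-positive local timestamp marks a row with no valid feed latency.
COL_EXCH_TS, COL_LOCAL_TS = 1, 2

def nearest_valid_latency(chunk: np.ndarray, row: int) -> Optional[float]:
    """Return the feed latency (local - exchange timestamp) of the nearest row
    within this chunk that has a valid local timestamp."""
    n = chunk.shape[0]
    for offset in range(n):
        for r in (row - offset, row + offset):
            if 0 <= r < n and chunk[r, COL_LOCAL_TS] > 0:
                return float(chunk[r, COL_LOCAL_TS] - chunk[r, COL_EXCH_TS])
    # With a single file, the search could keep going into what is now a
    # neighboring chunk; when the data is split, it stops at the boundary.
    return None
```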

Regarding large datasets, having ample memory is the most effective approach. Like you, I use chunking and have implemented lazy loading. Still, it is better to have enough memory to load at least an entire day's data in one go.
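
As a side note on lazy loading, one option (an idea, not something from the library docs) is to store each day as an uncompressed .npy file so it can be memory-mapped instead of fully loaded; the 'data' key name below is assumed:

```python
import numpy as np

# Convert a compressed day file to an uncompressed .npy once...
with np.load('btcusdt_20230405.npz') as npz:
    np.save('btcusdt_20230405.npy', npz['data'])  # 'data' key name assumed

# ...then memory-map it so pages are read from disk on demand instead of
# holding the whole day in RAM.
day = np.load('btcusdt_20230405.npy', mmap_mode='r')
```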

spacegoing commented 1 year ago

Hi nkaz, many thanks for your reply!

I have uploaded my scripts and data here: https://github.com/spacegoing/hbt/tree/master/examples

btcusdt_20230404_eod.npz and btcusdt_20230405.npz are generated from documentation scripts.

I then use save_chunks_npz.py to split btcusdt_20230405.npz into roughly 2 MB chunk files in the folder test_chunks. I double-checked that the chunked files are identical to the original file.
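
For anyone reading along, the chunking is conceptually just slicing the data array by rows and saving each slice. A rough sketch, where the 'data' key and the rows-per-chunk value are assumptions rather than the exact contents of save_chunks_npz.py:

```python
import os
import numpy as np

with np.load('btcusdt_20230405.npz') as npz:
    data = npz['data']              # key name assumed

os.makedirs('test_chunks', exist_ok=True)
rows_per_chunk = 40_000             # illustrative; chosen so chunks come out around 2 MB
for i, start in enumerate(range(0, len(data), rows_per_chunk)):
    np.savez_compressed(f'test_chunks/btcusdt_20230405_{i}.npz',
                        data=data[start:start + rows_per_chunk])
```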

I then use learn_pmm.py to test both the single-file and chunked-file cases; there is also some test code under main.

Please let me know if there is anything else I can provide. Many thanks!

spacegoing commented 1 year ago

I tried ConstantLatency and RiskAverseQueueModel, and everything works as expected. Many thanks, nkaz!
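
For anyone hitting the same discrepancy, this is roughly what the working setup looks like; the tick size, fees, and latency values here are illustrative, not copied from my script:

```python
from hftbacktest import HftBacktest, ConstantLatency, RiskAverseQueueModel, Linear

hbt = HftBacktest(
    [f'test_chunks/btcusdt_20230405_{i}.npz' for i in range(7)],
    tick_size=0.1,
    lot_size=0.001,
    maker_fee=-0.00005,
    taker_fee=0.0007,
    # ConstantLatency does not derive latency from the feed data, so chunk
    # boundaries cannot affect it.
    order_latency=ConstantLatency(entry_latency=50_000, response_latency=50_000),
    queue_model=RiskAverseQueueModel(),
    asset_type=Linear,
    snapshot='btcusdt_20230404_eod.npz',
)
```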