nkaz001 / hftbacktest

A high-frequency trading and market-making backtesting and trading bot in Python and Rust, which accounts for limit orders, queue positions, and latencies, utilizing full tick data for trades and order books, with real-world crypto market-making examples for Binance Futures

Backtests get stuck on some parts of Tardis data #31

Closed · artiko88 closed 1 year ago

artiko88 commented 1 year ago

Hi!

I feel a bit awkward asking so many questions, but I am knee-deep in using this beautiful piece of software and it provides an opportunity to learn HFT at lightning speed, for which I am sincerely grateful.

So I am using Tardis as a source of data, and I notice that when I try to run a backtest on a month's worth of data, it sometimes gets stuck while loading some of the files. It can happen right on the first file or somewhere in the middle, in the 'loading file' phase. So I just skip the file, generate a new snapshot, and that may help. Usually I have to skip two days of data, presumably because I need to make a new snapshot, and a snapshot based on data that can't get through the loading phase would also prevent the backtest from working.
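(For context, the snapshot regeneration mentioned here can be scripted. A minimal sketch, assuming hftbacktest's v1 create_last_snapshot utility and the SUIUSDT tick/lot sizes that appear further below; the file name is illustrative:)

from hftbacktest.data.utils.snapshot import create_last_snapshot

# Build an end-of-day snapshot from a converted day file so the next
# day's backtest can be seeded from it.
create_last_snapshot(
    ['data/SUIUSDT/binance-futures_2023-07-01.npz'],
    tick_size=0.0001,
    lot_size=1,
    output_snapshot_filename='data/SUIUSDT/binance-futures_2023-07-01_SNAPSHOT.npz'
)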

Providing sample files would be asking too much, so maybe you could point me in a direction on where to look? I mean, maybe there are some corrupt values in the array that prevent the backtester from processing further. I checked those files for NaNs or outliers and can't find anything that seems strange or somehow different from the data in files that load just fine.
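(A quick check of that kind might look like the following; this assumes the converted layout where columns 1 and 2 are the exchange and local timestamps and column 4 is the price:)

import numpy as np

data = np.load('data/SUIUSDT/binance-futures_2023-07-02.npz')['data']

# Any NaNs anywhere in the array?
print('NaNs present:', np.isnan(data).any())

# Crude outlier scan on price (column 4 in the assumed layout).
prices = data[data[:, 4] > 0, 4]
print('price min/max:', prices.min(), prices.max())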

To exclude an ill-coded function as the cause of the data not being able to load, I created this one and pass an instance of hbt to it:

from numba import njit

@njit(cache=True)
def pazz(hbt):
    # Do nothing each step: if this still hangs, the problem is in the
    # data loading/processing, not in any strategy logic.
    interval = 1_000_000  # 1-second steps (timestamps are in microseconds)

    while hbt.elapse(interval):
        pass

And this is how I start it:

from hftbacktest import (
    HftBacktest,
    FeedLatency,
    SquareProbQueueModel,
    Linear,
    PartialFillExchange,
)

base = 'SUI'
start_date = 20230702
end_date = 20230731

def fmt_date(date):
    s = str(date)
    return '{}-{}-{}'.format(s[:4], s[4:6], s[6:])

hbt = HftBacktest(
    [
        # Iterating integer YYYYMMDD dates works here only because the
        # whole range stays within a single month.
        'data/{}USDT/binance-futures_{}.npz'.format(base, fmt_date(date))
        for date in range(start_date, end_date)
    ],
    tick_size=0.0001,
    lot_size=1,
    maker_fee=-0.0001,
    taker_fee=0.0005,
    order_latency=FeedLatency(),
    queue_model=SquareProbQueueModel(),
    asset_type=Linear,
    exchange_model=PartialFillExchange,
    snapshot='data/{}USDT/binance-futures_{}_SNAPSHOT.npz'.format(base, fmt_date(start_date - 1))
)

pazz(hbt)

And I still get stuck at loading the first file. If I start from the 25th, it loads nicely through the 31st.

nkaz001 commented 1 year ago

Thanks for the report.

Does the specific file (date) have a problem? Or does the problematic file vary depending on the start date and end date configuration?

Could you try running without the order part?

For example:

import numpy as np
from numba import njit

@njit
def test(hbt):
    # Record (timestamp, best bid, best ask) once per minute with no
    # order logic at all, to isolate the hang to the feed side.
    tmp = np.empty((100_000_000, 3), np.float64)  # generously sized buffer
    i = 0
    while hbt.elapse(60_000_000):  # advance in 60-second steps
        hbt.clear_last_trades()
        tmp[i] = [hbt.current_timestamp, hbt.best_bid, hbt.best_ask]
        i += 1
    return tmp[:i]
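For reference, a run of this probe under the setup above might look like the following; if it completes, the tail shows the final feed state, and narrowing the date range helps bisect a file that hangs:

stats = test(hbt)
# Last recorded [timestamp, best_bid, best_ask] rows.
print(stats[-5:])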
nkaz001 commented 1 year ago

Ah, you already did it. Could you also check if the timestamps are well-aligned?

It would be better if I could see the data file.

artiko88 commented 1 year ago

OK, here are the results of the checks I performed on files which go through backtests nicely and on the one that gets stuck:

Here is the code for the test:

import numpy as np

ndata = np.load('data/binance-futures_2023-07-25.npz')
data = ndata['data']

# Check that the local timestamp (column 2) is never earlier than the
# exchange timestamp (column 1).
invalid_timestamps = data[~(data[:, 2] >= data[:, 1])]
print("Rows where timestamp in column 2 is not bigger than timestamp in column 1:")
print(len(invalid_timestamps))

# Check that non-zero timestamps in columns 1 and 2 never decrease from
# one row to the next.
invalid_timestamps_col1 = data[1:][(data[1:, 1] < data[:-1, 1]) & (data[1:, 1] > 0)]
invalid_timestamps_col2 = data[1:][(data[1:, 2] < data[:-1, 2]) & (data[1:, 2] > 0)]
print("Rows where timestamp in column 1 is not higher than in previous row:")
print(len(invalid_timestamps_col1))
print("Rows where timestamp in column 2 is not higher than in previous row:")
print(len(invalid_timestamps_col2))

# Check for rows with a zero timestamp in either column.
zero_timestamps = data[(data[:, 1] == 0) | (data[:, 2] == 0)]
print("Rows with zero timestamp:")
print(len(zero_timestamps))

Results for the file that gets stuck:

Rows where timestamp in column 2 is not bigger than timestamp in column 1:
37282
Rows where timestamp in column 1 is not higher than in previous row:
0
Rows where timestamp in column 2 is not higher than in previous row:
0
Rows with zero timestamp:
0

Snapshot of this file:

Rows where timestamp in column 2 is not bigger than timestamp in column 1:
2734
Rows where timestamp in column 1 is not higher than in previous row:
0
Rows where timestamp in column 2 is not higher than in previous row:
0
Rows with zero timestamp:
0

Results for the file that gets through:

Rows where timestamp in column 2 is not bigger than timestamp in column 1:
35509
Rows where timestamp in column 1 is not higher than in previous row:
0
Rows where timestamp in column 2 is not higher than in previous row:
0
Rows with zero timestamp:
0

Snapshot of this file:

Rows where timestamp in column 2 is not bigger than timestamp in column 1:
89
Rows where timestamp in column 1 is not higher than in previous row:
0
Rows where timestamp in column 2 is not higher than in previous row:
0
Rows with zero timestamp:
0

Here is the link to a snapshot of the previous day plus the following day's merged data.

nkaz001 commented 1 year ago

I found that the Tardis snapshot data processing code was wrong, causing missing market depth where the snapshot occurs. I guess that might be the cause. The fix is 0fcddbf3625a6a222cd4b8f5f52ce9af5685f623. Could you regenerate the data and check if it's still stuck? If so, please provide the regenerated data again.
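(For anyone following along, regenerating here means re-running the Tardis conversion on the raw CSVs with the fixed code. A minimal sketch, assuming the v1 Python data utility tardis.convert; the file names are illustrative:)

from hftbacktest.data.utils import tardis

# Re-run the conversion of the raw Tardis trades + incremental L2 book
# CSVs with the fixed processing code.
data = tardis.convert(
    ['binance-futures_trades_2023-07-02_SUIUSDT.csv.gz',
     'binance-futures_incremental_book_L2_2023-07-02_SUIUSDT.csv.gz'],
    output_filename='data/SUIUSDT/binance-futures_2023-07-02.npz'
)

Snapshots built from the old conversion carry the same bug, so they should be rebuilt from the regenerated data as well (as in the earlier create_last_snapshot sketch).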

artiko88 commented 1 year ago

Just plowed through a month of data and it hasn't even coughed!

Thank you, you are fantastic!

nkaz001 commented 1 year ago

Thank you for confirming, and for your report to improve this project =)

artiko88 commented 1 year ago

@nkaz001

Hi!

I am struggling with the same problem once again, still on data from Tardis and on the same time period (the whole of July), but this time the symbol is different.

And here is a link to the problematic data converted to hftbacktest format: https://drive.google.com/file/d/1_i9szbHS1hYhw_68hoEf_rdkgUtkd49l/view?usp=sharing

And here is the link to the source of this data, in case you want to test the conversion yourself: https://drive.google.com/file/d/1YIN69QtLbto56P8u9DPwH0nLLzO7ZC83/view?usp=sharing

nkaz001 commented 1 year ago

Will look into it.

nkaz001 commented 1 year ago

I kept making mistakes. f2451732e16bf955ca8731a93622e6e8f5b03311 fixes the issue, but you need to regenerate the data.

artiko88 commented 1 year ago

Hey! This is the best news I've received since posting my previous message here :) Will try it and report back.