nkaz001 / hftbacktest

A high-frequency trading and market-making backtesting and trading bot in Python and Rust, which accounts for limit orders, queue positions, and latencies, utilizing full tick data for trades and order books, with real-world crypto market-making examples for Binance Futures
MIT License

Normalize Binance futures orderbook data #27

Closed: quantitative-technologies closed this issue 1 year ago

quantitative-technologies commented 1 year ago

I wanted to take advantage of the freely available historical futures level-2 orderbook data from Binance.

It should be possible, by combining this with historical trade data (also available from Binance, I believe), to obtain normalized data for hftbacktest.

But I couldn't find this in the repo examples. I wanted to check whether it has already been done, so I don't waste time redoing it.

nkaz001 commented 1 year ago

There isn't. If you provide an example file so I can look into its format, I'll add an example converter.

nkaz001 commented 1 year ago

By the way, without a local timestamp indicating when you received the feed, accurate backtesting is not possible, as there is no feed latency information. While you can artificially generate a local timestamp by assuming feed latency, it is preferable to collect the data yourself for more reliable results.
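For instance, a minimal sketch of the artificial-local-timestamp approach, assuming a constant 10 ms feed latency (the array values and the latency figure are purely illustrative):

```python
import numpy as np

# Illustrative only: exchange timestamps (microseconds) from the feed.
exch_ts = np.array([1_593_561_600_000_000, 1_593_561_600_050_000])

# Assume a constant 10 ms feed latency; real latency jitters, which is
# exactly why self-collected local timestamps give more reliable results.
ASSUMED_FEED_LATENCY_US = 10_000

local_ts = exch_ts + ASSUMED_FEED_LATENCY_US
```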

quantitative-technologies commented 1 year ago

> There isn't. If you provide an example file so I can look into its format, I'll add an example converter.

Here is example LOB data for a single day: https://drive.google.com/file/d/1rVaDblmYJL0aPpgvdJ-fU9QFhMDga6f_/view?usp=sharing

Btw, I'd also be happy to write it myself, but wanted to make sure I wasn't reinventing the wheel.

quantitative-technologies commented 1 year ago

Yes, good point about the local timestamp. Thanks for the tip.

The artificial local timestamps are fine for my purposes at the moment.

nkaz001 commented 1 year ago

Trade data is also required. While it is still possible to backtest based only on depth data, it's meaningless, especially in high-frequency backtesting.

quantitative-technologies commented 1 year ago

Right. I was not suggesting trying to use OB data alone. Actually, I found your repo while looking for an implementation of inventory models, which of course need trade data to fit them.

The trade data is available from the Binance Public Data:

wget https://data.binance.vision/data/futures/um/daily/trades/BTCUSDT/BTCUSDT-trades-2020-07-01.zip

Here is the trade data corresponding to the above depth data.
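For reference, a minimal sketch of loading such a daily trades file; the column names below are my assumption about the Binance USDS-M futures dump layout and should be checked against the actual file:

```python
import pandas as pd

# Assumed layout of the Binance USDS-M futures daily trades dump;
# verify against the actual file (newer dumps include a header row).
columns = ['id', 'price', 'qty', 'quote_qty', 'time', 'is_buyer_maker']

trades = pd.read_csv('BTCUSDT-trades-2020-07-01.csv',
                     names=columns, header=None)
print(trades.head())
```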

nkaz001 commented 1 year ago

I added the converter. hftbacktest/data/utils/binancehistmktdata.py (a5d3f91)

Could you check if it works as expected? Again, in my experience, backtest results can exhibit significant discrepancies unless precise feed latency and order latency are used.
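As a usage sketch only: the call below is hypothetical, and the actual function names and signatures should be taken from binancehistmktdata.py itself:

```python
# Hypothetical usage; the actual function names and parameters must be
# taken from hftbacktest/data/utils/binancehistmktdata.py itself.
from hftbacktest.data.utils import binancehistmktdata as bhmd

# Assumed call: merge the historical depth and trades files into
# hftbacktest's normalized event format and write it out.
data = bhmd.convert(
    depth_filename='BTCUSDT_T_DEPTH_2020-07-01_depth_update.csv',  # placeholder
    trades_filename='BTCUSDT-trades-2020-07-01.csv',               # placeholder
    output_filename='btcusdt_20200701.npz',                        # placeholder
)
```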

quantitative-technologies commented 1 year ago

Excellent!

My plan was to look into the inventory MM model (which you gave an example of). I will report back if anything unexpected shows up.

I think you mean significant discrepancies between backtest and live trading results, but I am not doing any live trading at the moment. If you want me to try out one of your other examples with the Binance historical data, please let me know.

quantitative-technologies commented 1 year ago

I am getting an error using the following trade data, for ETHUSDT on 2022-10-03, as in your example notebook.

I think it is because the first row contains the column names, unlike in the previous example. My guess is that the format has changed in newer data.
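A minimal sketch of the kind of header-detection fix this needs (illustrative, not the actual commit): peek at the first row and skip it only when it is non-numeric:

```python
import csv

def has_header_row(filename: str) -> bool:
    # Peek at the first row; if its first field is not numeric, treat
    # the row as column names (newer Binance dumps include them).
    with open(filename, newline='') as f:
        first_row = next(csv.reader(f))
    try:
        float(first_row[0])
        return False
    except ValueError:
        return True
```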

nkaz001 commented 1 year ago

Thanks for the report. Please see the latest commit: 740feee413795ea2a196077926e5def9e123229b.

quantitative-technologies commented 1 year ago

Thanks for updating the code.

Now I can successfully run the data preparation notebook.

However, when I use the prepared data from Binance in the Guéant–Lehalle–Fernandez-Tapia Market Making Model and Grid Trading notebook, the trading intensity is off by a factor of about 2 from your calculated results. For example: [screenshot: hftbacktest_fit_2023-08-15_14-49-17]

It's as if there are only half as many trades in the data files obtained from Binance. To be safe, I added a 10 ms feed latency, but as expected that does not affect the fitted model parameters.

Note that I had to adjust for the fact that the Binance data is timestamped in milliseconds rather than microseconds.
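For example, a sketch of that adjustment, assuming a pandas column named time holding millisecond timestamps:

```python
import pandas as pd

# Illustrative: `time` holds Binance millisecond timestamps.
trades = pd.DataFrame({'time': [1664755200123, 1664755200456]})

# Scale milliseconds to the microseconds the examples expect.
trades['time'] = trades['time'] * 1_000
```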

Would it be possible to share your collected data for ETHUSDT futures on 2022-10-03 (e.g. on Google Drive)? That way people could reproduce your results, and I could also directly compare the trade data with Binance's.

nkaz001 commented 1 year ago

For your information, I used the trade stream instead of the aggTrade stream, which is the one currently officially documented but is aggregated.

quantitative-technologies commented 1 year ago

I'm not sure I understand, since I also used trade data from Binance rather than aggTrade. In fact, your converter does not even work on the Binance historical aggTrade data, though I don't see a need for it.

Unless you are suggesting that the trade data from Binance is in fact still aggregated?

Anyhow, my plan is to collect my own data from the stream, and then I can compare it with the historical data from Binance.
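Once both datasets are in hand, even a simple event count per file would expose the factor-of-2 gap; a sketch with placeholder file names:

```python
import pandas as pd

# Placeholder file names; both assumed to be one day of ETHUSDT trades.
hist = pd.read_csv('ETHUSDT-trades-2022-10-03.csv')        # Binance dump
live = pd.read_csv('ethusdt-trade-stream-2022-10-03.csv')  # self-collected

print('historical trades:', len(hist))
print('collected trades: ', len(live))
print('ratio:', len(live) / len(hist))
```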

nkaz001 commented 1 year ago

No. But the trade stream functions as expected, matching its description in the official spot API documentation, even though it is not outlined in the official futures API documentation. So I guess Binance's historical data also came from aggTrade. Comparison is the most effective way to figure this out.

quantitative-technologies commented 1 year ago

Another issue showed up: I was working with more recent data, and it has an additional undocumented field, trans_id. This changes the offsets of the other fields and breaks the converter.

Here is an example of the recent snapshot data: https://drive.google.com/file/d/1y-9nt9V-eB_OV3uSq4-dzBe-eOsQDt4S/view?usp=sharing
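A minimal sketch of the more robust approach, selecting fields by name so an extra column such as trans_id cannot shift the offsets; the file and column names here are assumptions:

```python
import pandas as pd

# Placeholder file name; read by column name rather than position, so an
# undocumented extra field such as `trans_id` cannot shift the offsets.
snapshot = pd.read_csv('ETHUSDT_T_DEPTH_snap.csv')

wanted = ['timestamp', 'side', 'price', 'qty']  # assumed field names
depth = snapshot[wanted]
```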

nkaz001 commented 1 year ago

See 2b3137c3c643e9a96621e8fb0c3cd46ab0922dde and let me know if it works as expected.

quantitative-technologies commented 1 year ago

Code looks much better now without hard-coded indices, and it processes the snapshot fine.

But now it fails on the convert function call in the validation step with an exception.

Here are the LOB data and trade data to reproduce this.

nkaz001 commented 1 year ago

See 7299d9a3968c7acc079dfffc5aad50c947e86cf2. I fixed the mingled-timestamp issue, but since the data has no local timestamp, there is no option other than sorting by exchange timestamp. That can cause another discrepancy; beware of that.
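A sketch of what sorting means here: a stable sort by exchange timestamp preserves file order among equal timestamps, but it can only approximate the live event order:

```python
import numpy as np

# Illustrative: exchange timestamps (microseconds) of the merged
# depth + trade event stream, not monotonic in the raw files.
exch_ts = np.array([1664755200123456, 1664755200123450, 1664755200123460])

# A stable sort keeps the file order of equal timestamps; without local
# timestamps this only approximates the order seen in live trading.
order = np.argsort(exch_ts, kind='stable')
sorted_ts = exch_ts[order]
```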

quantitative-technologies commented 1 year ago

Thanks! I tested it out and there were no more errors.

I'm not sure exactly what discrepancy you mean, but perhaps it will become more clear as I continue working on it.

nkaz001 commented 1 year ago

What I meant by that is that any difference from the live trading environment can cause a discrepancy.

phybrain commented 11 months ago

> However, when I use the prepared data from Binance in the Guéant–Lehalle–Fernandez-Tapia Market Making Model and Grid Trading notebook, the trading intensity is off by a factor of about 2 from your calculated results. […]

Could you provide the code for the Guéant–Lehalle–Fernandez-Tapia Market Making Model? :)

nkaz001 commented 11 months ago

You can find it on the tutorials page or in the examples directory.

phybrain commented 11 months ago

> You can find it on the tutorials page or in the examples directory.

Thanks.