nkaz001 / hftbacktest

A high-frequency trading and market-making backtesting tool in Python and Rust, which accounts for limit orders, queue positions, and latencies, utilizing full tick data for trades and order books, with real-world crypto market-making examples for Binance Futures
MIT License
1.78k stars, 357 forks

Feat: Update convert function in tardis.py to handle .parquet files #120

Closed ian-wazowski closed 1 month ago

ian-wazowski commented 1 month ago

Changes

    for file in input_files:
        print('Reading %s' % file)
        if file.endswith('.csv'):
            df = pl.read_csv(file)
        elif file.endswith('.parquet'):
            df = pl.read_parquet(file, pyarrow_options={'use_threads': True})
        else:
            raise ValueError('Unsupported file format: %s' % file)
        if df.columns == trade_cols:
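
One thing to watch in the branch above: `'trades.csv.gz'.endswith('.csv')` is `False`, so a gzip-compressed CSV would fall through to the `ValueError`. A minimal sketch of a suffix check that accepts both, assuming a hypothetical helper name `classify_input` (illustrative only, not part of hftbacktest):

```python
def classify_input(path: str) -> str:
    """Return 'csv' or 'parquet' based on the file suffix.

    Hypothetical helper for illustration; not part of hftbacktest.
    """
    lowered = path.lower()
    # str.endswith accepts a tuple, so plain and gzip-compressed CSV share a branch.
    if lowered.endswith(('.csv', '.csv.gz')):
        return 'csv'
    if lowered.endswith('.parquet'):
        return 'parquet'
    raise ValueError('Unsupported file format: %s' % path)
```

The loop could then choose between `pl.read_csv` and `pl.read_parquet` based on the returned label.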

Related

discord chat

nkaz001 commented 1 month ago

Tardis.dev provides the file in .csv.gz format. By the way, does Tardis also provide data in parquet format?

ian-wazowski commented 1 month ago

> Tardis.dev provides the file in .csv.gz format. By the way, does Tardis also provide data in parquet format?

No. I'm downloading the Tardis dataset and then converting it to Parquet (lz4 compression, column-wise encoding).

Parquet is 10x faster to read than csv.gz, and the compression ratio improves by about 10-15%.
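
A minimal sketch of that conversion step, assuming polars is installed. The function names and output naming are hypothetical. Only `pl.read_csv` and `DataFrame.write_parquet` are polars calls, and the sketch assumes `pl.read_csv` decompresses `.gz` input on its own:

```python
from pathlib import Path


def parquet_path_for(csv_path: str) -> Path:
    """Map 'x/name.csv.gz' or 'x/name.csv' to 'x/name.parquet' (hypothetical naming)."""
    p = Path(csv_path)
    name = p.name
    # Check the longer suffix first so '.csv.gz' is stripped as a whole.
    for suffix in ('.csv.gz', '.csv'):
        if name.endswith(suffix):
            return p.with_name(name[:-len(suffix)] + '.parquet')
    raise ValueError('Expected a .csv or .csv.gz file: %s' % csv_path)


def convert_to_parquet(csv_path: str) -> Path:
    """Read one CSV file and rewrite it as an lz4-compressed Parquet file."""
    import polars as pl  # imported here so the path helper works without polars

    out = parquet_path_for(csv_path)
    df = pl.read_csv(csv_path)  # assumption: gzip input is decompressed by polars
    df.write_parquet(out, compression='lz4')
    return out
```

Timing `pl.read_csv` on the original file against `pl.read_parquet` on the output would be one way to check the read-speed difference on a given dataset.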

nkaz001 commented 1 month ago

The processing time required to convert raw Tardis data into Parquet format also needs to be taken into account. In any case, I believe this is better provided as a separate data utility, since the input has already been processed and is no longer the raw Tardis data.