Feat: Update convert function in tardis.py to handle .parquet files

nkaz001 / hftbacktest

A high-frequency trading and market-making backtesting and trading bot in Python and Rust, which accounts for limit orders, queue positions, and latencies, utilizing full tick data for trades and order books, with real-world crypto market-making examples for Binance Futures

MIT License

2.01k stars, 395 forks

Feat: Update convert function in tardis.py to handle .parquet files #120

Closed ian-wazowski closed 3 months ago

ian-wazowski commented 3 months ago

Changes

    for file in input_files:
        print('Reading %s' % file)
        if file.endswith('.csv'):
            df = pl.read_csv(file)
        elif file.endswith('.parquet'):
            # pyarrow_options only takes effect when use_pyarrow=True
            df = pl.read_parquet(file, use_pyarrow=True, pyarrow_options={'use_threads': True})
        else:
            raise ValueError('Unsupported file format: %s' % file)
        if df.columns == trade_cols:
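
The extension check in the change above can be sketched as a standalone helper. This is a minimal sketch, not code from hftbacktest: `detect_format` and `read_any` are hypothetical names, and accepting `.csv.gz` is an assumption based on the format Tardis.dev is said to provide in this thread. Polars is imported lazily so the dispatch logic can be exercised without it.

```python
def detect_format(path: str) -> str:
    """Hypothetical helper: map a file name to the reader it needs."""
    name = path.lower()
    # Assumption: gzip-compressed CSV (the Tardis.dev download format)
    # should be treated the same as plain CSV.
    if name.endswith(('.csv', '.csv.gz')):
        return 'csv'
    if name.endswith('.parquet'):
        return 'parquet'
    raise ValueError('Unsupported file format: %s' % path)


def read_any(path: str):
    """Hypothetical helper: read a CSV or Parquet file into a polars DataFrame."""
    import polars as pl  # imported here so detect_format has no dependencies

    if detect_format(path) == 'csv':
        return pl.read_csv(path)
    return pl.read_parquet(path)
```

With this split, an unsupported extension fails before any file is opened, and the column check that follows in the diff can operate on whatever `read_any` returns.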


nkaz001 commented 3 months ago

Tardis.dev provides the file in .csv.gz format. By the way, does Tardis also provide data in parquet format?

ian-wazowski commented 3 months ago

> Tardis.dev provides the file in .csv.gz format. By the way, does Tardis also provide data in parquet format?

No. I download the Tardis dataset and then convert it to Parquet myself (LZ4 compression, column-wise encoding).

The Parquet files are 10x faster to read than csv.gz, and the compression ratio improves by about 10-15%.
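
A conversion step like the one described could look roughly like the following. This is a minimal sketch under stated assumptions, not the commenter's actual script: `parquet_path_for` and `csv_to_parquet` are hypothetical names, and the only setting taken from the comment is LZ4 compression. Encoding choices are left to the polars defaults.

```python
from pathlib import Path


def parquet_path_for(src: str) -> str:
    """Hypothetical helper: derive the .parquet path for a .csv or .csv.gz file."""
    p = Path(src)
    for suffix in ('.csv.gz', '.csv'):
        if p.name.endswith(suffix):
            # Swap the CSV suffix for .parquet, keeping the same directory.
            return str(p.with_name(p.name[:-len(suffix)] + '.parquet'))
    raise ValueError('Not a CSV file: %s' % src)


def csv_to_parquet(src: str) -> str:
    """Hypothetical helper: convert one downloaded CSV file to LZ4-compressed Parquet."""
    import polars as pl  # imported here so parquet_path_for has no dependencies

    dst = parquet_path_for(src)
    pl.read_csv(src).write_parquet(dst, compression='lz4')
    return dst
```

The read speed and size figures quoted above depend on the data and the machine, so they are worth re-measuring on your own files.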

nkaz001 commented 3 months ago

The processing time required to convert raw Tardis data into Parquet format needs to be taken into account. In any case, since these Parquet files are already-processed data rather than raw Tardis data, I believe it's more appropriate to provide this as a separate data utility.
