microsoft / qlib

Qlib is an AI-oriented quantitative investment platform that aims to realize the potential, empower research, and create value using AI technologies in quantitative investment, from exploring ideas to implementing productions. Qlib supports diverse machine learning modeling paradigms, including supervised learning, market dynamics modeling, and RL.
https://qlib.readthedocs.io/en/latest/
MIT License

Backtest too slow when using my own data #1799

Closed TompaBay closed 1 month ago

TompaBay commented 1 month ago

I'm using CSMAR's data for the Chinese market with the following backtest code:

```python
from pprint import pprint

import pandas as pd

import qlib
from qlib.utils.time import Freq
from qlib.utils import flatten_dict
from qlib.backtest import backtest, executor
from qlib.contrib.evaluate import risk_analysis
from qlib.contrib.strategy import TopkDropoutStrategy

if __name__ == "__main__":
    qlib.init(provider_uri=r"../../benchmark/cn_data/qlib_data/")

    # Load the prediction scores and index them the way qlib expects:
    # a Series indexed by (datetime, instrument).
    score_df = pd.read_csv("../pred.csv")
    score_df["datetime"] = pd.to_datetime(score_df["datetime"])
    pred_score = score_df.set_index(["datetime", "instrument"])["score"]

    CSI300_BENCH = "SH000300"
    FREQ = "day"
    STRATEGY_CONFIG = {
        "topk": 50,
        "n_drop": 10,
        "signal": pred_score,
    }

    EXECUTOR_CONFIG = {
        "time_per_step": "day",
        "generate_portfolio_metrics": True,
        "verbose": True,
    }

    backtest_config = {
        "start_time": "2016-01-01",
        "end_time": "2016-12-31",
        "account": 100000000,
        "benchmark": CSI300_BENCH,
        "exchange_kwargs": {
            "trade_unit": 100,
            "freq": FREQ,
            "limit_threshold": 0.095,
            "deal_price": "close",
            "open_cost": 0.0015,
            "close_cost": 0.0025,
            "min_cost": 5,
        },
    }

    strategy_obj = TopkDropoutStrategy(**STRATEGY_CONFIG)
    executor_obj = executor.SimulatorExecutor(**EXECUTOR_CONFIG)

    portfolio_metric_dict, indicator_dict = backtest(
        executor=executor_obj, strategy=strategy_obj, **backtest_config
    )
    analysis_freq = "{0}{1}".format(*Freq.parse(FREQ))

    report_normal, positions_normal = portfolio_metric_dict.get(analysis_freq)
    analysis = dict()
    analysis["excess_return_without_cost"] = risk_analysis(
        report_normal["return"] - report_normal["bench"], freq=analysis_freq
    )
    analysis["excess_return_with_cost"] = risk_analysis(
        report_normal["return"] - report_normal["bench"] - report_normal["cost"], freq=analysis_freq
    )

    analysis_df = pd.concat(analysis)  # type: pd.DataFrame
    # log metrics
    analysis_dict = flatten_dict(analysis_df["risk"].unstack().T.to_dict())
    # print out results
    pprint(f"The following are analysis results of benchmark return({analysis_freq}).")
    pprint(risk_analysis(report_normal["bench"], freq=analysis_freq))
    pprint(f"The following are analysis results of the excess return without cost({analysis_freq}).")
    pprint(analysis["excess_return_without_cost"])
    pprint(f"The following are analysis results of the excess return with cost({analysis_freq}).")
    pprint(analysis["excess_return_with_cost"])
```
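For context, `pred.csv` is assumed to have three flat columns matching the `set_index(["datetime", "instrument"])` call above; a minimal synthetic example (the tickers and scores are made up for illustration):

```python
import pandas as pd

# Synthetic example of the expected pred.csv layout; the tickers and
# scores below are invented, not real predictions.
pd.DataFrame(
    {
        "datetime": ["2016-01-04", "2016-01-04", "2016-01-05"],
        "instrument": ["SH600000", "SZ000001", "SH600000"],
        "score": [0.012, -0.004, 0.007],
    }
).to_csv("pred_example.csv", index=False)
```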

The whole process is so slow that it takes about an hour to backtest just one year. What could be the reason? I do see the "future error", "no common_infra", and "nan in close" warnings.
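To narrow down where the hour goes, one option is to profile a single run with the standard library's cProfile; a minimal sketch, reusing the objects defined above:

```python
import cProfile
import pstats

# Profile one backtest run to see whether the time is spent loading
# quote data or simulating order execution.
cProfile.run(
    "backtest(executor=executor_obj, strategy=strategy_obj, **backtest_config)",
    "backtest.prof",
)
pstats.Stats("backtest.prof").sort_stats("cumulative").print_stats(30)
```

If most of the time turns out to be in quote loading, one guess (not confirmed here) is to restrict the exchange to the instruments that actually appear in `pred.csv`, e.g. via a `codes` entry in `exchange_kwargs`; recent qlib versions forward it to the `Exchange` constructor, but verify against the installed version.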

TompaBay commented 1 month ago

[screenshot: warning output during the backtest]

It does show warnings like the ones in the screenshot above.
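The "nan in close" warning usually points at gaps in the converted close prices over the backtest window. A minimal sketch to locate them, assuming the same `provider_uri` as above:

```python
import qlib
from qlib.data import D

qlib.init(provider_uri=r"../../benchmark/cn_data/qlib_data/")

# Count NaN close prices per instrument over the backtest window;
# any non-zero count is a candidate source of the "nan in close" warning.
df = D.features(
    D.instruments(market="all"),
    ["$close"],
    start_time="2016-01-01",
    end_time="2016-12-31",
    freq="day",
)
nan_counts = df["$close"].isna().groupby(level="instrument").sum()
print(nan_counts[nan_counts > 0].sort_values(ascending=False).head(20))
```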