microsoft / qlib

Qlib is an AI-oriented quantitative investment platform that aims to realize the potential, empower research, and create value using AI technologies in quantitative investment, from exploring ideas to implementing productions. Qlib supports diverse machine learning modeling paradigms, including supervised learning, market dynamics modeling, and RL.
https://qlib.readthedocs.io/en/latest/
MIT License
14.54k stars 2.53k forks

self.quote.get_data(stock_id, start_time, end_time, "$close") will fail with a different pd.Timestamp #1792

Closed jimrok closed 1 week ago

jimrok commented 1 month ago

🐛 Bug Description

I encountered a bug while conducting a backtest with qlib. I'm not sure how to resolve it, so I'll first report where the issue arises: https://github.com/microsoft/qlib/blob/main/qlib/backtest/exchange.py#L389. Here, to query data from the quote object, a start_time parameter is required. The start_time is sourced from the trade_calendar during the backtest, and the data from the trade_calendar is read from a list of calendar files. The time construction in the calendar uses pd.Timestamp(x), which presents a problem: the pd.Timestamp type does not include nanosecond information. If this parameter is used for querying, it will result in incorrect data retrieval. Below, I will provide a simplified code snippet to illustrate this issue:

To Reproduce

Steps to reproduce the behavior: the following code reproduces the result described in the bug description.

import qlib
from qlib.constant import REG_CN # region in [REG_CN, REG_US]
from qlib.data import D
from qlib.backtest.high_performance_ds import BaseQuote, NumpyQuote
import pandas as pd

freq = 'day'
start_time = '2023-01-03'
end_date = '2023-01-30'
provider_uri = 'd:/qlib_data/cn_data'
qlib.init(provider_uri=provider_uri, region=REG_CN)
codes = D.instruments()
all_fields = ["$close","$factor","$change","$volume"]
end_time = '2023-08-30'
quote_df = D.features(
            codes,
            all_fields,
            start_time,
            end_time,
            freq=freq,
            disk_cache=True,
        )

quote = NumpyQuote(quote_df,freq)

quote.get_data('000610.SZ', pd.to_datetime('2023-01-04'), pd.to_datetime('2023-01-04'), "$close")

# Print the index map: its keys are numpy.datetime64 values with nanosecond resolution.
#
# print(qdata['000610.SZ'].loc._indices[0].index_map)
# {numpy.datetime64('2023-01-03T00:00:00.000000000'): 0,
# numpy.datetime64('2023-01-04T00:00:00.000000000'): 1,
# numpy.datetime64('2023-01-05T00:00:00.000000000'): 2,...

# This call returns a value.
quote.get_data('000610.SZ', pd.to_datetime('2023-01-04'), pd.to_datetime('2023-01-04'), "$close")

from qlib.data.data import Cal
_calendar = Cal.calendar(freq='day', future=True)
print(_calendar[-181]) # Timestamp('2023-01-04 00:00:00')

# This call returns None.
stime = _calendar[-181]
quote.get_data('000610.SZ', stime, stime, "$close")

Expected Behavior

First, the finest frequency qlib supports is the minute level, so whether pd.Timestamp carries nanosecond information should not matter. However, NumpyQuote retains nanosecond information in its index when it is constructed, which makes the index inconsistent with the times generated by the calendar. Ideally, both sides would use a single, unified method for time conversion when handling pd.Timestamp.
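As a rough illustration of what a unified conversion could look like, the sketch below normalizes every lookup key to a nanosecond-resolution numpy.datetime64 before it is used against an index map. The helper name to_ns_key and the toy index_map are hypothetical and are not part of qlib; this only shows the idea and is not how NumpyQuote is implemented.

```python
import numpy as np
import pandas as pd


def to_ns_key(t):
    """Hypothetical helper: convert a timestamp-like value to numpy.datetime64[ns]."""
    return pd.Timestamp(t).to_numpy().astype("datetime64[ns]")


# A toy index map keyed by nanosecond-resolution datetime64 values,
# mimicking the shape of the index_map printed in the report (not qlib's own object).
days = np.array(["2023-01-03", "2023-01-04", "2023-01-05"], dtype="datetime64[ns]")
index_map = {day: pos for pos, day in enumerate(days)}

# Every input form is normalized the same way before the dictionary lookup.
from_string = index_map.get(to_ns_key("2023-01-04"))
from_to_datetime = index_map.get(to_ns_key(pd.to_datetime("2023-01-04")))
from_timestamp = index_map.get(to_ns_key(pd.Timestamp("2023-01-04 00:00:00")))
print(from_string, from_to_datetime, from_timestamp)
```

Because both the stored keys and the lookup keys pass through the same conversion, the lookup does not depend on how the caller constructed its timestamp.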

Environment

Note: User could run cd scripts && python collect_info.py all under project directory to get system information and paste them here directly.

Additional Notes

qew21 commented 1 week ago

Cal.calendar(freq='day', future=True) returns a List[pd.Timestamp], which matches the output of pd.to_datetime(), since that also produces a pd.Timestamp. Consequently, there is no discrepancy between the following two code snippets. In my own testing, both methods retrieve identical values:

# Method 1: Utilizing pd.to_datetime for date conversion
data1 = quote.get_data('000610.SZ', pd.to_datetime('2023-01-04'), pd.to_datetime('2023-01-04'), "$close")

# Method 2: Leveraging the calendar list for date specification
data2 = quote.get_data('000610.SZ', _calendar[-181], _calendar[-181], "$close")

numpy.datetime64, as used within NumpyQuote, is the direct output of calling pd.Timestamp.to_numpy(). Hence, the current time formats are consistent.
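A minimal sketch for checking this on one's own installation, without any qlib data, is to compare the two kinds of timestamp directly. The describe helper is hypothetical, and the second value is built by hand to match the calendar entry printed in the report rather than read from Cal.calendar; the printed dtype may differ between pandas versions, so no output is shown.

```python
import pandas as pd


def describe(ts):
    """Hypothetical helper: Python type name, numpy dtype, and nanosecond epoch value."""
    as_numpy = ts.to_numpy()
    return type(ts).__name__, str(as_numpy.dtype), ts.value


a = pd.to_datetime("2023-01-04")
b = pd.Timestamp("2023-01-04 00:00:00")  # same form as the calendar entry in the report

info_a = describe(a)
info_b = describe(b)
same_instant = a == b
same_numpy_value = bool(a.to_numpy() == b.to_numpy())
print(info_a, info_b, same_instant, same_numpy_value)
```

If the two descriptions differ in dtype on a given setup, that would be the place to look for a mismatch in dictionary-style lookups.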