microsoft / qlib

Qlib is an AI-oriented quantitative investment platform that aims to realize the potential, empower research, and create value using AI technologies in quantitative investment, from exploring ideas to implementing productions. Qlib supports diverse machine learning modeling paradigms, including supervised learning, market dynamics modeling, and RL.
https://qlib.readthedocs.io/en/latest/
MIT License
14.54k stars 2.53k forks

dump_bin DumpDataUpdate mode append data error #1818

Open hbhuyt opened 4 days ago

hbhuyt commented 4 days ago

🐛 Bug Description

At first, I imported the data with dump_bin's DumpDataAll mode, and it worked fine. Part of the imported data is as follows:

    df[df['instrument']=='SH600306']
    Out[35]:
           instrument    datetime    $volume   $factor    $close
    41691    SH600306  2024-04-23  1022018.0  0.281253  0.686257
    41692    SH600306  2024-04-24  1372334.0  0.281253  0.652507
    41693    SH600306  2024-04-25   951008.0  0.281253  0.618756
    41694    SH600306  2024-04-26  1968818.0  0.281253  0.587818
    41695    SH600306  2024-04-29  1532764.0  0.281253  0.559693

But when I append new data with DumpDataUpdate, the appended data is wrong. The original (source) data after 2024-04-29 is as follows:

    dfraw.loc[(dfraw['date']>'2024-04-29'),['instrument','date','close']]
    Out[54]:
          instrument        date     close
    4356    SH600306  2024-05-29  0.098438
    4357    SH600306  2024-05-30  0.092813
    4358    SH600306  2024-05-31  0.101251
    4359    SH600306  2024-06-03  0.092813
    4360    SH600306  2024-06-04  0.095626
    4361    SH600306  2024-06-05  0.092813
    4362    SH600306  2024-06-06  0.092813
    4363    SH600306  2024-06-07  0.095626
    4364    SH600306  2024-06-11  0.090001
    4365    SH600306  2024-06-12  0.090001
    4366    SH600306  2024-06-13  0.087188
    4367    SH600306  2024-06-14  0.081563

Some of the data imported by the update is shown below. The twelve close values that belong to 2024-05-29 through 2024-06-14 in the source appear under 2024-04-30 through 2024-05-20 instead:

    dfnew[dfnew.instrument=='SH600306']
    Out[8]:
           instrument    datetime      $volume   $factor    $close
    10288    SH600306  2024-04-22     363992.0  0.281253  0.722820
    10289    SH600306  2024-04-23    1022018.0  0.281253  0.686257
    10290    SH600306  2024-04-24    1372334.0  0.281253  0.652507
    10291    SH600306  2024-04-25     951008.0  0.281253  0.618756
    10292    SH600306  2024-04-26    1968818.0  0.281253  0.587818
    10293    SH600306  2024-04-29    1532764.0  0.281253  0.559693
    10294    SH600306  2024-04-30  188390272.0  0.281253  0.098438
    10295    SH600306  2024-05-06  117053368.0  0.281253  0.092813
    10296    SH600306  2024-05-07   99965448.0  0.281253  0.101251
    10297    SH600306  2024-05-08   85975896.0  0.281253  0.092813
    10298    SH600306  2024-05-09   46003664.0  0.281253  0.095626
    10299    SH600306  2024-05-10   61825620.0  0.281253  0.092813
    10300    SH600306  2024-05-13   26138518.0  0.281253  0.092813
    10301    SH600306  2024-05-14   19884768.0  0.281253  0.095626
    10302    SH600306  2024-05-15   24197052.0  0.281253  0.090001
    10303    SH600306  2024-05-16   12483558.0  0.281253  0.090001
    10304    SH600306  2024-05-17    9390678.0  0.281253  0.087188
    10305    SH600306  2024-05-20   27141916.0  0.281253  0.081563

I tried debugging dump_bin.py to find the problem. I stepped through to the following code, which may be where the problem is:

    def _data_to_bin(self, df: pd.DataFrame, calendar_list: List[pd.Timestamp], features_dir: Path):
        if df.empty:
            logger.warning(f"{features_dir.name} data is None or empty")
            return
        if not calendar_list:
            logger.warning("calendar_list is empty")
            return
        # align index
        _df = self.data_merge_calendar(df, calendar_list)
        if _df.empty:
            logger.warning(f"{features_dir.name} data is not in calendars")
            return

When the index is aligned, calendar_list does not contain dates such as 2024-05-06, on which SH600306 has no data.
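The shift described above can be reproduced with a small model. This is only a sketch, not the code in dump_bin.py: it assumes that a feature file stores values positionally, so that the i-th value is read back as the i-th calendar date from a start offset, and it uses a shortened hypothetical calendar. The helper names `read_back`, `append_own_dates`, and `append_full_calendar` are made up for illustration; the close values are taken from the tables in this report.

```python
import math

# Hypothetical, shortened trading calendar (ISO date strings sort chronologically).
calendar = ["2024-04-26", "2024-04-29", "2024-04-30", "2024-05-06", "2024-05-29", "2024-05-30"]

# Assumed storage model: values are positional, the i-th value belongs to calendar[start + i].
start = 0
stored = [0.587818, 0.559693]  # closes for 2024-04-26 and 2024-04-29

# New rows for the instrument; it has no rows for 2024-04-30 or 2024-05-06.
new_rows = {"2024-05-29": 0.098438, "2024-05-30": 0.092813}


def read_back(values, calendar, start=0):
    """Map positional values back to calendar dates."""
    return {calendar[start + i]: v for i, v in enumerate(values)}


def append_own_dates(stored, new_rows):
    """Append only the dates present in the new rows, so the gap is dropped."""
    return stored + [new_rows[d] for d in sorted(new_rows)]


def append_full_calendar(stored, new_rows, calendar, start=0):
    """Append one slot per calendar date after the last stored one, NaN where data is missing."""
    pending = calendar[start + len(stored):]
    return stored + [new_rows.get(d, math.nan) for d in pending]


shifted = read_back(append_own_dates(stored, new_rows), calendar, start)
aligned = read_back(append_full_calendar(stored, new_rows, calendar, start), calendar, start)

print(shifted)  # the 2024-05-29 close is read back under 2024-04-30
print(aligned)  # the gap dates hold NaN and the new closes keep their own dates
```

If the model holds, aligning the appended rows against every calendar date after the instrument's last stored date, rather than only against the dates present in the new data, would keep each value on its own date.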