microsoft / qlib

Qlib is an AI-oriented quantitative investment platform that aims to realize the potential, empower research, and create value using AI technologies in quantitative investment, from exploring ideas to implementing productions. Qlib supports diverse machine learning modeling paradigms, including supervised learning, market dynamics modeling, and RL.
https://qlib.readthedocs.io/en/latest/
MIT License
15.38k stars 2.63k forks

Improve PIT performance #1671

Open PaleNeutron opened 1 year ago

PaleNeutron commented 1 year ago

🌟 Feature Description

The current PIT implementation has several performance traps that should be fixed.

Motivation

A PIT feature is about 100 times slower than a normal feature, which is ridiculous. Financial PIT data usually has four points per year, so it should be 50 times faster than a normal feature.

While reviewing the PIT code, I found the following problems:

  1. PIT part

https://github.com/microsoft/qlib/blob/ecbeeafdc141ed89d5daf37ddfa20717190dfdb1/qlib/data/pit.py#L23-L48

In line 28, we loop over every step in the time series, and each call to _load_feature in line 39 reads the whole data file and index file.

This makes it about 1000 times slower for 1000 trading days. The data file and index file should be read only once per feature.
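A rough sketch of the "read once, query many times" idea, not qlib's actual code. The record layout and the field names `date`, `period` and `value` are assumptions for illustration, and the lookup is simplified to "value of the most recently announced record", ignoring periods and revisions. The point is only that the per-day work becomes one vectorized `np.searchsorted` over an array that is already in memory.

```python
import numpy as np

# Assumed record layout (hypothetical field names): announcement date,
# fiscal period, value.
PIT_DTYPE = np.dtype([("date", "<u4"), ("period", "<u4"), ("value", "<f8")])


def latest_value_per_day(records: np.ndarray, trade_days: np.ndarray) -> np.ndarray:
    """For each trade day, return the value of the most recently announced record.

    `records` must be sorted by `date`. The caller loads it once and reuses
    it for every trade day, so no file I/O happens inside this function.
    """
    # Index of the last record with date <= trade day, for all days at once.
    idx = np.searchsorted(records["date"], trade_days, side="right") - 1
    out = np.full(len(trade_days), np.nan)
    seen = idx >= 0  # days before the first announcement stay NaN
    out[seen] = records["value"][idx[seen]]
    return out


# Made-up sample data, standing in for one data file loaded a single time.
records = np.array(
    [(20110420, 201101, 0.10), (20110815, 201102, 0.20), (20111018, 201103, 0.30)],
    dtype=PIT_DTYPE,
)
trade_days = np.array([20110101, 20110420, 20110901, 20111231], dtype=np.uint32)
values = latest_value_per_day(records, trade_days)
```

With this shape, the cost of reading the file no longer scales with the number of trade days.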

  2. LocalPITProvider part

_load_feature is actually implemented here.

https://github.com/microsoft/qlib/blob/ecbeeafdc141ed89d5daf37ddfa20717190dfdb1/qlib/data/data.py#L787-L794

Here, we read the whole data file, but we pass data_path to the nested function instead of the data object!

https://github.com/microsoft/qlib/blob/ecbeeafdc141ed89d5daf37ddfa20717190dfdb1/qlib/data/data.py#L813-L817

This makes it another 2 times slower.

  3. read_period_data part

Line 150 reads the file inside a Python loop:

https://github.com/microsoft/qlib/blob/ecbeeafdc141ed89d5daf37ddfa20717190dfdb1/qlib/utils/__init__.py#L147-L156

OK, this may be acceptable in C, but not in Python. Python loops are very slow, and even in C, processing file content without a stream buffer is not recommended.
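To illustrate the difference, here is a minimal sketch that contrasts per-record reads in a Python loop with a single bulk read. It is not the code in `qlib/utils/__init__.py`. The packed little-endian layout (uint32 date, uint32 period, float64 value, uint32 next pointer) and the field names are assumptions inferred from the sample records in this thread.

```python
import os
import struct
import tempfile

import numpy as np

# Assumed on-disk layout: little-endian (date: uint32, period: uint32,
# value: float64, next pointer: uint32), packed without padding (20 bytes).
RECORD_DTYPE = np.dtype(
    [("date", "<u4"), ("period", "<u4"), ("value", "<f8"), ("_next", "<u4")]
)
RECORD_STRUCT = struct.Struct("<IIdI")


def read_records_loop(path):
    """Slow pattern: one f.read() and one unpack per record, in a Python loop."""
    rows = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(RECORD_STRUCT.size)
            if len(chunk) < RECORD_STRUCT.size:
                break
            rows.append(RECORD_STRUCT.unpack(chunk))
    return rows


def read_records_bulk(path):
    """Read the whole file once and let NumPy decode every record."""
    return np.fromfile(path, dtype=RECORD_DTYPE)


# Write two sample records to a temporary file, then read them both ways.
rows = [(20111018, 201103, 0.318919, 4294967295), (20120323, 201104, 0.4039, 420)]
with tempfile.NamedTemporaryFile(delete=False, suffix=".data") as tmp:
    for row in rows:
        tmp.write(RECORD_STRUCT.pack(*row))
    path = tmp.name

loop = read_records_loop(path)
bulk = read_records_bulk(path)
os.remove(path)
```

Both functions return the same records. The bulk version does the decoding in compiled code, and the resulting array can be filtered by `period` or `date` without touching the file again.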

Alternatives

Keep using the current slow implementation.

Additional Notes

I'll try to re-implement the PIT workflow.

PaleNeutron commented 1 year ago

BTW, the current PIT model does not seem to support adjusted points in history. For example:

We have data like this:

[
     (20120411, 200904, 0.403925  , 4294967295),
     ...
     (20111018, 201103, 0.318919  , 4294967295),
     (20120323, 201104, 0.4039    ,        420),
     (20120411, 201004, 0.403925  , 4294967295),  # adjust history data by company
]

https://github.com/microsoft/qlib/blob/ecbeeafdc141ed89d5daf37ddfa20717190dfdb1/qlib/data/data.py#L797-L821

If cur_time is 20120411 and start_index is -1, which may be created by P(Ref($$roewa_q, 1)), period_list will be [201004, 201003].

This will lead to serious bugs that are difficult to debug.
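A minimal sketch of the failure mode described above, under stated assumptions. It does not reproduce qlib's code. The helper names and the simplified record layout are hypothetical, and choosing the largest visible period is only one possible way to avoid the problem. Picking the "latest" period from the last visible record breaks once a company revises an older period.

```python
import numpy as np

# Hypothetical simplified layout: announcement date, fiscal period, value.
PIT_DTYPE = np.dtype([("date", "<u4"), ("period", "<u4"), ("value", "<f8")])


def prev_quarter(period: int) -> int:
    """Return the quarter before `period`, where period is encoded as YYYYQQ."""
    year, quarter = divmod(period, 100)
    return period - 1 if quarter > 1 else (year - 1) * 100 + 4


def period_list_from_last_record(records, cur_time):
    """Treat the last record visible at cur_time as the latest period."""
    visible = records[records["date"] <= cur_time]
    latest = int(visible["period"][-1])
    return [latest, prev_quarter(latest)]


def period_list_from_max_period(records, cur_time):
    """Treat the largest period visible at cur_time as the latest period."""
    visible = records[records["date"] <= cur_time]
    latest = int(visible["period"].max())
    return [latest, prev_quarter(latest)]


# Last three records from the example above, in file order.
records = np.array(
    [
        (20111018, 201103, 0.318919),
        (20120323, 201104, 0.4039),
        (20120411, 201004, 0.403925),  # revision of an older period
    ],
    dtype=PIT_DTYPE,
)
by_last = period_list_from_last_record(records, 20120411)
by_max = period_list_from_max_period(records, 20120411)
```

Here `by_last` is `[201004, 201003]`, the list reported above, while `by_max` is `[201104, 201103]`. Whether the second list is the intended result for P(Ref($$roewa_q, 1)) is my assumption.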