PaleNeutron opened 1 year ago
BTW, the current PIT model does not seem to support adjusted data points in history. For example, suppose we have data like this:
```python
[
    (20120411, 200904, 0.403925, 4294967295),
    ...
    (20111018, 201103, 0.318919, 4294967295),
    (20120323, 201104, 0.4039,   420),
    (20120411, 201004, 0.403925, 4294967295),  # history data adjusted by the company
]
```
If `cur_time` is 20120411 and `start_index` is -1 (which may be created by `P(Ref($$roewa_q, 1))`), then `period_list` will be `[201004, 201003]`. This leads to serious, hard-to-debug bugs.
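The failure mode can be reproduced in a few lines of plain Python. This is a minimal sketch: the three-field record layout and the "take the last visible record and step back one period" logic are my assumptions about the lookup, not qlib's exact code.

```python
# Minimal reproduction of the suspected lookup bug (assumed logic, not
# qlib's actual implementation): records are (date, period, value).
records = [
    (20120411, 200904, 0.403925),
    (20111018, 201103, 0.318919),
    (20120323, 201104, 0.4039),
    (20120411, 201004, 0.403925),  # adjustment appended out of period order
]

cur_time = 20120411
visible = [r for r in records if r[0] <= cur_time]

# Naive "start_index = -1" logic: take the last visible record's period
# and step back one period.
last_period = visible[-1][1]            # 201004 -- the adjustment row
period_list = [last_period, last_period - 1]
print(period_list)  # [201004, 201003], silently skipping 201103 and 201104
```

Because the adjustment row sits at the end of the file, the "latest" record is an old period, and the genuinely newest periods are never visited.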
🌟 Feature Description
The current PIT implementation has many performance traps that should be fixed.
Motivation
A PIT feature is about 100 times slower than a normal feature, which is absurd. Financial PIT data usually has only four points per year, so it should be roughly 50 times faster than a normal feature, not slower.
While reviewing the PIT code, I found the following problems:
https://github.com/microsoft/qlib/blob/ecbeeafdc141ed89d5daf37ddfa20717190dfdb1/qlib/data/pit.py#L23-L48
In line 28 we loop over every step of the time series, and each call to `_load_feature` (line 39) re-reads the whole data file and index file. That makes it roughly 1000 times slower over 1000 trading days; the data file and index file should be read only once per feature.
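The read-once pattern could look like the sketch below. The binary layout `(date, period, value, next)` and the helper names are illustrative assumptions, not qlib's actual API.

```python
import numpy as np

# Assumed record layout: (date, period, value, next_offset), little-endian.
PIT_DTYPE = np.dtype([("date", "<u4"), ("period", "<u4"),
                      ("value", "<f8"), ("next", "<u4")])

def load_pit_array(path):
    # one disk read per feature, instead of one per time step
    return np.fromfile(path, dtype=PIT_DTYPE)

def visible_records(data, cur_time):
    # per-step work becomes a cheap in-memory filter, with no extra I/O
    return data[data["date"] <= cur_time]
```

The per-step loop then operates on the cached array, so the number of file reads is independent of the number of trading days.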
`_load_feature` is actually implemented here: https://github.com/microsoft/qlib/blob/ecbeeafdc141ed89d5daf37ddfa20717190dfdb1/qlib/data/data.py#L787-L794
Here we read the whole data file, but we pass `data_path` to the nested function instead of the data object! https://github.com/microsoft/qlib/blob/ecbeeafdc141ed89d5daf37ddfa20717190dfdb1/qlib/data/data.py#L813-L817 This causes another 2x slowdown.
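The difference can be sketched with a flat `float64` file standing in for the real format (function names are hypothetical):

```python
import numpy as np

def load_feature_slow(path, steps):
    # anti-pattern: the nested lookup receives the *path* and
    # re-reads the whole file on every step
    return [np.fromfile(path, dtype="<f8")[s] for s in steps]

def load_feature_fast(path, steps):
    # read once, then hand the in-memory array to the inner lookup
    data = np.fromfile(path, dtype="<f8")
    return [data[s] for s in steps]
```

Both return the same values; the fast version just avoids repeating the I/O and parsing work that the outer function has already done.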
Line 150 reads the file inside a Python loop:
https://github.com/microsoft/qlib/blob/ecbeeafdc141ed89d5daf37ddfa20717190dfdb1/qlib/utils/__init__.py#L147-L156
This might be acceptable in C, but not in Python. Python loops are very slow, and even in C, processing file content without a stream buffer is not recommended.
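For illustration, assuming the index file is just a run of little-endian `uint32` values (a guess at the layout, not qlib's documented format), the per-record `f.read(4)` loop can be replaced by a single buffered read decoded with NumPy:

```python
import struct
import numpy as np

def read_index_loop(path):
    # anti-pattern: one small read() and one struct.unpack per record
    out = []
    with open(path, "rb") as f:
        chunk = f.read(4)
        while chunk:
            out.append(struct.unpack("<I", chunk)[0])
            chunk = f.read(4)
    return out

def read_index_vectorized(path):
    # one buffered read of the whole file, one vectorized decode
    return np.fromfile(path, dtype="<u4").tolist()
```

Both produce identical results; the vectorized version moves the loop out of the Python interpreter entirely.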
Alternatives
Keep using the current slow implementation.
Additional Notes
I'll try to re-implement the PIT workflow.