qianyun210603 / qlib

Qlib is an AI-oriented quantitative investment platform, which aims to realize the potential, empower the research, and create the value of AI technologies in quantitative investment. With Qlib, you can easily try your ideas to create better Quant investment strategies. An increasing number of SOTA Quant research works/papers are released in Qlib.
https://qlib.readthedocs.io/en/latest/
MIT License
14 stars · 2 forks

where are Cross Sectional Factors? #1

Closed quant2008 closed 11 months ago

quant2008 commented 1 year ago

Hello, could you point out where your cross-sectional factors are located?

quant2008 commented 1 year ago

I see that they are in ops.py. But when I use alpha101, I get errors:

(qlib230510) G:\qlibtutor>E:/anaconda3/envs/qlib230510/python.exe g:/qlibtutor/advance/benchmarks_dynamic/baseline/my_rolling_benchmark.py
[23644:MainThread](2023-09-04 19:00:33,996) INFO - qlib.Initialization - [config.py:416] - default_conf: client.
[23644:MainThread](2023-09-04 19:00:33,998) INFO - qlib.Initialization - [__init__.py:74] - qlib successfully initialized based on client settings.
[23644:MainThread](2023-09-04 19:00:33,999) INFO - qlib.Initialization - [__init__.py:76] - data_path={'DEFAULT_FREQ': WindowsPath('G:/qlibtutor/qlib_data/rq_cn_data')}
my_conf_path G:\qlibtutor\advance\benchmarks_dynamic\baseline\my_workflow_config_linear_Alpha158.yaml
[23644:MainThread](2023-09-04 19:00:34,009) INFO - qlib.Rolling - [base.py:164] - The prediction horizon is overrided
[23644:MainThread](2023-09-04 19:00:37,473) ERROR - qlib.workflow - [utils.py:41] - An exception has been raised[KeyError: 'Unknown memcache unit'].
  File "g:/qlibtutor/advance/benchmarks_dynamic/baseline/my_rolling_benchmark.py", line 47, in <module>
    RollingBenchmark(rtype="expanding", step=20).run()  # "sliding", expanding
  File "E:\anaconda3\envs\qlib230510\lib\site-packages\qlib\contrib\rolling\base.py", line 243, in run
    self._train_rolling_tasks()
  File "E:\anaconda3\envs\qlib230510\lib\site-packages\qlib\contrib\rolling\base.py", line 191, in _train_rolling_tasks
    task_l = self.get_task_list()
  File "E:\anaconda3\envs\qlib230510\lib\site-packages\qlib\contrib\rolling\base.py", line 180, in get_task_list
    task = self.basic_task(enable_handler_cache=True)
  File "E:\anaconda3\envs\qlib230510\lib\site-packages\qlib\contrib\rolling\base.py", line 170, in basic_task
    task = self._replace_hanler_with_cache(task)
  File "E:\anaconda3\envs\qlib230510\lib\site-packages\qlib\contrib\rolling\base.py", line 133, in _replace_hanler_with_cache
    task = replace_task_handler_with_cache(task, self.conf_path.parent)
  File "E:\anaconda3\envs\qlib230510\lib\site-packages\qlib\workflow\task\utils.py", line 309, in replace_task_handler_with_cache
    h = init_instance_by_config(handler)
  File "E:\anaconda3\envs\qlib230510\lib\site-packages\qlib\utils\mod.py", line 174, in init_instance_by_config
    return klass(**cls_kwargs, **try_kwargs, **kwargs)
  File "E:\anaconda3\envs\qlib230510\lib\site-packages\qlib\contrib\data\handler.py", line 839, in __init__
    super().__init__(
  File "E:\anaconda3\envs\qlib230510\lib\site-packages\qlib\data\dataset\handler.py", line 468, in __init__
    super().__init__(instruments, start_time, end_time, data_loader, **kwargs)
  File "E:\anaconda3\envs\qlib230510\lib\site-packages\qlib\data\dataset\handler.py", line 100, in __init__
    self.setup_data()
  File "E:\anaconda3\envs\qlib230510\lib\site-packages\qlib\data\dataset\handler.py", line 610, in setup_data
    super().setup_data(**kwargs)
  File "E:\anaconda3\envs\qlib230510\lib\site-packages\qlib\data\dataset\handler.py", line 144, in setup_data
    self._data = lazy_sort_index(self.data_loader.load(self.instruments, self.start_time, self.end_time))
  File "E:\anaconda3\envs\qlib230510\lib\site-packages\qlib\data\dataset\loader.py", line 135, in load
    {
  File "E:\anaconda3\envs\qlib230510\lib\site-packages\qlib\data\dataset\loader.py", line 136, in <dictcomp>
    grp: self.load_group_df(instruments, exprs, names, start_time, end_time, grp)
  File "E:\anaconda3\envs\qlib230510\lib\site-packages\qlib\data\dataset\loader.py", line 217, in load_group_df
    df = D.features(instruments, exprs, start_time, end_time, freq=freq, inst_processors=inst_processors)
  File "E:\anaconda3\envs\qlib230510\lib\site-packages\qlib\data\data.py", line 1191, in features
    return DatasetD.dataset(instruments, fields, start_time, end_time, freq, inst_processors=inst_processors)
  File "E:\anaconda3\envs\qlib230510\lib\site-packages\qlib\data\data.py", line 924, in dataset
    data = self.dataset_processor(
  File "E:\anaconda3\envs\qlib230510\lib\site-packages\qlib\data\data.py", line 578, in dataset_processor
    ParallelExt(n_jobs=workers, backend=C.joblib_backend, maxtasksperchild=C.maxtasksperchild)(task_l),
  File "E:\anaconda3\envs\qlib230510\lib\site-packages\joblib\parallel.py", line 1098, in __call__
    self.retrieve()
  File "E:\anaconda3\envs\qlib230510\lib\site-packages\joblib\parallel.py", line 975, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "E:\anaconda3\envs\qlib230510\lib\multiprocessing\pool.py", line 771, in get
    raise self._value
KeyError: 'Unknown memcache unit'

qianyun210603 commented 1 year ago

I haven't used those cross-sectional operators for a while. Here are some hints if you wish to debug:

  1. First, do not use rolling; try a clean, single workflow run to see whether the issue still occurs. Rolling engages many cache mechanisms which I didn't test the XSection ops against (this rolling feature was actually added by the MS team after I developed the XSection ops).
  2. The KeyError: 'Unknown memcache unit' is raised by MemCache.__getitem__ in qlib/data/cache.py:
    def __getitem__(self, key):
        if key == "c":
            return self.__calendar_mem_cache
        elif key == "i":
            return self.__instrument_mem_cache
        elif key == "f":
            return self.__feature_mem_cache
        elif key == "fs":
            return self.__feature_share_mem_cache
        else:
            raise KeyError(f"Unknown memcache unit {key}")

I'd suggest putting a breakpoint there and seeing which key causes this error. Typically, if you haven't modified the code, it shouldn't ask for a non-existent cache type. Even if the shared memory cache fs is not properly initialised (in qlib/data/data.py), it should raise a NoneType-not-subscriptable error rather than a KeyError. So I'm curious what key you'll see at the breakpoint.
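A minimal sketch of the two failure modes discussed above; the class and attribute names here are illustrative stand-ins, not qlib's actual MemCache internals:

```python
# Sketch only: a cache with a branch per unit, where the shared
# cross-section slot ("fs") may be left uninitialised.
class MemCacheSketch:
    def __init__(self, init_fs=True):
        self._calendar = {}
        self._feature_share = {} if init_fs else None  # uninitialised on the official repo

    def __getitem__(self, key):
        if key == "c":
            return self._calendar
        elif key == "fs":
            return self._feature_share
        raise KeyError(f"Unknown memcache unit {key}")

H = MemCacheSketch(init_fs=False)
# H["x"]         -> KeyError: no branch exists for "x" at all
# "k" in H["fs"] -> TypeError: argument of type 'NoneType' is not iterable
```

So a KeyError for "fs" means the branch itself is missing (as in the official MemCache), whereas an fs branch that returns an uninitialised value surfaces as a TypeError instead.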

quant2008 commented 11 months ago

[image] Hello, the above is my code. Could you check whether my CSRank invocation is written correctly? When the error occurred, I printed the key and its value is fs. I then found that the new version of qlib no longer has the fs branch (see below); the new MemCache differs from your old version, and I don't know what to change in order to use your CSRank: [image]

quant2008 commented 11 months ago

By the way, I was running on the latest MS qlib with your CSRank copied into it, which may not be appropriate. I probably need to run directly on your qlib instead; I'll try that later.

quant2008 commented 11 months ago

This time I installed your qlib and ran the code above, but got the following error. Can it be resolved? [image]

qianyun210603 commented 11 months ago

First, make sure you are using the latest main branch of my fork; if it still fails, send me your configuration. Are you using the official Yahoo data?

qianyun210603 commented 11 months ago

fs has always been there in my fork; it was added precisely for the cross-section ops. The official repo has never had it.

https://github.com/qianyun210603/qlib/blob/9879fff33154db0093c6255bb484b6f2621efcd6/qlib/data/cache.py#L201-L212

quant2008 commented 11 months ago

Yes, I'm using your main branch. The code I ran is as follows:

import qlib
from qlib.data.dataset.loader import QlibDataLoader

if __name__ == "__main__":
    qlib.init(provider_uri=r"G:\qlibrolling\qlib_data\cn_data", region="cn")

    fields = ['CSRank($close)', 'Abs($close)']
    names = ['CSRank', 'close']

    labels = ['Ref($close, -2)/Ref($close, -1) - 1']  # label
    label_names = ['LABEL']

    data_loader_config = {
        "feature": (fields, names),
        "label": (labels, label_names),
    }

    data_loader = QlibDataLoader(config=data_loader_config)
    df = data_loader.load(instruments='csi300', start_time='2017-01-01', end_time='2017-12-31')

    print(df)

quant2008 commented 11 months ago

The data is the qlib-provided dataset. If I remove the CSRank field from fields, the code above produces correct output.
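For context, the cross-sectional semantics an operator like CSRank implements can be illustrated with plain pandas on a (datetime, instrument) MultiIndex; this is only an illustration of the expected output, not qlib's implementation:

```python
import pandas as pd

# Toy panel: 2 trading days x 3 instruments (values are made up).
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2017-01-03", "2017-01-04"]),
     ["SH600000", "SH600016", "SZ000001"]],
    names=["datetime", "instrument"],
)
close = pd.Series([10.0, 12.0, 11.0, 10.5, 12.5, 11.5], index=idx, name="close")

# Rank each instrument's close within its own trading day (the cross-section),
# expressed as a percentile in (0, 1].
cs_rank = close.groupby(level="datetime").rank(pct=True)
# 2017-01-03: SH600000 -> 1/3, SH600016 -> 1.0, SZ000001 -> 2/3
```

The key point is that the operator needs the whole cross-section for a date, which is exactly what qlib's per-instrument parallelism does not naturally provide.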

qianyun210603 commented 11 months ago

I looked into it; this should be an OS issue. I develop on Linux, where Python class variables can be passed across processes, but on Windows they cannot.

Because qlib's native computation parallelizes over instruments, computing cross-sectional factors requires locks to synchronize the data of the individual instruments. My design gives every cross-sectional factor its own lock, so different cross-sectional factors don't interfere with each other. But that requires putting the locks in a dict and passing it to each worker process, which Windows simply cannot do, whether via function arguments or as class variables. I still need to research possible solutions.
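The cross-process constraint described above can be observed directly: a multiprocessing RLock refuses to be pickled outside of process inheritance, which is why a dict of per-factor locks cannot be shipped to workers as a task argument.

```python
import pickle
import multiprocessing as mp

lock = mp.RLock()
try:
    # Sending the lock through pickle (as joblib would for a task argument)
    # is rejected; such locks may only reach children via inheritance.
    pickle.dumps(lock)
except RuntimeError as exc:
    print(type(exc).__name__, exc)
```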

If you really want to try it on Windows, you can change the lock into a global one, i.e. all cross-sectional factors share a single lock, though it will certainly be slower. If you want to make that change, look mainly at these two places: 1. https://github.com/qianyun210603/qlib/blob/741c3f78f6f42592ed3cd4a6feebfeb205a62d53/qlib/data/cache.py#L148-L158 2. https://github.com/qianyun210603/qlib/blob/741c3f78f6f42592ed3cd4a6feebfeb205a62d53/qlib/data/ops.py#L2044-L2089

Change locks from a dict of RLocks into a single RLock, then remove the indexing.
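A sketch of that change with illustrative names (not the actual qlib identifiers): instead of indexing into a dict of per-factor locks, every cross-sectional op acquires one shared lock.

```python
from multiprocessing import Manager

manager = Manager()

# Before (per-factor, works on Linux via fork):
#   locks = {name: manager.RLock() for name in cs_factor_names}
#   with locks[factor_name]: ...
# After (Windows-compatible, but serializes all factors on one lock):
global_lock = manager.RLock()

def load_cross_section(factor_name, compute):
    # A Manager proxy is picklable, so it survives the trip to worker
    # processes; the price is that all factors now contend for one lock.
    with global_lock:
        return compute(factor_name)
```

Using a Manager proxy here is one way to make the shared lock reachable from spawned workers; the trade-off is exactly the slowdown mentioned above.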

That said, even after this change you will hit a new error complaining about a missing key for SH600074; that is purely because this instrument's data is missing from the raw dataset.

quant2008 commented 11 months ago

I see. Thank you. It looks like qlib's cross-sectional factors do have problems.

timerobin commented 3 months ago

@qianyun210603 Hello, I ran into the same problem when adding CSRank to qlib: KeyError: 'Unknown memcache unit', and then I added it.

The current error is shown below. I tried factor #101 without CSRank and it runs, but the version with CSRank errors out. My environment is Linux.

---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 

Traceback (most recent call last):
  File "/home/hyx/code/qlib/qlib/data/data.py", line 1186, in features
    return DatasetD.dataset(
TypeError: dataset() got multiple values for argument 'inst_processors'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/hyx/bash/envs/qlib3/lib/python3.8/site-packages/joblib/_utils.py", line 72, in __call__
    return self.func(**kwargs)
  File "/home/hyx/bash/envs/qlib3/lib/python3.8/site-packages/joblib/parallel.py", line 598, in __call__
    return [func(*args, **kwargs)
  File "/home/hyx/bash/envs/qlib3/lib/python3.8/site-packages/joblib/parallel.py", line 598, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/hyx/code/qlib/qlib/data/data.py", line 615, in inst_calculator
    obj[field] = ExpressionD.expression(inst, field, start_time, end_time, freq)
  File "/home/hyx/code/qlib/qlib/data/data.py", line 859, in expression
    series = expression.load(instrument, query_start, query_end, freq)
  File "/home/hyx/code/qlib/qlib/data/base.py", line 193, in load
    series = self._load_internal(instrument, start_index, end_index, *args)
  File "/home/hyx/code/qlib/qlib/data/ops.py", line 306, in _load_internal
    series_left = self.feature_left.load(instrument, start_index, end_index, *args)
  File "/home/hyx/code/qlib/qlib/data/base.py", line 193, in load
    series = self._load_internal(instrument, start_index, end_index, *args)
  File "/home/hyx/code/qlib/qlib/data/ops.py", line 1542, in _load_internal
    if cache_key not in H["fs"]:
TypeError: argument of type 'NoneType' is not iterable
"""

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
Cell In[1], line 215
    210 data_loader_config = {
    211     "feature": (fields, names),
    212     "label": (labels, label_names)
    213 }
    214 data_loader = QlibDataLoader(config=data_loader_config)
--> 215 df_feature = data_loader.load(instruments=market, start_time=start_time, end_time=end_time)
    218 # 处理器配置
    219 _DEFAULT_LEARN_PROCESSORS_riskfree = [
    220     {"class": "CSZScoreNorm", "kwargs": {"fields_group": "feature"}},
    221     {"class": "CSZScoreNorm", "kwargs": {"fields_group": "label"}},
   (...)
    224     {"class": "DropnaProcessor", "kwargs": {"fields_group": "feature"}},
    225 ]

File ~/code/qlib/qlib/data/dataset/loader.py:141, in DLWParser.load(self, instruments, start_time, end_time)
    138 def load(self, instruments=None, start_time=None, end_time=None) -> pd.DataFrame:
    139     if self.is_group:
    140         df = pd.concat(
--> 141             {
    142                 grp: self.load_group_df(instruments, exprs, names, start_time, end_time, grp)
    143                 for grp, (exprs, names) in self.fields.items()
    144             },
    145             axis=1,
    146         )
    147     else:
    148         exprs, names = self.fields

File ~/code/qlib/qlib/data/dataset/loader.py:142, in <dictcomp>(.0)
    138 def load(self, instruments=None, start_time=None, end_time=None) -> pd.DataFrame:
    139     if self.is_group:
    140         df = pd.concat(
    141             {
--> 142                 grp: self.load_group_df(instruments, exprs, names, start_time, end_time, grp)
    143                 for grp, (exprs, names) in self.fields.items()
    144             },
    145             axis=1,
    146         )
    147     else:
    148         exprs, names = self.fields

File ~/code/qlib/qlib/data/dataset/loader.py:223, in QlibDataLoader.load_group_df(self, instruments, exprs, names, start_time, end_time, gp_name)
    219 freq = self.freq[gp_name] if isinstance(self.freq, dict) else self.freq
    220 inst_processors = (
    221     self.inst_processors if isinstance(self.inst_processors, list) else self.inst_processors.get(gp_name, [])
    222 )
--> 223 df = D.features(instruments, exprs, start_time, end_time, freq=freq, inst_processors=inst_processors)
    224 df.columns = names
    225 if self.swap_level:

File ~/code/qlib/qlib/data/data.py:1190, in BaseProvider.features(self, instruments, fields, start_time, end_time, freq, disk_cache, inst_processors)
   1186     return DatasetD.dataset(
   1187         instruments, fields, start_time, end_time, freq, disk_cache, inst_processors=inst_processors
   1188     )
   1189 except TypeError:
-> 1190     return DatasetD.dataset(instruments, fields, start_time, end_time, freq, inst_processors=inst_processors)

File ~/code/qlib/qlib/data/data.py:923, in LocalDatasetProvider.dataset(self, instruments, fields, start_time, end_time, freq, inst_processors)
    921     start_time = cal[0]
    922     end_time = cal[-1]
--> 923 data = self.dataset_processor(
    924     instruments_d, column_names, start_time, end_time, freq, inst_processors=inst_processors
    925 )
    927 return data

File ~/code/qlib/qlib/data/data.py:577, in DatasetProvider.dataset_processor(instruments_d, column_names, start_time, end_time, freq, inst_processors)
    567     inst_l.append(inst)
    568     task_l.append(
    569         delayed(DatasetProvider.inst_calculator)(
    570             inst, start_time, end_time, freq, normalize_column_names, spans, C, inst_processors
    571         )
    572     )
    574 data = dict(
    575     zip(
    576         inst_l,
--> 577         ParallelExt(n_jobs=workers, backend=C.joblib_backend, maxtasksperchild=C.maxtasksperchild)(task_l),
    578     )
    579 )
    581 new_data = dict()
    582 for inst in sorted(data.keys()):

File /home/hyx/bash/envs/qlib3/lib/python3.8/site-packages/joblib/parallel.py:2007, in Parallel.__call__(self, iterable)
   2001 # The first item from the output is blank, but it makes the interpreter
   2002 # progress until it enters the Try/Except block of the generator and
   2003 # reaches the first `yield` statement. This starts the asynchronous
   2004 # dispatch of the tasks to the workers.
   2005 next(output)
-> 2007 return output if self.return_generator else list(output)

File /home/hyx/bash/envs/qlib3/lib/python3.8/site-packages/joblib/parallel.py:1650, in Parallel._get_outputs(self, iterator, pre_dispatch)
   1647     yield
   1649     with self._backend.retrieval_context():
-> 1650         yield from self._retrieve()
   1652 except GeneratorExit:
   1653     # The generator has been garbage collected before being fully
   1654     # consumed. This aborts the remaining tasks if possible and warn
   1655     # the user if necessary.
   1656     self._exception = True

File /home/hyx/bash/envs/qlib3/lib/python3.8/site-packages/joblib/parallel.py:1754, in Parallel._retrieve(self)
   1747 while self._wait_retrieval():
   1748 
   1749     # If the callback thread of a worker has signaled that its task
   1750     # triggered an exception, or if the retrieval loop has raised an
   1751     # exception (e.g. `GeneratorExit`), exit the loop and surface the
   1752     # worker traceback.
   1753     if self._aborting:
-> 1754         self._raise_error_fast()
   1755         break
   1757     # If the next job is not ready for retrieval yet, we just wait for
   1758     # async callbacks to progress.

File /home/hyx/bash/envs/qlib3/lib/python3.8/site-packages/joblib/parallel.py:1789, in Parallel._raise_error_fast(self)
   1785 # If this error job exists, immediately raise the error by
   1786 # calling get_result. This job might not exists if abort has been
   1787 # called directly or if the generator is gc'ed.
   1788 if error_job is not None:
-> 1789     error_job.get_result(self.timeout)

File /home/hyx/bash/envs/qlib3/lib/python3.8/site-packages/joblib/parallel.py:745, in BatchCompletionCallBack.get_result(self, timeout)
    739 backend = self.parallel._backend
    741 if backend.supports_retrieve_callback:
    742     # We assume that the result has already been retrieved by the
    743     # callback thread, and is stored internally. It's just waiting to
    744     # be returned.
--> 745     return self._return_or_raise()
    747 # For other backends, the main thread needs to run the retrieval step.
    748 try:

File /home/hyx/bash/envs/qlib3/lib/python3.8/site-packages/joblib/parallel.py:763, in BatchCompletionCallBack._return_or_raise(self)
    761 try:
    762     if self.status == TASK_ERROR:
--> 763         raise self._result
    764     return self._result
    765 finally:

TypeError: argument of type 'NoneType' is not iterable
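The "got multiple values for argument 'inst_processors'" error at the top of this trace is the typical symptom of the same argument arriving both positionally and as a keyword: if the patched dataset() signature no longer has the disk_cache parameter, the positional disk_cache value lands in the inst_processors slot and collides with the explicit keyword. A minimal reproduction with a hypothetical signature:

```python
# Stand-in for a dataset() signature that lacks a disk_cache parameter.
def dataset(instruments, fields, start_time, end_time, freq, inst_processors=None):
    return inst_processors

# Six positional args make the sixth fill inst_processors, so the
# explicit keyword below collides with it:
try:
    dataset("csi300", [], None, None, "day", 0, inst_processors=[])
except TypeError as exc:
    print(exc)  # dataset() got multiple values for argument 'inst_processors'
```

That is why qlib's BaseProvider.features catches the TypeError and retries without disk_cache, as shown in the frames above.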

My code is as follows:

import qlib
import pandas as pd
import numpy as np
from qlib.constant import REG_US
from qlib.utils import exists_qlib_data, init_instance_by_config
from qlib.workflow import R
from qlib.workflow.record_temp import SignalRecord, PortAnaRecord,SigAnaRecord
from qlib.utils import flatten_dict
import pylab as pl
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from qlib.data.dataset.handler import DataHandlerLP
from qlib.data.dataset.loader import QlibDataLoader
provider_uri = "/home/hyx/qlib_data/us/" 
market = "all"
start_time = '2012-01-01'
end_time = '2022-12-31'

qlib.init(provider_uri=provider_uri, region=REG_US)

f_return = "($close/Ref($close, 1)-1)"
f_adv5 = "Mean($money, 5)"
f_adv10 = "Mean($money, 10)"
f_adv15 = "Mean($money, 15)"
f_adv20 = "Mean($money, 20)"
f_adv30 = "Mean($money, 30)"
f_adv40 = "Mean($money, 40)"
f_adv50 = "Mean($money, 50)"
f_adv60 = "Mean($money, 60)"
f_adv120 = "Mean($money, 120)"
f_adv180 = "Mean($money, 180)"

alpha_components = {
    "alpha001": f"CSRank(IdxMax(Power(If({f_return}<0, Std({f_return}, 20), $close), 2), 5))-0.5",
}

figurefilepath = '/home/hyx/code/qlib/output/FormulaAlpha/'
sharpe_values = {}
alpha_name = 'alpha001'
fields = [alpha_components[alpha_name]]
names = [alpha_name]
labels = ['Ref($close, -11)/Ref($close, -1) - 1'] # label
label_names = ['LABEL']
data_loader_config = {
    "feature": (fields, names),
    "label": (labels, label_names)
}
data_loader = QlibDataLoader(config=data_loader_config)
df_feature = data_loader.load(instruments=market, start_time=start_time, end_time=end_time)

qianyun210603 commented 3 months ago

@timerobin First, make clear what you based your changes on: the official Qlib or the main branch of my fork. I don't work much on factors these days. As I recall, supporting cross-sections required changes in more than one or two files, and the configuration files had to be changed as well.

timerobin commented 3 months ago

@qianyun210603 Hello, I made the changes on top of the official Qlib, adding CSRank, CSScale, and XSectionOperator to qlib.data.ops.py. Which configuration files do you mean?