Closed — quant2008 closed this 11 months ago
I see that they are in ops.py. But when I use alpha101, I get errors:
(qlib230510) G:\qlibtutor>E:/anaconda3/envs/qlib230510/python.exe g:/qlibtutor/advance/benchmarks_dynamic/baseline/my_rolling_benchmark.py
[23644:MainThread](2023-09-04 19:00:33,996) INFO - qlib.Initialization - [config.py:416] - default_conf: client.
[23644:MainThread](2023-09-04 19:00:33,998) INFO - qlib.Initialization - [init.py:74] - qlib successfully initialized based on client settings.
[23644:MainThread](2023-09-04 19:00:33,999) INFO - qlib.Initialization - [init.py:76] - data_path={'DEFAULT_FREQ': WindowsPath('G:/qlibtutor/qlib_data/rq_cn_data')}
my_conf_path G:\qlibtutor\advance\benchmarks_dynamic\baseline\my_workflow_config_linear_Alpha158.yaml
[23644:MainThread](2023-09-04 19:00:34,009) INFO - qlib.Rolling - [base.py:164] - The prediction horizon is overrided
[23644:MainThread](2023-09-04 19:00:37,473) ERROR - qlib.workflow - [utils.py:41] - An exception has been raised[KeyError: 'Unknown memcache unit'].
File "g:/qlibtutor/advance/benchmarks_dynamic/baseline/my_rolling_benchmark.py", line 47, in
I haven't used those cross-sectional (cs) operators for a while. Below are some hints in case you wish to debug:
MemCache.__getitem__ in qlib/data/cache.py:
def __getitem__(self, key):
    if key == "c":
        return self.__calendar_mem_cache
    elif key == "i":
        return self.__instrument_mem_cache
    elif key == "f":
        return self.__feature_mem_cache
    elif key == "fs":
        return self._feature_share_mem_cache
    else:
        raise KeyError(f"Unknown memcache unit {key}")
I'd suggest you put a breakpoint there and see what key causes this error. Typically, if you didn't mess up the code, it shouldn't ask for a non-existing cache type. Even if the shared memory cache fs is not properly initialised (in qlib/data/data.py), it should produce some "'NoneType' is not subscriptable" error rather than a KeyError. So I'm curious what key you'll see at the breakpoint.
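The two failure modes can be told apart with a tiny stand-in (MiniMemCache below is a hypothetical sketch, not qlib's class): a cache unit that was never registered raises the KeyError above, while a registered-but-uninitialised fs unit surfaces later as a NoneType error.

```python
# Hypothetical stand-in for MemCache, only to contrast the two errors
# discussed above; not qlib's actual implementation.
class MiniMemCache:
    def __init__(self, units):
        self._units = units  # e.g. {"c": ..., "i": ..., "f": ...}

    def __getitem__(self, key):
        try:
            return self._units[key]
        except KeyError:
            raise KeyError(f"Unknown memcache unit {key}") from None

# Case 1: the "fs" unit does not exist at all -> KeyError.
H = MiniMemCache({"c": {}, "i": {}, "f": {}})
try:
    H["fs"]
except KeyError as e:
    print(e.args[0])  # Unknown memcache unit fs

# Case 2: "fs" exists but was never initialised -> TypeError on lookup.
H2 = MiniMemCache({"c": {}, "i": {}, "f": {}, "fs": None})
try:
    "some_key" in H2["fs"]
except TypeError as e:
    print(e)  # argument of type 'NoneType' is not iterable
```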
Hello, the above is my code; could you check whether my call to CSRank is written correctly? After the error occurred, I printed the key and its value is fs. I then found that the new version of qlib no longer has the fs branch; as shown below, the new MemCache differs from your older version, and I don't know what to change so I can use your CSRank:
By the way, I was running with the latest Microsoft qlib, with your CSRank copied over, which may not be appropriate. Perhaps I need to run with your qlib directly; I'll try that later.
This time I installed your qlib and ran the code above, and got the following error. Can it be resolved?
First, confirm that you're using the latest main branch of my fork. If it still fails, send me your config. Are you using the official Yahoo data?
fs has always been there in my fork; it was added precisely for CrossSection, and the official version never had it.
Yes, I'm using your main. The code I ran is as follows:
import qlib
from qlib.data.dataset.loader import QlibDataLoader

if __name__ == "__main__":
    qlib.init(provider_uri=r"G:\qlibrolling\qlib_data\cn_data", region="cn")
    fields = ['CSRank($close)', 'Abs($close)']
    names = ['CSRank', 'close']
    labels = ['Ref($close, -2)/Ref($close, -1) - 1']  # label
    label_names = ['LABEL']
    data_loader_config = {
        "feature": (fields, names),
        "label": (labels, label_names),
    }
    data_loader = QlibDataLoader(config=data_loader_config)
    df = data_loader.load(instruments='csi300', start_time='2017-01-01', end_time='2017-12-31')
    print(df)
The data is qlib's bundled data. If the CSRank field is removed from fields, the code above outputs correctly.
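For context on what the failing field computes: a CSRank-style operator is a cross-sectional rank, i.e. on each date it ranks a value across all instruments, whereas qlib's built-in operators run per instrument over time (which is why the per-stock parallelism gets in the way). A minimal pandas sketch with toy data, not qlib's implementation:

```python
import pandas as pd

# Toy panel shaped like a qlib loader result: MultiIndex (datetime, instrument).
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2017-01-03", "2017-01-04"]),
     ["SH600000", "SH600016", "SZ000001"]],
    names=["datetime", "instrument"],
)
close = pd.Series([10.0, 12.0, 11.0, 9.0, 13.0, 8.0], index=idx, name="close")

# Cross-sectional percentile rank: group by date, rank across instruments.
cs_rank = close.groupby(level="datetime").rank(pct=True)
print(cs_rank)
```

On 2017-01-03, for example, the closes 10, 12, 11 get percentile ranks 1/3, 1.0, 2/3.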
I took a look; this should be an OS issue. I develop on Linux, where Python class variables can be passed across processes, but on Windows they can't.
Because qlib's native computation is parallelised per stock, computing cross-sectional factors requires locks to synchronise the data of the individual stocks. My design gives each cross-sectional factor its own lock, so different cross-sectional factors don't interfere with each other. But that requires passing the locks to the worker processes in a dict, which Windows simply cannot do, whether as function arguments or as class variables. I'll have to keep looking for a solution.
If you really want to try it on Windows, you can change the lock to a global lock, i.e. all cross-sectional factors share one lock, but it will definitely be slower. If you want to make that change, you mainly need to look at these two places: 1. https://github.com/qianyun210603/qlib/blob/741c3f78f6f42592ed3cd4a6feebfeb205a62d53/qlib/data/cache.py#L148-L158 2. https://github.com/qianyun210603/qlib/blob/741c3f78f6f42592ed3cd4a6feebfeb205a62d53/qlib/data/ops.py#L2044-L2089
Change locks from a dict of RLocks to a single RLock, and remove the indexing.
However, even after this change you'll hit a new error complaining that the key SH600074 is missing; that's purely because this stock's data is missing from the raw data.
I see. Thank you. It looks like cross-sectional factors in qlib are indeed problematic.
@qianyun210603 Hello, I ran into the same problem adding CSRank to qlib: KeyError: 'Unknown memcache unit', so I added it.
The current error is shown below. I tried alpha factor #101 without CSRank and it runs, but the version with CSRank fails. My environment is Linux.
---------------------------------------------------------------------------
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
Traceback (most recent call last):
File "/home/hyx/code/qlib/qlib/data/data.py", line 1186, in features
return DatasetD.dataset(
TypeError: dataset() got multiple values for argument 'inst_processors'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/hyx/bash/envs/qlib3/lib/python3.8/site-packages/joblib/_utils.py", line 72, in __call__
return self.func(**kwargs)
File "/home/hyx/bash/envs/qlib3/lib/python3.8/site-packages/joblib/parallel.py", line 598, in __call__
return [func(*args, **kwargs)
File "/home/hyx/bash/envs/qlib3/lib/python3.8/site-packages/joblib/parallel.py", line 598, in <listcomp>
return [func(*args, **kwargs)
File "/home/hyx/code/qlib/qlib/data/data.py", line 615, in inst_calculator
obj[field] = ExpressionD.expression(inst, field, start_time, end_time, freq)
File "/home/hyx/code/qlib/qlib/data/data.py", line 859, in expression
series = expression.load(instrument, query_start, query_end, freq)
File "/home/hyx/code/qlib/qlib/data/base.py", line 193, in load
series = self._load_internal(instrument, start_index, end_index, *args)
File "/home/hyx/code/qlib/qlib/data/ops.py", line 306, in _load_internal
series_left = self.feature_left.load(instrument, start_index, end_index, *args)
File "/home/hyx/code/qlib/qlib/data/base.py", line 193, in load
series = self._load_internal(instrument, start_index, end_index, *args)
File "/home/hyx/code/qlib/qlib/data/ops.py", line 1542, in _load_internal
if cache_key not in H["fs"]:
TypeError: argument of type 'NoneType' is not iterable
"""
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
Cell In[1], line 215
210 data_loader_config = {
211 "feature": (fields, names),
212 "label": (labels, label_names)
213 }
214 data_loader = QlibDataLoader(config=data_loader_config)
--> 215 df_feature = data_loader.load(instruments=market, start_time=start_time, end_time=end_time)
218 # processor config
219 _DEFAULT_LEARN_PROCESSORS_riskfree = [
220 {"class": "CSZScoreNorm", "kwargs": {"fields_group": "feature"}},
221 {"class": "CSZScoreNorm", "kwargs": {"fields_group": "label"}},
(...)
224 {"class": "DropnaProcessor", "kwargs": {"fields_group": "feature"}},
225 ]
File ~/code/qlib/qlib/data/dataset/loader.py:141, in DLWParser.load(self, instruments, start_time, end_time)
138 def load(self, instruments=None, start_time=None, end_time=None) -> pd.DataFrame:
139 if self.is_group:
140 df = pd.concat(
--> 141 {
142 grp: self.load_group_df(instruments, exprs, names, start_time, end_time, grp)
143 for grp, (exprs, names) in self.fields.items()
144 },
145 axis=1,
146 )
147 else:
148 exprs, names = self.fields
File ~/code/qlib/qlib/data/dataset/loader.py:142, in <dictcomp>(.0)
138 def load(self, instruments=None, start_time=None, end_time=None) -> pd.DataFrame:
139 if self.is_group:
140 df = pd.concat(
141 {
--> 142 grp: self.load_group_df(instruments, exprs, names, start_time, end_time, grp)
143 for grp, (exprs, names) in self.fields.items()
144 },
145 axis=1,
146 )
147 else:
148 exprs, names = self.fields
File ~/code/qlib/qlib/data/dataset/loader.py:223, in QlibDataLoader.load_group_df(self, instruments, exprs, names, start_time, end_time, gp_name)
219 freq = self.freq[gp_name] if isinstance(self.freq, dict) else self.freq
220 inst_processors = (
221 self.inst_processors if isinstance(self.inst_processors, list) else self.inst_processors.get(gp_name, [])
222 )
--> 223 df = D.features(instruments, exprs, start_time, end_time, freq=freq, inst_processors=inst_processors)
224 df.columns = names
225 if self.swap_level:
File ~/code/qlib/qlib/data/data.py:1190, in BaseProvider.features(self, instruments, fields, start_time, end_time, freq, disk_cache, inst_processors)
1186 return DatasetD.dataset(
1187 instruments, fields, start_time, end_time, freq, disk_cache, inst_processors=inst_processors
1188 )
1189 except TypeError:
-> 1190 return DatasetD.dataset(instruments, fields, start_time, end_time, freq, inst_processors=inst_processors)
File ~/code/qlib/qlib/data/data.py:923, in LocalDatasetProvider.dataset(self, instruments, fields, start_time, end_time, freq, inst_processors)
921 start_time = cal[0]
922 end_time = cal[-1]
--> 923 data = self.dataset_processor(
924 instruments_d, column_names, start_time, end_time, freq, inst_processors=inst_processors
925 )
927 return data
File ~/code/qlib/qlib/data/data.py:577, in DatasetProvider.dataset_processor(instruments_d, column_names, start_time, end_time, freq, inst_processors)
567 inst_l.append(inst)
568 task_l.append(
569 delayed(DatasetProvider.inst_calculator)(
570 inst, start_time, end_time, freq, normalize_column_names, spans, C, inst_processors
571 )
572 )
574 data = dict(
575 zip(
576 inst_l,
--> 577 ParallelExt(n_jobs=workers, backend=C.joblib_backend, maxtasksperchild=C.maxtasksperchild)(task_l),
578 )
579 )
581 new_data = dict()
582 for inst in sorted(data.keys()):
File /home/hyx/bash/envs/qlib3/lib/python3.8/site-packages/joblib/parallel.py:2007, in Parallel.__call__(self, iterable)
2001 # The first item from the output is blank, but it makes the interpreter
2002 # progress until it enters the Try/Except block of the generator and
2003 # reaches the first `yield` statement. This starts the asynchronous
2004 # dispatch of the tasks to the workers.
2005 next(output)
-> 2007 return output if self.return_generator else list(output)
File /home/hyx/bash/envs/qlib3/lib/python3.8/site-packages/joblib/parallel.py:1650, in Parallel._get_outputs(self, iterator, pre_dispatch)
1647 yield
1649 with self._backend.retrieval_context():
-> 1650 yield from self._retrieve()
1652 except GeneratorExit:
1653 # The generator has been garbage collected before being fully
1654 # consumed. This aborts the remaining tasks if possible and warn
1655 # the user if necessary.
1656 self._exception = True
File /home/hyx/bash/envs/qlib3/lib/python3.8/site-packages/joblib/parallel.py:1754, in Parallel._retrieve(self)
1747 while self._wait_retrieval():
1748
1749 # If the callback thread of a worker has signaled that its task
1750 # triggered an exception, or if the retrieval loop has raised an
1751 # exception (e.g. `GeneratorExit`), exit the loop and surface the
1752 # worker traceback.
1753 if self._aborting:
-> 1754 self._raise_error_fast()
1755 break
1757 # If the next job is not ready for retrieval yet, we just wait for
1758 # async callbacks to progress.
File /home/hyx/bash/envs/qlib3/lib/python3.8/site-packages/joblib/parallel.py:1789, in Parallel._raise_error_fast(self)
1785 # If this error job exists, immediately raise the error by
1786 # calling get_result. This job might not exists if abort has been
1787 # called directly or if the generator is gc'ed.
1788 if error_job is not None:
-> 1789 error_job.get_result(self.timeout)
File /home/hyx/bash/envs/qlib3/lib/python3.8/site-packages/joblib/parallel.py:745, in BatchCompletionCallBack.get_result(self, timeout)
739 backend = self.parallel._backend
741 if backend.supports_retrieve_callback:
742 # We assume that the result has already been retrieved by the
743 # callback thread, and is stored internally. It's just waiting to
744 # be returned.
--> 745 return self._return_or_raise()
747 # For other backends, the main thread needs to run the retrieval step.
748 try:
File /home/hyx/bash/envs/qlib3/lib/python3.8/site-packages/joblib/parallel.py:763, in BatchCompletionCallBack._return_or_raise(self)
761 try:
762 if self.status == TASK_ERROR:
--> 763 raise self._result
764 return self._result
765 finally:
TypeError: argument of type 'NoneType' is not iterable
My code is as follows:
import qlib
import pandas as pd
import numpy as np
from qlib.constant import REG_US
from qlib.utils import exists_qlib_data, init_instance_by_config
from qlib.workflow import R
from qlib.workflow.record_temp import SignalRecord, PortAnaRecord,SigAnaRecord
from qlib.utils import flatten_dict
import pylab as pl
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from qlib.data.dataset.handler import DataHandlerLP
from qlib.data.dataset.loader import QlibDataLoader
provider_uri = "/home/hyx/qlib_data/us/"
market = "all"
start_time = '2012-01-01'
end_time = '2022-12-31'
qlib.init(provider_uri=provider_uri, region=REG_US)
f_return = "($close/Ref($close, 1)-1)"
f_adv5 = "Mean($money, 5)"
f_adv10 = "Mean($money, 10)"
f_adv15 = "Mean($money, 15)"
f_adv20 = "Mean($money, 20)"
f_adv30 = "Mean($money, 30)"
f_adv40 = "Mean($money, 40)"
f_adv50 = "Mean($money, 50)"
f_adv60 = "Mean($money, 60)"
f_adv120 = "Mean($money, 120)"
f_adv180 = "Mean($money, 180)"
alpha_components = {
    "alpha001": f"CSRank(IdxMax(Power(If({f_return}<0, Std({f_return}, 20), $close), 2), 5))-0.5",
}
figurefilepath = '/home/hyx/code/qlib/output/FormulaAlpha/'
sharpe_values = {}
alpha_name = 'alpha001'  # only alpha001 is defined in alpha_components above
fields = [alpha_components[alpha_name]]
names = [alpha_name]
labels = ['Ref($close, -11)/Ref($close, -1) - 1'] # label
label_names = ['LABEL']
data_loader_config = {
    "feature": (fields, names),
    "label": (labels, label_names),
}
data_loader = QlibDataLoader(config=data_loader_config)
df_feature = data_loader.load(instruments=market, start_time=start_time, end_time=end_time)
@timerobin First, be clear about what you based your changes on: the official Qlib or the main branch of my fork. I don't work much on factors anymore. As I recall, adding cross-sectional support required changing more than one or two files, and the config files had to change as well.
@qianyun210603 Hello, I made my changes on top of the official Qlib, adding CSRank, CSScale, and XSectionOperator to qlib.data.ops.py. Which config files do you mean?
Hello, could you point out where your cross-sectional factor is located?