Closed hensejiang closed 2 years ago
From searching the web, one possible cause is that a 'self' is missing in some __init__() function inside a package, but I can't tell whether that would be in Qlib's code or somewhere else. My own code is too simple to call any initialization function directly except the data loader (QlibDataLoader), and that one appears to be given enough parameters through its 'config' argument.
Team: could it be that the problem is actually caused by the data? Any hints you can share? Thanks.
@hensejiang I think you missed the initialization step.
The specific place is under if __name__ == '__main__':
qlib.init(provider_uri='./qlib_data/cn_data', region=REG_CN)
You can try it first, your issue is too vague.
I do have the 'init'; I thought it was not so relevant and skipped some lines. The full cell is:
import os
import qlib
from qlib.config import REG_CN
import pandas as pd
from qlib.data import D, base

HISTORICAL_DATA_START = '2010-01-01'
HISTORICAL_DATA_END = '2021-06-11'
SET_MARKET = "csi300"
benchmark = "SH000300"

import platform
WIN_BASE = 'x:/qisolution/qlib_1'
LINUX_BASE = '/mnt/x/qisolution/qlib_1'
base_folder = LINUX_BASE if platform.system() == 'Linux' else WIN_BASE

provider_uri = os.path.join(base_folder, 'qlib_data/cn_data')
scripts_dir = os.path.join(base_folder, 'scripts')
source_uri = os.path.join(base_folder, 'qlib_data/source_data')
model_saving_uri = os.path.join(base_folder, 'qlib_model_saving')

qlib.init(provider_uri=provider_uri, region=REG_CN)
print('qlib initialized. now moving to next step')

if __name__ == '__main__':
    from qlib.data.dataset.loader import QlibDataLoader

    #MACD_EXP = '(EMA($close, 12) - EMA($close, 26))/$close - EMA((EMA($close, 12) - EMA($close, 26))/$close, 9)/$close'
    close_EXP = '$close'
    fields = ['$close']  # close itself ['$open','$high','$low', '$close', '$volume', '$factor']
    names = ['close']  # ['open','high','low', 'close', 'volume', 'factor']

    print('label test is trying to set label')
    labels = ['((Ref($close, -3) + Ref($close, -2) + Ref($close, -1)) / 3) / Ref($close) - 1']  # set label to IDC
    label_names = ['LABEL_IDC']
    data_loader_config = {
        "feature": (fields, names),
        "label": (labels, label_names),
    }

    print('label test is trying to load data')
    data_loader = QlibDataLoader(config=data_loader_config)
    df = data_loader.load(instruments='csi300', start_time='2010-01-01', end_time='2010-01-31')
    df.to_csv(os.path.join(model_saving_uri, 'label_mean-3_try.csv'))
    print(df)
    print('label test data is done')
    print('done...')
I’m running this on Windows 10, Python 3.8.12. Thank you for the reply.
There is some confusion in your reply. I have reproduced an error with the script you provided; please check whether it is the same as your issue.
Seems we’ve hit the same wall. I’d appreciate it if the Qlib team could resolve this issue and keep making a great tool; thanks for your and the team’s effort. I will keep tracking this issue, please update with any progress.
Btw the code over there has another issue: in qlib.data.data, class BaseProvider, the 'features' method:
fields = list(fields)  # In case of tuple.
try:
    return DatasetD.dataset(
        instruments, fields, start_time, end_time, freq, disk_cache, inst_processors=inst_processors
    )
except TypeError:
    return DatasetD.dataset(instruments, fields, start_time, end_time, freq, inst_processors=inst_processors)
During my debugging I found these two returns could get confused by the 'fields' list when there were multiple items in it. I therefore explicitly matched each argument with the inputs for DatasetD.dataset(), and that did the job.
fields = list(fields)  # In case of tuple.
try:
    return DatasetD.dataset(
        instruments=instruments, fields=fields, start_time=start_time, end_time=end_time,
        freq=freq, disk_cache=disk_cache, inst_processors=inst_processors
    )
except TypeError:
    return DatasetD.dataset(
        instruments=instruments, fields=fields, start_time=start_time, end_time=end_time,
        freq=freq, inst_processors=inst_processors
    )
It turned out this was caused by a typo of my own: it should be $close, not Ref($close), in the expression. Ref requires a window argument N, which is presumably the missing positional argument 'N' reported in the TypeError.
labels = ['((Ref($close, -3) + Ref($close, -2) + Ref($close, -1) ) /3) / $close - 1'] # Correct
#labels = ['((Ref($close, -3) + Ref($close, -2) + Ref($close, -1) ) /3) / Ref($close) - 1'] # My previous wrong expression
label_names = ['LABEL_IDC']
data_loader_config = {
"feature": (fields, names),
"label": (labels, label_names),
}
Thanks to Wang anyway, your response encouraged me to dig this out.
Other issues:
1: Mean, one of the ops described in the Qlib doc at https://qlib.readthedocs.io/en/latest/reference/api.html#module-qlib.data.filter, with example usage like Mean($close, 5), does not support a negative value for the rolling window (to reference future values), although that seems an easy extension given that other ops such as Ref support it. If I set the label with Mean like:
labels = ['Mean($close, -3) / $close - 1']
there will be an error: ValueError: min_periods 1 must be <= window -3. Thus I have to use the long expression above, which looks ugly but does the job.
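For what it's worth, a minimal sketch of the two label definitions side by side (assuming qlib has already been initialized with the cn_data bundle, as in the cell above):

from qlib.data.dataset.loader import QlibDataLoader

# Mean() rejects a negative (forward-looking) window:
# labels = ['Mean($close, -3) / $close - 1']   # ValueError: min_periods 1 must be <= window -3

# so the forward 3-day mean has to be spelled out with Ref() terms instead
labels = ['((Ref($close, -3) + Ref($close, -2) + Ref($close, -1)) / 3) / $close - 1']

loader = QlibDataLoader(config={"label": (labels, ['LABEL_IDC'])})
df = loader.load(instruments='csi300', start_time='2010-01-01', end_time='2010-01-31')
print(df.head())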
2: the label names configured in QlibDataLoader do not work on Linux but work fine on Windows. Below is what I got on Ubuntu with the ugly-while-correct label expression:

                        feature             ...          label
                        $open     $high     ... $factor  ((Ref($close, -3) + Ref($close, -2) + Ref($close, -1) ) /3) / $close - 1
datetime   instrument                       ...
2010-01-05 SH600015     1.007329  1.029316  ...     NaN  -0.047657
2010-01-06 SH600015     1.020358  1.021987  ...     NaN  -0.013465
2010-01-07 SH600015     0.987785  0.997557  ...     NaN   0.022166
2010-01-08 SH600015     0.960912  0.980456  ...     NaN   0.005017
Now see how nice it looks on Win10 with the same code:

                        feature                                                              label
                        open      high      low       close     volume        factor        LABEL_IDC
datetime   instrument
2010-01-05 SH600015     1.007329  1.029316  0.985342  1.025244  726193472.0   NaN           -0.047657
2010-01-06 SH600015     1.020358  1.021987  0.986156  0.987785  667949696.0   NaN           -0.013465
2010-01-07 SH600015     0.987785  0.997557  0.962541  0.967427  628927296.0   NaN            0.022166
2010-01-08 SH600015     0.960912  0.980456  0.952769  0.973941  348927840.0   NaN            0.005017
It's not a big problem, I'm just a bit curious about how differently the two systems process the data.
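In case it helps, a hedged workaround sketch for the Linux naming issue (not an official fix): assuming the raw expressions end up as the level-1 column labels, they can be mapped back to the intended names after loading, reusing the labels and label_names lists from the cell above.

df = data_loader.load(instruments='csi300', start_time='2010-01-01', end_time='2010-01-31')
# rename the level-1 columns: raw label expression -> intended short name
df = df.rename(columns=dict(zip(labels, label_names)), level=1)
print(df['label'].columns)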
I have had a similar problem. I ran my model with "qrun my.yaml". The features are based on 30-min frequency data while the label is based on daily frequency data. Some parameters in the config file are set as follows:

qlib_init:
    provider_uri:
        day: "~/.qlib/qlib_data/my_data"
        30min: "~/.qlib/qlib_data/my_min_data/30min"
    region: cn
    dataset_cache: null
    maxtasksperchild: 1
market: &market all
benchmark: &benchmark SH000905

data_handler_config: &data_handler_config
    start_time: 2018-01-01
    end_time: 2021-11-30 15:00:00
    fit_start_time: 2018-01-01
    fit_end_time: 2020-12-31
    instruments: *market
    freq:
        label: day
        feature: 30min
    infer_processors:
        - class: RobustZScoreNorm
          kwargs:
              fields_group: feature
              clip_outlier: true
        - class: Fillna
          kwargs:
              fields_group: feature
    learn_processors:
        - class: DropnaLabel
        - class: CSRankNorm
          kwargs:
              fields_group: label
              # with label as reference
    inst_processor:
        feature:
            - class: ResampleNProcessor
              module_path: features_resample_N.py
              kwargs:
                  target_frq: 30min

dataset:
    class: TSDatasetH
    module_path: qlib.data.dataset
    kwargs:
        handler:
            class: Alpha158
            module_path: qlib.contrib.data.handler
            kwargs: *data_handler_config
        segments:
            train: [2018-01-01, 2019-12-31]
            valid: [2020-01-01, 2020-12-31]
            test: [2021-01-01, 2021-11-30]
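For readers wondering what ResampleNProcessor does: features_resample_N.py is the poster's own module, so the following is only an assumed sketch of such an instrument processor (the InstProcessor base class and call signature come from qlib.data.inst_processor; the resampling logic is illustrative, not the poster's actual code).

import pandas as pd
from qlib.data.inst_processor import InstProcessor

class ResampleNProcessor(InstProcessor):
    # assumed sketch: re-index one instrument's intraday bars onto a regular
    # target-frequency grid so features of different frequencies stay aligned
    def __init__(self, target_frq: str, **kwargs):
        self.target_frq = target_frq

    def __call__(self, df: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
        df.index = pd.to_datetime(df.index)
        # regular grid at the target frequency over the instrument's time span
        res_index = pd.date_range(df.index.min(), df.index.max(), freq=self.target_frq)
        # last observation per bucket, forward-filled onto the regular grid
        return df.resample(self.target_frq).last().reindex(res_index).ffill()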
The program ran successfully, except that the training loss did not decrease during training. I guessed that the rolling window settings might not be suitable for features based on 30-min frequency data, so to improve the model performance I changed the package file qlib/contrib/data/handler.py, line 249, in parse_config_to_fields:
windows = config["rolling"].get("windows", [8, 16, 32, 64, 128])
As a result, the program ran into an error.
At the beginning, a very long warning message was reported, followed by a long error message.
According to the warning message, I tracked the error to the Qlib package, qlib/data/ops.py, line 740, in the method 'get_extended_window_size' of class 'Rolling'. It seems that self.N == 0 leads to this warning message. But I cannot understand this result, because the definitions of the fields remain unchanged; it doesn't make sense that the rolling window settings would influence self.N.
Then, following the clue provided in the error message, I tracked the error to the Qlib package, qlib/data/data.py, line 554, in dataset_processor:
ParallelExt(n_jobs=workers, backend=C.joblib_backend, maxtasksperchild=C.maxtasksperchild)(task_l)
It's an error related to multiprocessing from joblib. After checking the tasks executed by multiprocessing, I located the problem in qlib/data/cache.py, line 53, in the method '__setitem__' of class 'MemCacheUnit'. There is a note there: '# TODO: thread safe? __setitem__ failure might cause inconsistent size?'. I guess that this piece of code, self._adjust_size(key, value), causes an error during simultaneous execution by multiple threads, but I don't know how to fix it. Besides, it's weird that everything is OK with the original rolling window settings, while this piece of code only starts to misbehave when the rolling window settings change.
Could someone explain this?
Thanks.
I encountered a similar error this morning by setting something unusual in the data processing. We both debugged down to the same level and saw the same info, except that I knew exactly which change of mine triggered it: the data processing. I suggest you post here how you manipulated the data, focusing on the parts you created yourself, such as features or labels, so that the team can better help you.
@hensejiang
There are two changes I made in data processing. One is the frequency of the data used to calculate the features. I tried a similar experiment with features based on daily frequency data and the same error messages were reported, so the data frequency doesn't matter. The other change is in qlib/contrib/data/handler.py, line 249, in the method 'parse_config_to_fields':
windows = config["rolling"].get("windows", [8, 16, 32, 64, 128])
Everything is OK when the rolling windows are the initial settings:
windows = config["rolling"].get("windows", [5, 10, 20, 30, 60])
I guess this change triggers the error.
It will be very helpful to set kernels to 1 to disable multiprocessing for debugging. You can try it and get more details about the exception.
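A minimal sketch of that suggestion (the provider_uri is just the usual example path; point it at your own data):

import qlib
from qlib.config import REG_CN

# kernels=1 runs the data loading in a single process, so the real exception
# surfaces with its full traceback instead of being wrapped by joblib
qlib.init(provider_uri='~/.qlib/qlib_data/cn_data', region=REG_CN, kernels=1)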
I made a spelling mistake. As a result, the variable 'windows' kept its previous definition, which contains a value of 0 and leads to the warning message. Thanks for your suggestion about kernels.
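For illustration, a minimal sketch of how a zero window reproduces that warning path (instrument and dates are arbitrary; assumes qlib is initialized). In qlib's Rolling ops, N == 0 means an expanding window, so get_extended_window_size cannot bound the required look-back and emits the warning.

from qlib.data import D

# a 0 in the windows list yields expressions like Mean($close, 0); inside the
# Rolling op this sets self.N == 0 and triggers the warning seen above
df = D.features(['SH600000'], ['Mean($close, 0)'], start_time='2020-01-01', end_time='2020-01-31')
print(df.head())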
This issue is stale because it has been open for three months with no activity. Remove the stale label or comment on the issue, otherwise it will be closed in 5 days.
❓ Questions and Help
I was trying to develop some alpha following the doc at https://qlib.readthedocs.io/en/latest/advanced/alpha.html The code was pasted from the example as:
if __name__ == '__main__':
    from qlib.data.dataset.loader import QlibDataLoader

    MACD_EXP = '(EMA($close, 12) - EMA($close, 26))/$close - EMA((EMA($close, 12) - EMA($close, 26))/$close, 9)/$close'
    print('done...')
Result from running this simple test is a long error message:
TypeError                                 Traceback (most recent call last)
~\AppData\Roaming\Python\Python38\site-packages\qlib\data\dataset\__init__.py in __init__(self, handler, segments, **kwargs)
    113         }
    114         """
--> 115         self.handler: DataHandler = init_instance_by_config(handler, accept_types=DataHandler)
    116         self.segments = segments.copy()
    117         self.fetch_kwargs = {}

~\AppData\Roaming\Python\Python38\site-packages\qlib\utils\__init__.py in init_instance_by_config(config, default_module, accept_types, try_kwargs, **kwargs)
    334         # 1: `XXX() got multiple values for keyword argument 'YYY'`
    335         # 2: `XXX() got an unexpected keyword argument 'YYY'`
--> 336         return klass(**cls_kwargs, **kwargs)
    337
    338

~\AppData\Roaming\Python\Python38\site-packages\qlib\contrib\data\handler.py in __init__(self, instruments, start_time, end_time, freq, infer_processors, learn_processors, fit_start_time, fit_end_time, filter_pipe, inst_processor, **kwargs)
     78         }
     79
---> 80         super().__init__(
     81             instruments=instruments,
     82             start_time=start_time,

~\AppData\Roaming\Python\Python38\site-packages\qlib\data\dataset\handler.py in __init__(self, instruments, start_time, end_time, data_loader, infer_processors, learn_processors, shared_processors, process_type, drop_raw, **kwargs)
    387         self.process_type = process_type
    388         self.drop_raw = drop_raw
--> 389         super().__init__(instruments, start_time, end_time, data_loader, **kwargs)
    390
    391     def get_all_processors(self):

~\AppData\Roaming\Python\Python38\site-packages\qlib\data\dataset\handler.py in __init__(self, instruments, start_time, end_time, data_loader, init_data, fetch_orig)
    103         if init_data:
    104             with TimeInspector.logt("Init data"):
--> 105                 self.setup_data()
    106         super().__init__()
    107

~\AppData\Roaming\Python\Python38\site-packages\qlib\data\dataset\handler.py in setup_data(self, init_type, **kwargs)
    523         """
    524         # init raw data
--> 525         super().setup_data(**kwargs)
    526
    527         with TimeInspector.logt("fit & process data"):

~\AppData\Roaming\Python\Python38\site-packages\qlib\data\dataset\handler.py in setup_data(self, enable_cache)
    147         with TimeInspector.logt("Loading data"):
    148             # make sure the fetch method is based on a index-sorted pd.DataFrame
--> 149             self._data = lazy_sort_index(self.data_loader.load(self.instruments, self.start_time, self.end_time))
    150         # TODO: cache
    151

~\AppData\Roaming\Python\Python38\site-packages\qlib\data\dataset\loader.py in load(self, instruments, start_time, end_time)
    135         if self.is_group:
    136             df = pd.concat(
--> 137                 {
    138                     grp: self.load_group_df(instruments, exprs, names, start_time, end_time, grp)
    139                     for grp, (exprs, names) in self.fields.items()

~\AppData\Roaming\Python\Python38\site-packages\qlib\data\dataset\loader.py in <dictcomp>(.0)
    136             df = pd.concat(
    137                 {
--> 138                     grp: self.load_group_df(instruments, exprs, names, start_time, end_time, grp)
    139                     for grp, (exprs, names) in self.fields.items()
    140                 },

~\AppData\Roaming\Python\Python38\site-packages\qlib\data\dataset\loader.py in load_group_df(self, instruments, exprs, names, start_time, end_time, gp_name)
    211
    212         freq = self.freq[gp_name] if isinstance(self.freq, dict) else self.freq
--> 213         df = D.features(
    214             instruments, exprs, start_time, end_time, freq=freq, inst_processors=self.inst_processor.get(gp_name, [])
    215         )

~\AppData\Roaming\Python\Python38\site-packages\qlib\data\data.py in features(self, instruments, fields, start_time, end_time, freq, disk_cache, inst_processors)
   1014             )
   1015         except TypeError:
-> 1016             return DatasetD.dataset(instruments, fields, start_time, end_time, freq, inst_processors=inst_processors)
   1017
   1018

~\AppData\Roaming\Python\Python38\site-packages\qlib\data\data.py in dataset(self, instruments, fields, start_time, end_time, freq, inst_processors)
    749             end_time = cal[-1]
    750
--> 751         data = self.dataset_processor(
    752             instruments_d, column_names, start_time, end_time, freq, inst_processors=inst_processors
    753         )

~\AppData\Roaming\Python\Python38\site-packages\qlib\data\data.py in dataset_processor(instruments_d, column_names, start_time, end_time, freq, inst_processors)
    522             zip(
    523                 inst_l,
--> 524                 ParallelExt(n_jobs=workers, backend=C.joblib_backend, maxtasksperchild=C.maxtasksperchild)(task_l),
    525             )
    526         )

~\AppData\Roaming\Python\Python38\site-packages\joblib\parallel.py in __call__(self, iterable)
   1054
   1055             with self._backend.retrieval_context():
-> 1056                 self.retrieve()
   1057             # Make sure that we get a last message telling us we are done
   1058             elapsed_time = time.time() - self._start_time

~\AppData\Roaming\Python\Python38\site-packages\joblib\parallel.py in retrieve(self)
    933         try:
    934             if getattr(self._backend, 'supports_timeout', False):
--> 935                 self._output.extend(job.get(timeout=self.timeout))
    936             else:
    937                 self._output.extend(job.get())

~\.conda\envs\pt\lib\multiprocessing\pool.py in get(self, timeout)
    769             return self._value
    770         else:
--> 771             raise self._value
    772
    773     def _set(self, i, obj):

TypeError: __init__() missing 1 required positional argument: 'N'
It seems the error arises when invoking multiprocessing from joblib. I can't handle an error this deep down; could someone look into this?
From searching the web, one possible cause is that a 'self' is missing in some __init__() function inside a package, but I can't tell whether that would be in Qlib's code or somewhere else. My own code is too simple to call any initialization function directly except the data loader (QlibDataLoader), and that one appears to be given enough parameters through its 'config' argument.
Thanks.