microsoft / qlib

Qlib is an AI-oriented quantitative investment platform that aims to realize the potential, empower research, and create value using AI technologies in quantitative investment, from exploring ideas to implementing productions. Qlib supports diverse machine learning modeling paradigms, including supervised learning, market dynamics modeling, and RL.
https://qlib.readthedocs.io/en/latest/
MIT License
15.47k stars 2.64k forks

Qlib run into error by joblib or multiprocessing #842

Closed hensejiang closed 2 years ago

hensejiang commented 2 years ago

❓ Questions and Help

I was trying to develop some alpha following the doc at https://qlib.readthedocs.io/en/latest/advanced/alpha.html. The code was pasted from the example as:

    if __name__ == '__main__':
        from qlib.data.dataset.loader import QlibDataLoader

        MACD_EXP = '(EMA($close, 12) - EMA($close, 26))/$close - EMA((EMA($close, 12) - EMA($close, 26))/$close, 9)/$close'

        close_EXP = '($close)'
        fields = [close_EXP]  # close itself
        names = ['close_itself']
        print('label test is trying to set label')
        labels = ['((Ref($close, -3) + Ref($close, -2) + Ref($close, -1)) / 3) / Ref($close) - 1']  # set label to IDC
        label_names = ['LABEL_IDC']
        data_loader_config = {
            "feature": (fields, names),
            "label": (labels, label_names),
        }
        print('label test is trying to load data')
        data_loader = QlibDataLoader(config=data_loader_config)
        df = data_loader.load(instruments='csi300', start_time='2010-01-01', end_time='2010-01-31')
        df.to_csv(os.path.join(model_saving_uri, 'label_mean-3_try.csv'))
        print(df)
        print('label test data is done')

    print('done...')

Result from running this simple test is a long error message:

    TypeError                                 Traceback (most recent call last)
    ~\AppData\Roaming\Python\Python38\site-packages\qlib\data\dataset\__init__.py in __init__(self, handler, segments, **kwargs)
        113         }
        114         """
    --> 115         self.handler: DataHandler = init_instance_by_config(handler, accept_types=DataHandler)
        116         self.segments = segments.copy()
        117         self.fetch_kwargs = {}

    ~\AppData\Roaming\Python\Python38\site-packages\qlib\utils\__init__.py in init_instance_by_config(config, default_module, accept_types, try_kwargs, **kwargs)
        334     #   1: XXX() got multiple values for keyword argument 'YYY'
        335     #   2: XXX() got an unexpected keyword argument 'YYY'
    --> 336     return klass(**cls_kwargs, **kwargs)
        337
        338

    ~\AppData\Roaming\Python\Python38\site-packages\qlib\contrib\data\handler.py in __init__(self, instruments, start_time, end_time, freq, infer_processors, learn_processors, fit_start_time, fit_end_time, filter_pipe, inst_processor, **kwargs)
         78         }
         79
    -->  80         super().__init__(
         81             instruments=instruments,
         82             start_time=start_time,

    ~\AppData\Roaming\Python\Python38\site-packages\qlib\data\dataset\handler.py in __init__(self, instruments, start_time, end_time, data_loader, infer_processors, learn_processors, shared_processors, process_type, drop_raw, **kwargs)
        387         self.process_type = process_type
        388         self.drop_raw = drop_raw
    --> 389         super().__init__(instruments, start_time, end_time, data_loader, **kwargs)
        390
        391     def get_all_processors(self):

    ~\AppData\Roaming\Python\Python38\site-packages\qlib\data\dataset\handler.py in __init__(self, instruments, start_time, end_time, data_loader, init_data, fetch_orig)
        103         if init_data:
        104             with TimeInspector.logt("Init data"):
    --> 105                 self.setup_data()
        106         super().__init__()
        107

    ~\AppData\Roaming\Python\Python38\site-packages\qlib\data\dataset\handler.py in setup_data(self, init_type, **kwargs)
        523         """
        524         # init raw data
    --> 525         super().setup_data(**kwargs)
        526
        527         with TimeInspector.logt("fit & process data"):

    ~\AppData\Roaming\Python\Python38\site-packages\qlib\data\dataset\handler.py in setup_data(self, enable_cache)
        147         with TimeInspector.logt("Loading data"):
        148             # make sure the fetch method is based on a index-sorted pd.DataFrame
    --> 149             self._data = lazy_sort_index(self.data_loader.load(self.instruments, self.start_time, self.end_time))
        150             # TODO: cache
        151

    ~\AppData\Roaming\Python\Python38\site-packages\qlib\data\dataset\loader.py in load(self, instruments, start_time, end_time)
        135         if self.is_group:
        136             df = pd.concat(
    --> 137                 {
        138                     grp: self.load_group_df(instruments, exprs, names, start_time, end_time, grp)
        139                     for grp, (exprs, names) in self.fields.items()

    ~\AppData\Roaming\Python\Python38\site-packages\qlib\data\dataset\loader.py in <dictcomp>(.0)
        136             df = pd.concat(
        137                 {
    --> 138                     grp: self.load_group_df(instruments, exprs, names, start_time, end_time, grp)
        139                     for grp, (exprs, names) in self.fields.items()
        140                 },

    ~\AppData\Roaming\Python\Python38\site-packages\qlib\data\dataset\loader.py in load_group_df(self, instruments, exprs, names, start_time, end_time, gp_name)
        211
        212         freq = self.freq[gp_name] if isinstance(self.freq, dict) else self.freq
    --> 213         df = D.features(
        214             instruments, exprs, start_time, end_time, freq=freq, inst_processors=self.inst_processor.get(gp_name, [])
        215         )

    ~\AppData\Roaming\Python\Python38\site-packages\qlib\data\data.py in features(self, instruments, fields, start_time, end_time, freq, disk_cache, inst_processors)
       1014             )
       1015         except TypeError:
    -> 1016             return DatasetD.dataset(instruments, fields, start_time, end_time, freq, inst_processors=inst_processors)
       1017
       1018

    ~\AppData\Roaming\Python\Python38\site-packages\qlib\data\data.py in dataset(self, instruments, fields, start_time, end_time, freq, inst_processors)
        749             end_time = cal[-1]
        750
    --> 751         data = self.dataset_processor(
        752             instruments_d, column_names, start_time, end_time, freq, inst_processors=inst_processors
        753         )

    ~\AppData\Roaming\Python\Python38\site-packages\qlib\data\data.py in dataset_processor(instruments_d, column_names, start_time, end_time, freq, inst_processors)
        522             zip(
        523                 inst_l,
    --> 524                 ParallelExt(n_jobs=workers, backend=C.joblib_backend, maxtasksperchild=C.maxtasksperchild)(task_l),
        525             )
        526         )

    ~\AppData\Roaming\Python\Python38\site-packages\joblib\parallel.py in __call__(self, iterable)
       1054
       1055         with self._backend.retrieval_context():
    -> 1056             self.retrieve()
       1057         # Make sure that we get a last message telling us we are done
       1058         elapsed_time = time.time() - self._start_time

    ~\AppData\Roaming\Python\Python38\site-packages\joblib\parallel.py in retrieve(self)
        933             try:
        934                 if getattr(self._backend, 'supports_timeout', False):
    --> 935                     self._output.extend(job.get(timeout=self.timeout))
        936                 else:
        937                     self._output.extend(job.get())

    ~\.conda\envs\pt\lib\multiprocessing\pool.py in get(self, timeout)
        769             return self._value
        770         else:
    --> 771             raise self._value
        772
        773     def _set(self, i, obj):

    TypeError: __init__() missing 1 required positional argument: 'N'
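For what it's worth, a TypeError of this shape just means some object was constructed without a required argument. A minimal, qlib-free sketch with a hypothetical Ref-like class reproduces the same message:

```python
# Hypothetical stand-in for an operator that needs a window argument.
# This only illustrates the error class seen at the bottom of the traceback;
# it is not qlib's actual Ref implementation.
class Ref:
    def __init__(self, feature, N):  # N (the shift) is mandatory
        self.feature = feature
        self.N = N

try:
    Ref("$close")  # forgot N -> same kind of error as in the traceback
except TypeError as e:
    print(e)  # __init__() missing 1 required positional argument: 'N'
```

The multiprocessing layers only relay the exception from a worker process, which is why the real cause looks so "deep down".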

It seems the error lies in invoking multiprocessing from joblib. I can't handle an error so deep down there; could someone look into this?

From searching the web, I found a possible reason: somewhere a package is missing 'self' in an __init__() function. But I can't tell whether it's in Qlib's code or somewhere else. My own code is too simple to directly call any initialization function except the data loader, and that appears to receive enough parameters via the 'config' argument.

Thanks.


hensejiang commented 2 years ago

Team: could it be possible that the problem was actually due to data? Any hints you may have and share? thanks.

Wangwuyi123 commented 2 years ago

@hensejiang I think you missed the initialization step (see screenshot).

Specifically, inside if __name__ == '__main__':, add qlib.init(provider_uri='./qlib_data/cn_data', region=REG_CN).

You can try it first, your issue is too vague.

hensejiang commented 2 years ago

I do have the init; I thought it was not so relevant and skipped some lines. The full cell is:

    import os
    import qlib
    from qlib.config import REG_CN
    import pandas as pd
    from qlib.data import D, base

    HISTORICAL_DATA_START = '2010-01-01'
    HISTORICAL_DATA_END = '2021-06-11'
    SET_MARKET = "csi300"
    benchmark = "SH000300"

    import platform
    WIN_BASE = 'x:/qisolution/qlib_1'
    LINUX_BASE = '/mnt/x/qisolution/qlib_1'
    base_folder = LINUX_BASE if platform.system() == 'Linux' else WIN_BASE

    provider_uri = os.path.join(base_folder, 'qlib_data/cn_data')
    scripts_dir = os.path.join(base_folder, 'scripts')
    source_uri = os.path.join(base_folder, 'qlib_data/source_data')
    model_saving_uri = os.path.join(base_folder, 'qlib_model_saving')

    qlib.init(provider_uri=provider_uri, region=REG_CN)
    print('qlib initialized. now moving to next step')

    if __name__ == '__main__':
        from qlib.data.dataset.loader import QlibDataLoader
        #MACD_EXP = '(EMA($close, 12) - EMA($close, 26))/$close - EMA((EMA($close, 12) - EMA($close, 26))/$close, 9)/$close'
        close_EXP = '$close'
        fields = ['$close']  # close itself ['$open','$high','$low', '$close', '$volume', '$factor']
        names = ['close']  # ['open','high','low', 'close', 'volume', 'factor']
        print('label test is trying to set label')
        labels = ['((Ref($close, -3) + Ref($close, -2) + Ref($close, -1)) / 3) / Ref($close) - 1']  # set label to IDC
        label_names = ['LABEL_IDC']
        data_loader_config = {
            "feature": (fields, names),
            "label": (labels, label_names),
        }
        print('label test is trying to load data')
        data_loader = QlibDataLoader(config=data_loader_config)
        df = data_loader.load(instruments='csi300', start_time='2010-01-01', end_time='2010-01-31')
        df.to_csv(os.path.join(model_saving_uri, 'label_mean-3_try.csv'))
        print(df)
        print('label test data is done')

    print('done...')

I’m running this on Windows 10, Python 3.8.12. Thank you for the reply.


Wangwuyi123 commented 2 years ago

There is some confusion in your reply. I have reproduced an error with the script you provided (see screenshot). Please check whether it is the same as your issue.

hensejiang commented 2 years ago

Seems we’ve hit the same wall. I’d appreciate it if the Qlib team could resolve this issue to make a great tool; thanks for your and the team’s effort. I will keep tracking this issue, please update with any progress.

Btw, the code over there has another issue: in qlib.data.data, class BaseProvider, the 'features' method:

Below is the source code from the Qlib site-packages:

    fields = list(fields)  # In case of tuple.
    try:
        return DatasetD.dataset(
            instruments, fields, start_time, end_time, freq, disk_cache, inst_processors=inst_processors
        )
    except TypeError:
        return DatasetD.dataset(instruments, fields, start_time, end_time, freq, inst_processors=inst_processors)

During my debugging I found these two returns could get confused by the 'fields' list when there were multiple items in it. I thus explicitly matched each argument by keyword when calling DatasetD.dataset(), and it did the job.

Below is my modification:

    fields = list(fields)  # In case of tuple.
    try:
        return DatasetD.dataset(
            instruments=instruments, fields=fields, start_time=start_time, end_time=end_time,
            freq=freq, disk_cache=disk_cache, inst_processors=inst_processors
        )
    except TypeError:
        return DatasetD.dataset(
            instruments=instruments, fields=fields, start_time=start_time, end_time=end_time,
            freq=freq, inst_processors=inst_processors
        )
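For illustration, here is a self-contained sketch of why the keyword form is safer: the fallback pattern retries on TypeError, and keyword binding prevents values silently landing on the wrong parameters. The signatures below are simplified stand-ins, not qlib's actual ones:

```python
# Simplified stand-in for a provider whose signature lacks disk_cache,
# mimicking why the except-TypeError fallback exists in BaseProvider.features.
def dataset(instruments, fields, start_time=None, end_time=None,
            freq="day", inst_processors=None):
    return (instruments, fields, start_time, end_time, freq)

def features(instruments, fields, start_time=None, end_time=None,
             freq="day", disk_cache=0, inst_processors=None):
    try:
        # First attempt includes disk_cache; providers that don't accept
        # it raise TypeError, and we retry without it.
        return dataset(instruments, fields, start_time, end_time,
                       freq=freq, disk_cache=disk_cache,
                       inst_processors=inst_processors)
    except TypeError:
        # Keyword-only retry: every value is bound to the parameter it
        # was meant for, regardless of the provider's positional order.
        return dataset(instruments=instruments, fields=fields,
                       start_time=start_time, end_time=end_time,
                       freq=freq, inst_processors=inst_processors)

print(features("csi300", ["$close"], "2010-01-01", "2010-01-31"))
```

With purely positional calls, a provider with a different parameter order could receive, say, the fields list where it expects a date, without any exception being raised.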

hensejiang commented 2 years ago

Found out this was caused by a typo of my own: it should be $close, not Ref($close), in the expression. Ref requires the shift argument N, so Ref($close) with no second argument is what raised TypeError: __init__() missing 1 required positional argument: 'N'.

labels  = ['((Ref($close, -3) + Ref($close, -2) + Ref($close, -1) ) /3) / $close - 1'] # Correct
#labels = ['((Ref($close, -3) + Ref($close, -2) + Ref($close, -1) ) /3) / Ref($close) - 1'] # My previous wrong expression
label_names = ['LABEL_IDC']
data_loader_config = {
    "feature": (fields, names),
    "label": (labels, label_names),
}
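For intuition, the corrected label is the mean of the next three closes divided by today's close, minus 1. A quick pure-Python check of that arithmetic (the sample prices are made up):

```python
# Made-up closing prices for one instrument over consecutive days.
closes = [10.0, 10.5, 11.0, 10.8, 11.2, 11.5]

def label_idc(closes, t):
    """((Ref($close,-3) + Ref($close,-2) + Ref($close,-1)) / 3) / $close - 1,
    i.e. the mean of the next three closes over today's close, minus 1."""
    nxt = closes[t + 1:t + 4]
    if len(nxt) < 3:
        return None  # not enough future data (qlib would yield NaN here)
    return (sum(nxt) / 3) / closes[t] - 1

print(label_idc(closes, 0))  # (10.5 + 11.0 + 10.8) / 3 / 10.0 - 1
```

This also makes the original error visible in hindsight: dividing by Ref($close) instead of $close is not just a different formula, it is an incomplete Ref call.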

Thanks to Wang anyway, your response encouraged me to dig this out.

Other issues:

1: Mean, one of the ops described in the Qlib doc at https://qlib.readthedocs.io/en/latest/reference/api.html#module-qlib.data.filter with example usage like Mean($close, 5), does not support a negative value for the rolling window (a forward-looking window for futures), which seems an easy integration given that other ops such as Ref accept negative arguments. If I set the label with Mean like:

    labels = ['Mean($close, -3) / $close - 1']

there will be an error:

    ValueError: min_periods 1 must be <= window -3

Thus I have to use the long expression above; it looks ugly but does the job.

2: the label name argument of QlibDataLoader does not work on Linux but works fine on Windows. Below is what I got on Ubuntu with the ugly-while-correct label expression:

                             feature            ...     label
                             $open     $high    ...   $factor  ((Ref($close, -3) + Ref($close, -2) + Ref($close, -1) ) /3) / $close - 1
    datetime   instrument                       ...
    2010-01-05 SH600015      1.007329  1.029316 ...       NaN                                                               -0.047657
    2010-01-06 SH600015      1.020358  1.021987 ...       NaN                                                               -0.013465
    2010-01-07 SH600015      0.987785  0.997557 ...       NaN                                                                0.022166
    2010-01-08 SH600015      0.960912  0.980456 ...       NaN                                                                0.005017

Now see how nicely it is rendered on Win10 by the same code:

                             feature                                                              label
                             open      high      low       close     volume       factor      LABEL_IDC
    datetime   instrument
    2010-01-05 SH600015      1.007329  1.029316  0.985342  1.025244  726193472.0     NaN      -0.047657
    2010-01-06 SH600015      1.020358  1.021987  0.986156  0.987785  667949696.0     NaN      -0.013465
    2010-01-07 SH600015      0.987785  0.997557  0.962541  0.967427  628927296.0     NaN       0.022166
    2010-01-08 SH600015      0.960912  0.980456  0.952769  0.973941  348927840.0     NaN       0.005017

It's not a big problem; I'm just a bit curious about how differently the two systems process the data.

xmm1016 commented 2 years ago

I have had a similar problem. I ran my model with "qrun my.yaml". The features are based on the 30-min frequency data while the label is based on the daily frequency data. Some parameters in the config file are set as follows:

    qlib_init:
        provider_uri:
            day: "~/.qlib/qlib_data/my_data"
            30min: "~/.qlib/qlib_data/my_min_data/30min"
        region: cn
        dataset_cache: null
        maxtasksperchild: 1
    market: &market all
    benchmark: &benchmark SH000905

    data_handler_config: &data_handler_config
        start_time: 2018-01-01
        # 1min closing time is 15:00:00
        end_time: 2021-11-30 15:00:00
        fit_start_time: 2018-01-01
        fit_end_time: 2020-12-31
        instruments: *market
        freq:
            label: day
            feature: 30min
        infer_processors:
            - class: RobustZScoreNorm
              kwargs:
                  fields_group: feature
                  clip_outlier: true
            - class: Fillna
              kwargs:
                  fields_group: feature
        learn_processors:
            - class: DropnaLabel
            - class: CSRankNorm
              kwargs:
                  fields_group: label
        # with label as reference
        inst_processor:
            feature:
                - class: ResampleNProcessor
                  module_path: features_resample_N.py
                  kwargs:
                      target_frq: 30min

    dataset:
        class: TSDatasetH
        module_path: qlib.data.dataset
        kwargs:
            handler:
                class: Alpha158
                module_path: qlib.contrib.data.handler
                kwargs: *data_handler_config
            segments:
                train: [2018-01-01, 2019-12-31]
                valid: [2020-01-01, 2020-12-31]
                test: [2021-01-01, 2021-11-30]

The program ran successfully, except that the train loss did not descend during training. I guessed that the rolling window settings might not be suitable for features based on the 30-min frequency data, so I changed the package file qlib/contrib/data/handler.py, line 249, in parse_config_to_fields, to windows = config["rolling"].get("windows", [8, 16, 32, 64, 128]) to improve the model performance. As a result, the program ran into an error. At the beginning, it reported a very long warning message. Part of the results are shown:

(screenshot of the warning messages)

Then there is a long error message:

(screenshots of the error messages)

According to the warning message, I traced the error to the Qlib package, qlib/data/ops.py, line 740, in method 'get_extended_window_size' of class 'Rolling'. It seems that self.N == 0 leads to this warning message. But I cannot understand this result, because the definition of the fields remained unchanged. It doesn't make sense that the rolling window settings would influence self.N.

Then, following the clue in the error message, I traced the error to the Qlib package, qlib/data/data.py, line 554, in dataset_processor: ParallelExt(n_jobs=workers, backend=C.joblib_backend, maxtasksperchild=C.maxtasksperchild)(task_l). It's an error related to multiprocessing via joblib. After checking the tasks executed by multiprocessing, I located the problem in qlib/data/cache.py, line 53, in method '__setitem__' of class 'MemCacheUnit'. There is a note there: '# TODO: thread safe? __setitem__ failure might cause inconsistent size?'. I guess that this piece of code, self._adjust_size(key, value), causes an error during simultaneous execution by multiple threads, but I don't know how to fix it. Besides, it's weird that everything is OK with the original rolling window settings, while this piece of code begins to fail once the rolling window settings change.
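If a race in __setitem__ really were the cause, the usual remedy is to guard the size accounting with a lock. Below is a qlib-independent sketch of a thread-safe cache unit; the class and method names mirror the discussion, but the implementation is a toy, not qlib's actual code:

```python
import threading

class MemCacheUnit:
    """Toy size-limited cache; __setitem__ is guarded by a lock so that
    concurrent writers cannot leave the tracked size inconsistent."""
    def __init__(self, size_limit):
        self.size_limit = size_limit
        self._data = {}
        self._size = 0
        self._lock = threading.Lock()

    def __setitem__(self, key, value):
        with self._lock:  # make the size adjustment atomic
            if key in self._data:
                self._size -= 1
            self._data[key] = value
            self._size += 1
            while self._size > self.size_limit:
                oldest = next(iter(self._data))  # dicts keep insertion order
                del self._data[oldest]
                self._size -= 1

cache = MemCacheUnit(size_limit=100)

def writer(base):
    for i in range(50):
        cache[(base, i)] = i

threads = [threading.Thread(target=writer, args=(b,)) for b in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(cache._data), cache._size)  # tracked size matches the dict
```

Without the lock, two threads interleaving inside the size adjustment could leave _size out of step with the dict, which matches the TODO note quoted above.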

Could someone explain this?

Thanks.

hensejiang commented 2 years ago

I encountered a similar error this morning by setting something unusual in data processing. We both debugged down to the same level and saw the same info; the difference is that I knew exactly what change triggered it: the data processing. I suggest you post here how you manipulated the data, focusing on the self-created parts such as features or labels, so that the team can better help you.

xmm1016 commented 2 years ago

@hensejiang There are two changes I made in data processing. One is the frequency of the data used to calculate features. I tried a similar experiment with features based on the daily frequency data, and the same error messages were reported; thus, the data frequency doesn't matter. The other change is in qlib/contrib/data/handler.py, line 249, in method 'parse_config_to_fields': windows = config["rolling"].get("windows", [8, 16, 32, 64, 128]). Everything is OK when the rolling windows are the initial settings, windows = config["rolling"].get("windows", [5, 10, 20, 30, 60]). I guess this change triggered the error.

you-n-g commented 2 years ago

It will be very helpful to set kernels to 1 to disable multiprocessing for debugging. You can try it and get more details about the exception.

https://github.com/microsoft/qlib/pull/880/files#diff-84832c55fadb4ced122bbbca6e3a0be0d83b043a977b0f364116d87834e6d2a1R93

xmm1016 commented 2 years ago

I made a spelling mistake. As a result, the variable 'windows' held its previous definition, which contained a value of 0 and led to the warning message. Thanks for your suggestion about kernels.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open for three months with no activity. Remove the stale label or comment on the issue, otherwise it will be closed in 5 days.