microsoft / qlib

Qlib is an AI-oriented quantitative investment platform that aims to realize the potential, empower research, and create value using AI technologies in quantitative investment, from exploring ideas to implementing productions. Qlib supports diverse machine learning modeling paradigms. including supervised learning, market dynamics modeling, and RL.
https://qlib.readthedocs.io/en/latest/
MIT License
15.53k stars 2.64k forks source link

Same issues with https://github.com/microsoft/qlib/issues/684#issue-1051706048 #985

Closed liujianliuku closed 1 year ago

liujianliuku commented 2 years ago

Hi, I tried to run examples/model_rolling/task_manager_rolling.py and encountered following issues, which is same with the link above.

[4490:MainThread](2022-03-16 23:41:19,948) INFO - qlib.timer - [log.py:113] - Time cost: 2.219s | CSZScoreNorm Done [4490:MainThread](2022-03-16 23:41:19,948) INFO - qlib.timer - [log.py:113] - Time cost: 2.359s | fit & process data Done [4490:MainThread](2022-03-16 23:41:19,948) INFO - qlib.timer - [log.py:113] - Time cost: 6.673s | Init data Done [0] train-rmse:1.05221 valid-rmse:1.05333 [20] train-rmse:0.96462 valid-rmse:0.99454 [40] train-rmse:0.94751 valid-rmse:0.99711 [60] train-rmse:0.93133 valid-rmse:0.99901 [61] train-rmse:0.93027 valid-rmse:0.99892 [4490:MainThread](2022-03-16 23:41:37,447) INFO - qlib.workflow - [record_temp.py:194] - Signal record 'pred.pkl' has been saved as the artifact of the Experiment 1 'The following are prediction results of the XGBModel model.' score datetime instrument
2019-04-09 SH600000 0.031448 SH600009 -0.114521 SH600010 0.133871 SH600011 0.022843 SH600015 -0.006089 {'IC': 0.019224824145083887, 'ICIR': 0.12174991104957546, 'Rank IC': 0.027944959160536458, 'Rank ICIR': 0.19297428591470864} Exception in thread Thread-8: Traceback (most recent call last): File "/home/liujia/anaconda3/envs/qlib/lib/python3.7/threading.py", line 926, in _bootstrap_inner self.run() File "/home/liujia/anaconda3/envs/qlib/lib/python3.7/threading.py", line 870, in run self._target(*self._args, **self._kwargs) File "/home/liujia/anaconda3/envs/qlib/lib/python3.7/site-packages/qlib/utils/paral.py", line 88, in run data() File "/home/liujia/anaconda3/envs/qlib/lib/python3.7/site-packages/qlib/workflow/recorder.py", line 371, in log_metrics self.client.log_metric(self.id, name, data, step=step) File "/home/liujia/anaconda3/envs/qlib/lib/python3.7/site-packages/mlflow/tracking/client.py", line 690, in log_metric self._tracking_client.log_metric(run_id, key, value, timestamp, step) File "/home/liujia/anaconda3/envs/qlib/lib/python3.7/site-packages/mlflow/tracking/_tracking_service/client.py", line 233, in log_metric self.store.log_metric(run_id, metric) File "/home/liujia/anaconda3/envs/qlib/lib/python3.7/site-packages/mlflow/store/tracking/file_store.py", line 776, in log_metric run_info = self._get_run_info(run_id) File "/home/liujia/anaconda3/envs/qlib/lib/python3.7/site-packages/mlflow/store/tracking/file_store.py", line 554, in _get_run_info run_info = self._get_run_info_from_dir(run_dir) File "/home/liujia/anaconda3/envs/qlib/lib/python3.7/site-packages/mlflow/store/tracking/file_store.py", line 563, in _get_run_info_from_dir run_info = _read_persisted_run_info_dict(meta) File "/home/liujia/anaconda3/envs/qlib/lib/python3.7/site-packages/mlflow/store/tracking/file_store.py", line 104, in _read_persisted_run_info_dict dict_copy = run_info_dict.copy() AttributeError: 'NoneType' object has no attribute 'copy'

My collector info:

Linux x86_64 Linux-5.13.0-35-generic-x86_64-with-debian-bullseye-sid

40~20.04.1-Ubuntu SMP Mon Mar 7 09:18:32 UTC 2022

Python version: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]

Qlib version: 0.8.4.99 numpy==1.21.5 pandas==1.3.5 scipy==1.7.3 requests==2.27.1 sacred==0.8.2 python-socketio==5.5.2 redis==4.1.4 python-redis-lock==3.7.0 schedule==1.1.0 cvxpy==1.2.0 hyperopt==0.1.2 fire==0.4.0 statsmodels==0.13.2 xlrd==2.0.1 plotly==5.6.0 matplotlib==3.5.1 tables==3.7.0 pyyaml==6.0 mlflow==1.24.0 tqdm==4.63.0 loguru==0.6.0 lightgbm==3.3.2 tornado==6.1 joblib==1.1.0 fire==0.4.0 ruamel.yaml==0.17.21

Qxxxx commented 2 years ago

Hi liujianliuku, I meet the same issue, but it disappeared after I run the same script again

Qxxxx commented 2 years ago

I assume there is some thread unsafe issue in workflow_by_code.py. I will also post my log msg here.

2022-03-20 11:57:23.194 | WARNING  | qlib.tests.data:qlib_data:150 - Data already exists: ~/.qlib/qlib_data/cn_data, the data download will be skipped
        If downloading is required: `exists_skip=False` or `change target_dir`
[10671:MainThread](2022-03-20 11:57:23,195) INFO - qlib.Initialization - [config.py:402] - default_conf: client.
[10671:MainThread](2022-03-20 11:57:23,201) INFO - qlib.Initialization - [__init__.py:73] - qlib successfully initialized based on client settings.
[10671:MainThread](2022-03-20 11:57:23,202) INFO - qlib.Initialization - [__init__.py:75] - data_path={'__DEFAULT_FREQ': PosixPath('/home/q/.qlib/qlib_data/cn_data')}
[10671:MainThread](2022-03-20 11:58:25,445) INFO - qlib.timer - [log.py:113] - Time cost: 59.355s | Loading data Done
[10671:MainThread](2022-03-20 11:58:33,478) INFO - qlib.timer - [log.py:113] - Time cost: 0.272s | DropnaLabel Done
/home/q/FundFusion/env/lib/python3.8/site-packages/pandas/core/frame.py:3641: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]
[10671:MainThread](2022-03-20 11:58:38,979) INFO - qlib.timer - [log.py:113] - Time cost: 5.500s | CSZScoreNorm Done
[10671:MainThread](2022-03-20 11:58:38,981) INFO - qlib.timer - [log.py:113] - Time cost: 13.535s | fit & process data Done
[10671:MainThread](2022-03-20 11:58:38,981) INFO - qlib.timer - [log.py:113] - Time cost: 72.892s | Init data Done
                           KMID      KLEN     KMID2       KUP      KUP2      KLOW     KLOW2      KSFT     KSFT2  ...   VSUMN20   VSUMN30   VSUMN60    VSUMD5   VSUMD10   VSUMD20   VSUMD30   VSUMD60    LABEL0
datetime   instrument                                                                                            ...
2008-01-02 SH600000    0.010377  0.061132  0.169754  0.028302  0.462962  0.022453  0.367285  0.004528  0.074077  ...  0.490807  0.485962  0.510337 -0.022349 -0.101278  0.018386  0.028076 -0.020675  0.042068
           SH600004    0.057279  0.059666  0.960000  0.002387  0.040000  0.000000  0.000000  0.054893  0.920001  ...  0.392360  0.423984  0.514932  0.633267  0.342211  0.215280  0.152031 -0.029863  0.000000
           SH600006    0.012673  0.040323  0.314285  0.008065  0.200000  0.019585  0.485715  0.024194  0.600000  ...  0.433405  0.494553  0.524930 -0.064064 -0.045677  0.133190  0.010894 -0.049861 -0.008830
           SH600007    0.066977  0.084186  0.795579  0.007907  0.093923  0.009302  0.110498  0.068372  0.812153  ...  0.291541  0.352613  0.420448  0.878055  0.514779  0.416918  0.294774  0.159104  0.007496
           SH600008    0.051163  0.082326  0.621469  0.027907  0.338984  0.003256  0.039547  0.026512  0.322033  ...  0.381852  0.400486  0.469882  0.482783  0.148572  0.236296  0.199027  0.060235 -0.015445

[5 rows x 159 columns]
[10671:MainThread](2022-03-20 11:58:45,679) INFO - qlib.workflow - [expm.py:318] - <mlflow.tracking.client.MlflowClient object at 0x7f9db78547f0>
[10671:MainThread](2022-03-20 11:58:45,708) WARNING - qlib.workflow - [expm.py:198] - No valid experiment found. Create a new experiment with name workflow.
[10671:MainThread](2022-03-20 11:58:45,712) INFO - qlib.workflow - [exp.py:257] - Experiment 1 starts running ...
[10671:MainThread](2022-03-20 11:58:46,068) INFO - qlib.workflow - [recorder.py:293] - Recorder 7efca793ef574b4ca1c0b25218f090d7 starts running under Experiment 1 ...
Training until validation scores don't improve for 50 rounds
[20]    train's l2: 0.99032     valid's l2: 0.99407
[40]    train's l2: 0.98654     valid's l2: 0.993434
[60]    train's l2: 0.983952    valid's l2: 0.993192
[80]    train's l2: 0.981825    valid's l2: 0.993024
[100]   train's l2: 0.97992     valid's l2: 0.992874
[120]   train's l2: 0.978043    valid's l2: 0.992838
[140]   train's l2: 0.976317    valid's l2: 0.992838
[160]   train's l2: 0.974702    valid's l2: 0.992849
[180]   train's l2: 0.97313     valid's l2: 0.992885
[200]   train's l2: 0.971546    valid's l2: 0.992959
Early stopping, best iteration is:
[168]   train's l2: 0.974048    valid's l2: 0.992791
[10671:MainThread](2022-03-20 11:59:55,049) INFO - qlib.workflow - [record_temp.py:194] - Signal record 'pred.pkl' has been saved as the artifact of the Experiment 1
'The following are prediction results of the LGBModel model.'
                          score
datetime   instrument
2017-01-03 SH600000   -0.036180
           SH600008    0.012532
           SH600009    0.036990
           SH600010   -0.020218
           SH600015   -0.129913
{'IC': 0.05148803800355824,
 'ICIR': 0.42031572365231823,
 'Rank IC': 0.05356542355384916,
 'Rank ICIR': 0.4463195692082111}
[10671:MainThread](2022-03-20 11:59:56,221) INFO - qlib.backtest caller - [__init__.py:82] - Create new exchange
[10671:MainThread](2022-03-20 12:00:07,081) WARNING - qlib.BaseExecutor - [executor.py:111] - `common_infra` is not set for <qlib.backtest.executor.SimulatorExecutor object at 0x7f9e4dcb6d90>
backtest loop:   0%|                                                                                                                                                                     | 0/871 [00:00<?, ?it/s]/home/q/FundFusion/env/lib/python3.8/site-packages/qlib/utils/index_data.py:480: RuntimeWarning: Mean of empty slice
  return np.nanmean(self.data)
backtest loop: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 871/871 [00:17<00:00, 49.72it/s]
[10671:MainThread](2022-03-20 12:00:25,565) INFO - qlib.workflow - [record_temp.py:499] - Portfolio analysis record 'port_analysis_1day.pkl' has been saved as the artifact of the Experiment 1
'The following are analysis results of benchmark return(1day).'
                       risk
mean               0.000477
std                0.012295
annualized_return  0.113561
information_ratio  0.598699
max_drawdown      -0.370479
'The following are analysis results of the excess return without cost(1day).'
                       risk
mean               0.000687
std                0.005483
annualized_return  0.163574
information_ratio  1.933691
max_drawdown      -0.097084
'The following are analysis results of the excess return with cost(1day).'
                       risk
mean               0.000492
std                0.005482
annualized_return  0.117084
information_ratio  1.384504
max_drawdown      -0.105894
[10671:MainThread](2022-03-20 12:00:25,584) INFO - qlib.workflow - [record_temp.py:524] - Indicator analysis record 'indicator_analysis_1day.pkl' has been saved as the artifact of the Experiment 1
'The following are analysis results of indicators(1day).'
     value
ffr    1.0
pa     0.0
pos    0.0
Exception in thread Thread-7:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/q/FundFusion/env/lib/python3.8/site-packages/qlib/utils/paral.py", line 88, in run
    data()
  File "/home/q/FundFusion/env/lib/python3.8/site-packages/qlib/workflow/recorder.py", line 373, in log_metrics
    self.client.log_metric(self.id, name, data, step=step)
  File "/home/q/FundFusion/env/lib/python3.8/site-packages/mlflow/tracking/client.py", line 690, in log_metric
    self._tracking_client.log_metric(run_id, key, value, timestamp, step)
  File "/home/q/FundFusion/env/lib/python3.8/site-packages/mlflow/tracking/_tracking_service/client.py", line 233, in log_metric
    self.store.log_metric(run_id, metric)
  File "/home/q/FundFusion/env/lib/python3.8/site-packages/mlflow/store/tracking/file_store.py", line 776, in log_metric
    run_info = self._get_run_info(run_id)
  File "/home/q/FundFusion/env/lib/python3.8/site-packages/mlflow/store/tracking/file_store.py", line 554, in _get_run_info
    run_info = self._get_run_info_from_dir(run_dir)
  File "/home/q/FundFusion/env/lib/python3.8/site-packages/mlflow/store/tracking/file_store.py", line 563, in _get_run_info_from_dir
    run_info = _read_persisted_run_info_dict(meta)
  File "/home/q/FundFusion/env/lib/python3.8/site-packages/mlflow/store/tracking/file_store.py", line 104, in _read_persisted_run_info_dict
    dict_copy = run_info_dict.copy()
AttributeError: 'NoneType' object has no attribute 'copy'
[10671:MainThread](2022-03-20 12:00:25,620) INFO - qlib.timer - [log.py:113] - Time cost: 0.005s | waiting `async_log` Done

And

Linux
x86_64
Linux-5.4.72-microsoft-standard-WSL2-x86_64-with-glibc2.29
#1 SMP Wed Oct 28 23:40:43 UTC 2020

Python version: 3.8.10 (default, Nov 26 2021, 20:14:08)  [GCC 9.3.0]

Qlib version: 0.8.4.99
numpy==1.21.4
pandas==1.3.4
scipy==1.7.2
requests==2.27.1
sacred==0.8.2
python-socketio==5.5.2
redis==4.1.4
python-redis-lock==3.7.0
schedule==1.1.0
cvxpy==1.2.0
hyperopt==0.1.2
fire==0.4.0
statsmodels==0.13.2
xlrd==2.0.1
plotly==5.6.0
matplotlib==3.5.0
tables==3.7.0
pyyaml==6.0
mlflow==1.24.0
tqdm==4.62.3
loguru==0.6.0
lightgbm==3.3.2
tornado==6.1
joblib==1.1.0
fire==0.4.0
ruamel.yaml==0.17.21
liujianliuku commented 2 years ago

Right, I can replay these issues when I run two processes of task_manager_rolling simultaneously, after hours running, it will fail like this.

SunsetWolf commented 2 years ago

I'm trying to replay this issue and I find that some times it does occur and some times it doesn't. Is there a stable way to replay this issue?

xmm1016 commented 2 years ago

I meet the same issue. According to the exception, I think it is related to async call by MLflow, which causes thread unsafe issue. But I don't know how to deal with this issue.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open for three months with no activity. Remove the stale label or comment on the issue otherwise this will be closed in 5 days

lerit commented 2 years ago

same issue,by debug, when this error,meta.yaml is empty,that cause this error. but after continue,meta.yaml will has value

pop0121 commented 2 years ago

This error keeps poping up and is quite annoying.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for three months with no activity. Remove the stale label or comment on the issue otherwise this will be closed in 5 days