[Revise] improve parameter estimation performance with shorter timeout_iteration (and constant liar optionally)

lisphilar commented 3 years ago

Summary of this new feature

Improve performance (estimation score and runtime) of parameter estimation with the following solutions.

Improve estimation score with constant liar: Optuna provides a new constant_liar option for TPESampler as of version 2.8.0. The Constant Liar heuristic reduces search effort by avoiding trials that try similar parameter sets. Please refer to the detailed explanations and discussions in the Optuna version 2.8.0 release notes. It would be great for CovsirPhy users to use constant_liar=True if Optuna version 2.8.0 is available in our environments.
Improve runtime with shorter timeout_iteration: At version 2.20.3, timeout_iteration=5 is the default value of Scenario.estimate(). The estimation score (RMSLE by default) is calculated every five seconds, and when the score has not changed for tail_n=4 iterations, estimation is stopped and the best parameter set is returned. However, in my tests, timeout_iteration appears to be a bottleneck: many phases run for 5 seconds (i.e. when timeout_iteration is shorter, runtime may be shorter).
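
The stopping rule described above can be sketched as follows. This is a minimal illustration under one reading of the rule (stop when the last tail_n scores are identical), not CovsirPhy's actual implementation. `run_for` is a hypothetical callable standing in for one optimization round, and `max_rounds` is an assumption added only so that the loop always terminates.

```python
def estimate_with_early_stop(run_for, timeout_iteration=5, tail_n=4, max_rounds=100):
    """Repeat optimization rounds until the best score stops changing.

    run_for(seconds) is a hypothetical callable: it continues the search for
    `seconds` seconds and returns the best score found so far.
    Returns the final score and the time budget spent in seconds.
    """
    scores = []
    for _ in range(max_rounds):
        scores.append(run_for(timeout_iteration))
        # Stop when the last tail_n scores are all identical
        if len(scores) >= tail_n and len(set(scores[-tail_n:])) == 1:
            break
    return scores[-1], len(scores) * timeout_iteration
```

In this sketch the time spent is always a whole number of timeout_iteration-second rounds, so the round length sets the granularity of the runtime of each phase.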

Note regarding constant liar: the constant_liar argument cannot be used with Optuna version 2.7.0 or older. https://gist.github.com/lisphilar/6440b5d69c4984bb0b34ede8c8ebcca3

A TypeError means that Optuna version 2.7.0 or older is in use. When covsirphy gets a TypeError with the constant_liar argument, it should remove the argument and retry creating TPESampler.
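
The retry described above might look like the sketch below. The function name `create_sampler` is hypothetical and this is not the code merged into CovsirPhy. The sampler class is passed in as an argument (it is expected to be optuna.samplers.TPESampler) so that the sketch does not depend on the installed Optuna version.

```python
def create_sampler(sampler_cls, constant_liar=False, **kwargs):
    """Create a sampler, dropping constant_liar if the class rejects it.

    sampler_cls is expected to be optuna.samplers.TPESampler (assumption:
    the caller imports Optuna and passes the class in).
    """
    try:
        return sampler_cls(constant_liar=constant_liar, **kwargs)
    except TypeError:
        # Optuna 2.7.0 or older: the keyword is unknown, so retry without it
        return sampler_cls(**kwargs)
```

With Optuna installed, this would be called as create_sampler(optuna.samplers.TPESampler, constant_liar=True). A TypeError caused by some other invalid argument is raised again by the retry, so it is not silently hidden.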

lisphilar commented 3 years ago

With CovsirPhy version 2.3.0, Italy data (as of 18Jun2021), example/scenario_analysis.py and 8 CPUs in my local environment, parameter estimation completed with RMSLE=0.07595 in 2 min 22 sec.

(Please ignore accuracy of the last phase of Forecast scenario because this is a forecasted future phase.)

[Figure: ita_14_history_Infected]

Update: RMSLE score was fixed. 0.0795 -> 0.07595
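
For readers unfamiliar with the metric, the standard definition of RMSLE (root mean squared logarithmic error) can be written in a few lines. This is a plain-Python sketch of the textbook formula. How CovsirPhy aggregates the score over variables and phases is not shown in this thread and is not covered here.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error of two equal-length lists."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # log1p(x) = log(1 + x), so zero counts are handled without error
    squared = [(math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))
```

Because the error is taken on a logarithmic scale, the score reflects relative rather than absolute differences between observed and estimated values.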

lisphilar commented 3 years ago

I compared the performance while changing constant_liar and timeout_iteration, using Italy data as of 18Jun2021, my local environment and CovsirPhy version 2.20.3-theta. I used only 1 CPU with n_jobs=1 to get robust runtime values as the total over all phases. Parameter estimation of each phase was done sequentially. The code is as follows.

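import covsirphy as cs
# Load the number of cases (JHU style)
loader = cs.DataLoader()
jhu_data = loader.jhu()
# Register the records for Italy and split them into phases
snl = cs.Scenario(country="Italy")
snl.register(jhu_data)
snl.trend()
# Estimate the parameters of each phase sequentially on one CPU
snl.estimate(cs.SIRF, n_jobs=1)
print(f"RMSLE: {snl.score(metric='RMSLE')}")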

Results are here.

RMSLE (runtime)	constant_liar=False	constant_liar=True
timeout_iteration=5	0.06810 (13 min 22 sec)	0.06868 (17 min 42 sec)
timeout_iteration=4	0.06812 (14 min 03 sec)	0.06869 (14 min 07 sec)
timeout_iteration=3	0.06808 (10 min 10 sec)	0.06871 (10 min 31 sec)
timeout_iteration=2	0.06811 (07 min 55 sec)	0.06865 (07 min 11 sec)
timeout_iteration=1	0.06806 (03 min 21 sec)	0.06901 (03 min 53 sec)

I expected that constant_liar=True with timeout_iteration=1 would show the best performance, but these results indicate that constant_liar=False with timeout_iteration=1 performs best. I will create a pull request for constant_liar=False and timeout_iteration=1. These default values may be changed later if we get different results with the other countries' data.
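
The runtime gain in the table above can be quantified with a little arithmetic. The snippet below only restates the constant_liar=False column in seconds and divides. The names `to_seconds`, `runtime_sec` and `speedup` are introduced here for illustration.

```python
def to_seconds(minutes, seconds):
    """Convert a "min sec" runtime into seconds."""
    return minutes * 60 + seconds


# Runtimes with constant_liar=False, keyed by timeout_iteration
runtime_sec = {
    5: to_seconds(13, 22),
    4: to_seconds(14, 3),
    3: to_seconds(10, 10),
    2: to_seconds(7, 55),
    1: to_seconds(3, 21),
}

# Speed-up relative to the previous default, timeout_iteration=5
speedup = {k: runtime_sec[5] / v for k, v in runtime_sec.items()}
print(round(speedup[1], 2))  # 3.99
```

In this single-CPU test the total runtime drops roughly fourfold with timeout_iteration=1, while RMSLE stays nearly the same (0.06810 vs 0.06806).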

lisphilar commented 3 years ago

With #833,

Use Scenario.estimate(<model>, timeout_iteration=1) as default.
Use constant_liar=False explicitly.

Later, I will add constant_liar=False as an argument of Scenario.estimate(), if necessary.

lisphilar commented 3 years ago

With #835, users can select whether to use constant liar with Scenario.estimate(<model>, constant_liar=False) (default).

lisphilar commented 3 years ago

I compared RMSLE scores and runtime of constant_liar=False (default at this time) and constant_liar=True with some countries' datasets. I used example/scenario_analysis.py with 8 CPUs.

Results are here.

iso3	Country	constant_liar=False	constant_liar=True	Better RMSLE	Better runtime	Winner
ita	Italy	0.07642 (27 sec)	0.07686 (29 sec)	FALSE	FALSE	FALSE
jpn	Japan	0.06103 (39 sec)	0.06200 (44 sec)	FALSE	FALSE	FALSE
grc	Greece	0.05472 (37 sec)	0.05107 (44 sec)	TRUE	FALSE	NA
nld	Netherlands	0.03719 (37 sec)	0.03706 (28 sec)	TRUE	TRUE	TRUE
usa	USA	0.23073 (33 sec)	0.24186 (22 sec)	FALSE	TRUE	NA
ind	India	0.21665 (36 sec)	0.21871 (50 sec)	FALSE	FALSE	FALSE
bra	Brazil	0.06754 (53 sec)	0.06634 (63 sec)	TRUE	FALSE	NA
rus	Russia	0.61374 (38 sec)	0.61293 (28 sec)	TRUE	TRUE	TRUE

Because there was no significant difference, we continue to use constant_liar=False as default. For the Netherlands and Russia, it will be better to use Scenario.estimate(cs.SIRF, constant_liar=True).
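
The last three columns of the table above can be reproduced from the raw values. The sketch below assumes the following reading of the columns: TRUE means that constant_liar=True gave the lower RMSLE or the shorter runtime, and Winner is NA (None in the code) when the two comparisons disagree. The helper name `compare` is hypothetical.

```python
# (RMSLE, runtime in seconds) for constant_liar=False and constant_liar=True
results = {
    "ita": ((0.07642, 27), (0.07686, 29)),
    "jpn": ((0.06103, 39), (0.06200, 44)),
    "grc": ((0.05472, 37), (0.05107, 44)),
    "nld": ((0.03719, 37), (0.03706, 28)),
    "usa": ((0.23073, 33), (0.24186, 22)),
    "ind": ((0.21665, 36), (0.21871, 50)),
    "bra": ((0.06754, 53), (0.06634, 63)),
    "rus": ((0.61374, 38), (0.61293, 28)),
}


def compare(false_result, true_result):
    """Return (better RMSLE, better runtime, winner) for constant_liar=True."""
    better_rmsle = true_result[0] < false_result[0]
    better_runtime = true_result[1] < false_result[1]
    # The winner is defined only when both comparisons agree (None = NA)
    winner = better_rmsle if better_rmsle == better_runtime else None
    return better_rmsle, better_runtime, winner


summary = {iso3: compare(*pair) for iso3, pair in results.items()}
true_wins = [iso3 for iso3, (_, _, winner) in summary.items() if winner is True]
```

Under this reading, constant_liar=True wins on both criteria only for nld and rus, which matches the two countries named above.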

lisphilar commented 3 years ago

Runtime of parameter estimation will be much shorter with timeout_iteration=1 (default). The version 2.21.0 release was planned for Jul2021, but it should be moved up to Jun2021: tomorrow or within a few days.

lisphilar / covid19-sir
