Annika10 opened 2 years ago
The stopping problem for the evolutionary algorithm could perhaps be solved by tuning the `population_size`, the `sample_size`, and the number of `cycles` of the strategy, in combination with the max trial number of the experiment.
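For reference, here is a rule-of-thumb budget check (my own sketch, not an official NNI formula — the `population_size + cycles` estimate follows regularized evolution's design of evaluating an initial population and then one mutant per cycle):

```python
def evolution_trials(population_size: int, cycles: int) -> int:
    """Rough number of trials regularized evolution submits:
    the initial population, then one mutated model per cycle."""
    return population_size + cycles

def can_finish(population_size: int, cycles: int, max_trial_number: int) -> bool:
    """True if the experiment's trial budget covers the strategy's
    demand, so the search can terminate instead of idling forever."""
    return evolution_trials(population_size, cycles) <= max_trial_number

print(can_finish(20, 80, max_trial_number=100))  # True
print(can_finish(20, 80, max_trial_number=50))   # False: search would never get its budget
```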
Sorry for the late response.
You are right that evolution has its own control of population and cycles. If these are not set correctly, the experiment can run into situations where it never ends. It's a known issue, but we haven't come up with a good solution yet.
About RL, it's a tianshou compatibility issue. Please downgrade tianshou to v0.4.4 for now, or try installing NNI from source; the issue should already be fixed there.
Many thanks for your answer! Are there also restrictions on how `max_collect` and `trial_per_collect` have to be defined for RL so that the search does not run infinitely?
I think it should be similar to evolution: `max_collect` should be set close to the total budget.
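Assuming the RL strategy submits `trial_per_collect` trials per collect round for at most `max_collect` rounds (my reading of the parameter names, not a documented guarantee), the total comes out as their product, and `max_trial_number` should be at least that large:

```python
def rl_trials(max_collect: int, trial_per_collect: int) -> int:
    """Rough upper bound on trials the RL strategy submits:
    trial_per_collect trials in each of max_collect collect rounds."""
    return max_collect * trial_per_collect

# e.g. max_collect=2, trial_per_collect=1 needs max_trial_number >= 2
print(rl_trials(2, 1))  # 2
```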
Okay, but when I set, for example, `max_collect=2` and `trial_per_collect=1` in the RL strategy, and `max_trial_number=2` in the experiment, I get the following error, which is also mentioned in this issue:
```
Traceback (most recent call last):
  File "examples/nas/oneshot/spos/search.py", line 84, in <module>
    _main()
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "examples/nas/oneshot/spos/search.py", line 74, in _main
    exp.run(exp_config, port)
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/nni/retiarii/experiment/pytorch.py", line 296, in run
    self.start(port, debug)
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/nni/retiarii/experiment/pytorch.py", line 268, in start
    self._start_strategy()
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/nni/retiarii/experiment/pytorch.py", line 196, in _start_strategy
    self.strategy.run(base_model_ir, self.applied_mutators)
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/nni/retiarii/strategy/rl.py", line 76, in run
    result = collector.collect(n_episode=self.trial_per_collect)
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/tianshou/data/collector.py", line 237, in collect
    result = self.env.step(action_remap, ready_env_ids)  # type: ignore
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/tianshou/env/venvs.py", line 225, in step
    obs, rew, done, info = self.workers[j].get_result()
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/nni/retiarii/strategy/_rl_impl.py", line 46, in get_result
    return self.result.get()
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/nni/retiarii/strategy/_rl_impl.py", line 108, in step
    submit_models(model)
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/nni/retiarii/execution/api.py", line 44, in submit_models
    engine.submit_models(*models)
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/nni/retiarii/execution/base.py", line 64, in submit_models
    self._running_models[send_trial(data.dump())] = model
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/nni/retiarii/integration_api.py", line 35, in send_trial
    return get_advisor().send_trial(parameters, placement_constraint)
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/nni/retiarii/integration.py", line 124, in send_trial
    send(CommandType.NewTrialJob, json_dumps(new_trial))
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/nni/runtime/protocol.py", line 51, in send
    _out_file.write(msg)
ValueError: write to closed file
```
Hmm, that could be a problem with concurrency. Could you try setting `trial_concurrency` to 1? If my guess is right, it might be another issue that isn't easy to handle...
Setting `trial_concurrency` to 1 results in the same `write to closed file` error.
I tried the same example. I observe that the experiment hangs instead of raising the `ValueError: write to closed file`. I need to investigate this further.
I found that replacing `MultiThreadEnvWorker` with the dummy one works (in `nni/retiarii/strategy/rl.py`).
It will completely break the trial concurrency feature, but it could serve as a temporary fix.
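A toy model of the race this workaround sidesteps (purely illustrative — the `Channel` class below is hypothetical, not NNI's actual code): with a sequential, dummy-style worker every trial submission finishes before the channel to the dispatcher can be closed, whereas a concurrent worker may still be writing when the pipe shuts down, which matches the `ValueError: write to closed file` in the traceback.

```python
class Channel:
    """Hypothetical stand-in for the NNI command pipe."""
    def __init__(self):
        self.closed = False
        self.sent = []

    def send(self, msg):
        if self.closed:
            raise ValueError("write to closed file")
        self.sent.append(msg)

# Dummy-worker style: steps run one at a time, so every send
# completes before anything can close the channel.
ch = Channel()
for i in range(3):
    ch.send(f"trial-{i}")
ch.closed = True  # shutdown happens only after all writes
print(len(ch.sent))  # 3

# Concurrent style: a worker can reach send() after shutdown
# has already closed the pipe, reproducing the error above.
late = Channel()
late.closed = True
try:
    late.send("trial-late")
except ValueError as e:
    print(e)  # write to closed file
```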
Thanks, this works perfectly for me!
I want to try the multi_trial example with a search strategy other than random, but it causes me some trouble. For this I just removed the line
and replaced it with this for the evolutionary algorithm:
and with this for reinforcement learning:
But when running the evolutionary algorithm, I encounter the problem that the search never ends/stops. The output looks like this, but it never prints the exported models the way the multi_trial example does:
When using the reinforcement learning algorithm, I get the following error:
I also tried the current state of the master branch, because I would love to use the `model_filter` for the evolutionary algorithm, but that doesn't work because of the error mentioned in this issue.