Annika10 opened 2 years ago
The stopping problem for the evolutionary algorithm could perhaps be solved by tuning the `population_size`, the `sample_size`, and the number of `cycles` of the strategy, in combination with the max trial number of the experiment.
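For reference, here is a rule-of-thumb budget check (my own sketch, not an official NNI formula — the `population_size + cycles` estimate follows regularized evolution's design of evaluating an initial population and then one mutant per cycle):

```python
def evolution_trials(population_size: int, cycles: int) -> int:
    """Rough number of trials regularized evolution submits:
    the initial population, then one mutated model per cycle."""
    return population_size + cycles

def can_finish(population_size: int, cycles: int, max_trial_number: int) -> bool:
    """True if the experiment's trial budget covers the strategy's
    demand, so the search can terminate instead of idling forever."""
    return evolution_trials(population_size, cycles) <= max_trial_number

print(can_finish(20, 80, max_trial_number=100))  # True
print(can_finish(20, 80, max_trial_number=50))   # False: search would never get its budget
```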
Sorry for the late response.
You are right that evolution has its own control of population and cycles. If these are not set correctly, the experiment can run into situations where it never ends. It's a known issue, but we haven't come up with a good solution yet.
About RL, it's a tianshou compatibility issue. Please downgrade tianshou to v0.4.4 for now, or try installing NNI from source; the issue should already be fixed there.
Many thanks for your answer! Are there also restrictions on how `max_collect` and `trial_per_collect` have to be defined for RL so that the search does not run infinitely?
I think it should be similar to evolution: `max_collect` should be set close to the total budget.
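Assuming the RL strategy submits `trial_per_collect` trials per collect round for at most `max_collect` rounds (my reading of the parameter names, not a documented guarantee), the total comes out as their product, and `max_trial_number` should be at least that large:

```python
def rl_trials(max_collect: int, trial_per_collect: int) -> int:
    """Rough upper bound on trials the RL strategy submits:
    trial_per_collect trials in each of max_collect collect rounds."""
    return max_collect * trial_per_collect

# e.g. max_collect=2, trial_per_collect=1 needs max_trial_number >= 2
print(rl_trials(2, 1))  # 2
```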
Okay, but when I set, for example, `max_collect=2` and `trial_per_collect=1` in the RL strategy, and `max_trial_number=2` in the experiment, I get the following error, which is also mentioned in this issue:
```
Traceback (most recent call last):
  File "examples/nas/oneshot/spos/search.py", line 84, in <module>
    _main()
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "examples/nas/oneshot/spos/search.py", line 74, in _main
    exp.run(exp_config, port)
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/nni/retiarii/experiment/pytorch.py", line 296, in run
    self.start(port, debug)
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/nni/retiarii/experiment/pytorch.py", line 268, in start
    self._start_strategy()
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/nni/retiarii/experiment/pytorch.py", line 196, in _start_strategy
    self.strategy.run(base_model_ir, self.applied_mutators)
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/nni/retiarii/strategy/rl.py", line 76, in run
    result = collector.collect(n_episode=self.trial_per_collect)
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/tianshou/data/collector.py", line 237, in collect
    result = self.env.step(action_remap, ready_env_ids)  # type: ignore
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/tianshou/env/venvs.py", line 225, in step
    obs, rew, done, info = self.workers[j].get_result()
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/nni/retiarii/strategy/_rl_impl.py", line 46, in get_result
    return self.result.get()
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/nni/retiarii/strategy/_rl_impl.py", line 108, in step
    submit_models(model)
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/nni/retiarii/execution/api.py", line 44, in submit_models
    engine.submit_models(*models)
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/nni/retiarii/execution/base.py", line 64, in submit_models
    self._running_models[send_trial(data.dump())] = model
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/nni/retiarii/integration_api.py", line 35, in send_trial
    return get_advisor().send_trial(parameters, placement_constraint)
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/nni/retiarii/integration.py", line 124, in send_trial
    send(CommandType.NewTrialJob, json_dumps(new_trial))
  File "/home/annika/code/nni/venv/lib/python3.8/site-packages/nni/runtime/protocol.py", line 51, in send
    _out_file.write(msg)
ValueError: write to closed file
```
Hmm, that could be a problem with concurrency. Could you try setting `trial_concurrency` to 1? If my guess is right, it might be another issue that isn't easy to handle...
Setting `trial_concurrency` to 1 results in the same `write to closed file` error.
I tried the same example. I observe that the experiment hangs instead of raising the `ValueError: write to closed file`. I need to investigate this further.
I found that replacing `MultiThreadEnvWorker` with the dummy one works (in `nni/retiarii/strategy/rl.py`).
It will completely break the trial concurrency feature, but it could serve as a temporary fix.
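A toy model of the race this workaround sidesteps (purely illustrative — the `Channel` class below is hypothetical, not NNI's actual code): with a sequential, dummy-style worker every trial submission finishes before the channel to the dispatcher can be closed, whereas a concurrent worker may still be writing when the pipe shuts down, which matches the `ValueError: write to closed file` in the traceback.

```python
class Channel:
    """Hypothetical stand-in for the NNI command pipe."""
    def __init__(self):
        self.closed = False
        self.sent = []

    def send(self, msg):
        if self.closed:
            raise ValueError("write to closed file")
        self.sent.append(msg)

# Dummy-worker style: steps run one at a time, so every send
# completes before anything can close the channel.
ch = Channel()
for i in range(3):
    ch.send(f"trial-{i}")
ch.closed = True  # shutdown happens only after all writes
print(len(ch.sent))  # 3

# Concurrent style: a worker can reach send() after shutdown
# has already closed the pipe, reproducing the error above.
late = Channel()
late.closed = True
try:
    late.send("trial-late")
except ValueError as e:
    print(e)  # write to closed file
```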
Thanks, this works perfectly for me!
I want to try the multi_trial example with a search strategy other than random, but it causes me some trouble. For this I just removed the line
and replaced it with this for the evolutionary algorithm:
and with this for reinforcement learning:
But when running the evolutionary algorithm, I encounter the problem that the search never ends/stops. The output looks like this, but it never prints the exported models the way the multi_trial example does:
When using the reinforcement learning algorithm, I get the following error:
I also tried the current state of the master branch, because I would love to use the `model_filter` for the evolutionary algorithm, but that doesn't work because of the error mentioned in this issue.