Closed by sw33zy 1 year ago
I think the latest refactored version will no longer have those problems. You can try the preview here: https://github.com/ultmaster/nni/tree/nas-nn-refactor
Currently, a workaround is to decrease max_trial_number or increase the algorithm-specific budget (e.g., max_collect). Alternatively, set up a watchdog to kill the experiment when it gets stuck.
Thank you for the feedback!
And how exactly can I kill an experiment?
Right now what I have is:
    def start_experiment():
        exp = RetiariiExperiment(model_space, evaluator, [], search_strategy)
        exp_config = RetiariiExeConfig('local')
        exp_config.experiment_name = 'nas_nni_dnn'
        exp_config.trial_gpu_number = 1
        exp_config.training_service.use_active_gpu = True
        exp_config.max_experiment_duration = str(MAX_EXP_DURATION) + "s"
        exp_config.trial_concurrency = 1

        def get_status():
            while True:
                time.sleep(20)
                if exp.get_status() == 'DONE':
                    exp.stop()
                    break

        thread = threading.Thread(target=get_status)
        thread.start()
        exp.run(exp_config, 8081)
        thread.join()
        return exp
However, this isn't working because exp.stop() only pauses the experiment...
What you need is a way to interrupt the main thread.
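One way to do that from a watchdog thread is sketched below. This is only a minimal illustration, not NNI's own mechanism: run_with_watchdog, blocking_call, and is_done are hypothetical names, and the sketch assumes Python's default SIGINT handler is installed. With the snippet above, blocking_call would presumably be `lambda: exp.run(exp_config, 8081)` and is_done would be `lambda: exp.get_status() == 'DONE'`.

```python
import _thread
import threading


def run_with_watchdog(blocking_call, is_done, poll_interval=20.0):
    """Run blocking_call() in the main thread and interrupt it once is_done() is true.

    Hypothetical helper for illustration only; it is not part of NNI.
    Returns "completed" if the call returned by itself, "interrupted" otherwise.
    """
    finished = threading.Event()

    def watchdog():
        # Wake up every poll_interval seconds until the main call has returned.
        while not finished.wait(poll_interval):
            if is_done():
                # Simulates Ctrl+C: KeyboardInterrupt is raised in the main thread.
                _thread.interrupt_main()
                return

    thread = threading.Thread(target=watchdog, daemon=True)
    thread.start()
    try:
        blocking_call()
        return "completed"
    except KeyboardInterrupt:
        return "interrupted"
    finally:
        # Tell the watchdog to stop polling in either case.
        finished.set()
```

The interrupt only takes effect once the main thread gets back to executing Python bytecode, and it does not stop any worker threads that the blocking call may have started.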
That does fix the issue, except for the PolicyBasedRL strategy. It seems it is ignoring the SIGINT. Any ideas on how to avoid this using this threading approach?
Also, any idea when that refactored version might be released?
RL strategy creates new threads internally. That's why this approach is not working.
I think the new release will be available sometime between April and May.
I'm closing the issue since it is related to #5202, and there was a workaround for the most part.
I appreciate the help and the info!
Describe the issue: I'm trying to run NNI NAS to optimize a neural architecture, but I'm encountering an issue where the program gets stuck after all trials have been completed or the max duration has been reached (I have tried both). Specifically, the program appears to be running indefinitely without providing any further output or results. Note that the experiment reaches the DONE status. I've tried running the program with different configurations and strategies. PolicyBasedRL and RegularizedEvolution seem to be the ones where this issue arises. Is this related to #5202? Is there any workaround for this problem? I have tried creating a thread to stop the experiment after the max duration had been reached and although the experiment did end, the program kept running indefinitely.
Environment:
Configuration:
Log message:
How to reproduce it?:
My code is similar to the "Hello, NAS!" tutorial, but with different data, a different model space, and the PyTorch Lightning Regression evaluator. I can provide my code if needed.