microsoft / nni

An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License
14k stars 1.81k forks

Some search strategies get stuck #5415

Closed sw33zy closed 1 year ago

sw33zy commented 1 year ago

Describe the issue: I'm trying to run NNI NAS to optimize a neural architecture, but the program gets stuck after all trials have completed or the max duration has been reached (I have tried both). Specifically, the program runs indefinitely without producing any further output or results, even though the experiment reaches the DONE status.

I've tried running the program with different configurations and strategies; PolicyBasedRL and RegularizedEvolution seem to be the ones where this issue arises. Is this related to #5202? Is there any workaround for this problem? I have tried creating a thread to stop the experiment after the max duration was reached, and although the experiment did end, the program kept running indefinitely.

Environment:

Configuration:

{
  "params": {
    "experimentType": "nas",
    "trialCommand": "C:\\Users\\Leonardo\\anaconda3\\envs\\omnia\\python.exe -m nni.retiarii.trial_entry py",
    "trialCodeDirectory": "C:\\Users\\Leonardo\\Documents\\Universidade Leo\\5º ano\\tese\\Omnia\\omnia-deep-learning\\tests",
    "trialConcurrency": 1,
    "trialGpuNumber": 1,
    "maxExperimentDuration": "180s",
    "useAnnotation": false,
    "debug": false,
    "logLevel": "info",
    "experimentWorkingDirectory": "C:\\Users\\Leonardo\\nni-experiments",
    "trainingService": {
      "platform": "local",
      "trialCommand": "C:\\Users\\Leonardo\\anaconda3\\envs\\omnia\\python.exe -m nni.retiarii.trial_entry py",
      "trialCodeDirectory": "C:\\Users\\Leonardo\\Documents\\Universidade Leo\\5º ano\\tese\\Omnia\\omnia-deep-learning\\tests",
      "trialGpuNumber": 1,
      "debug": false,
      "useActiveGpu": true,
      "maxTrialNumberPerGpu": 1,
      "reuseMode": false
    },
    "executionEngine": {
      "name": "py"
    }
  },
  "execDuration": "3m 5s",
  "nextSequenceId": 100,
  "revision": 24
}
{
  "hidden_dim1": {
    "_type": "choice",
    "_value": [16, 32, 64, 128, 256]
  },
  "dropout1": {
    "_type": "choice",
    "_value": [0, 0.25, 0.5]
  },
  "model_1": {
    "_type": "choice",
    "_value": [0, 1]
  },
  "hidden_dim2": {
    "_type": "choice",
    "_value": [16, 32, 64, 128, 256]
  },
  "dropout2": {
    "_type": "choice",
    "_value": [0, 0.25, 0.5]
  },
  "model_2": {
    "_type": "choice",
    "_value": [0, 1]
  },
  "block_iter": {
    "_type": "choice",
    "_value": [1, 2, 3, 4, 5]
  },
  "block_values0": {
    "_type": "choice",
    "_value": [16, 32, 64, 128, 256]
  },
  "block_dropout0": {
    "_type": "choice",
    "_value": [0, 0.1, 0.2]
  },
  "output_dim": {
    "_type": "choice",
    "_value": [16, 32, 64, 128, 256]
  },
  "block_values1": {
    "_type": "choice",
    "_value": [16, 32, 64, 128, 256]
  },
  "block_dropout1": {
    "_type": "choice",
    "_value": [0, 0.1, 0.2]
  },
  "block_values2": {
    "_type": "choice",
    "_value": [16, 32, 64, 128, 256]
  },
  "block_dropout2": {
    "_type": "choice",
    "_value": [0, 0.1, 0.2]
  },
  "block_values3": {
    "_type": "choice",
    "_value": [16, 32, 64, 128, 256]
  },
  "block_dropout3": {
    "_type": "choice",
    "_value": [0, 0.1, 0.2]
  },
  "block_values4": {
    "_type": "choice",
    "_value": [16, 32, 64, 128, 256]
  },
  "block_dropout4": {
    "_type": "choice",
    "_value": [0, 0.1, 0.2]
  },
  "cell/op_1_0": {
    "_type": "choice",
    "_value": ["0", "1"]
  },
  "cell/op_2_0": {
    "_type": "choice",
    "_value": ["0", "1"]
  },
  "cell/op_3_0": {
    "_type": "choice",
    "_value": ["0", "1"]
  },
  "cell/op_4_0": {
    "_type": "choice",
    "_value": ["0", "1"]
  },
  "cell/input_1_0": {
    "_type": "choice",
    "_value": [0]
  },
  "cell/input_2_0": {
    "_type": "choice",
    "_value": [0, 1]
  },
  "cell/input_3_0": {
    "_type": "choice",
    "_value": [0, 1, 2]
  },
  "cell/input_4_0": {
    "_type": "choice",
    "_value": [0, 1, 2, 3]
  }
}

Log message:

How to reproduce it?:

My code is similar to the Hello, NAS! tutorial, with different data, a different model space, and the PyTorch Lightning Regression evaluator. I can provide my code if needed.

matluster commented 1 year ago

I think the latest refactored version will no longer have those problems. You can try the preview here: https://github.com/ultmaster/nni/tree/nas-nn-refactor

Currently a workaround is to decrease max_trial_number or increase the algo-specific budget (e.g., max_collect), or to set up a watchdog to kill the experiment when it gets stuck.
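For instance, a sketch of the budget workaround (the value below is a made-up example, and the exact constructor signature may differ across NNI versions, so treat this as an illustration rather than a recipe):

```python
import nni.retiarii.strategy as strategy

# Hypothetical budget value: pick max_collect small enough that the
# strategy's internal loop finishes before max_experiment_duration.
search_strategy = strategy.PolicyBasedRL(max_collect=20)
```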

sw33zy commented 1 year ago

Thank you for the feedback!

And how exactly can I kill an experiment?

Right now what I have is:

import threading
import time

from nni.retiarii.experiment.pytorch import RetiariiExperiment, RetiariiExeConfig

def start_experiment():
    exp = RetiariiExperiment(model_space, evaluator, [], search_strategy)
    exp_config = RetiariiExeConfig('local')
    exp_config.experiment_name = 'nas_nni_dnn'
    exp_config.trial_gpu_number = 1
    exp_config.training_service.use_active_gpu = True

    exp_config.max_experiment_duration = str(MAX_EXP_DURATION) + "s"
    exp_config.trial_concurrency = 1

    def get_status():
        # Poll until the experiment reports DONE, then try to stop it.
        while True:
            time.sleep(20)
            if exp.get_status() == 'DONE':
                exp.stop()
                break

    thread = threading.Thread(target=get_status)
    thread.start()

    exp.run(exp_config, 8081)
    thread.join()
    return exp

However, this isn't working because exp.stop() only pauses the experiment...

ultmaster commented 1 year ago

What you need is a way to interrupt the main thread.
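A minimal, self-contained sketch of that idea, using the standard library's `_thread.interrupt_main()` to raise `KeyboardInterrupt` in the main thread (the `time.sleep(10)` below is just a stand-in for the blocking `exp.run(exp_config, 8081)` call):

```python
import _thread
import threading
import time

def watchdog(deadline_seconds):
    # After the deadline, raise KeyboardInterrupt in the main thread,
    # which unblocks whatever call the main thread is stuck in.
    time.sleep(deadline_seconds)
    _thread.interrupt_main()

threading.Thread(target=watchdog, args=(1,), daemon=True).start()

try:
    time.sleep(10)  # stand-in for the blocking exp.run(...)
    outcome = "ran to completion"
except KeyboardInterrupt:
    outcome = "interrupted by watchdog"

print(outcome)
```

The daemon flag matters: a daemon watchdog thread will not itself keep the process alive once the main thread exits.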

sw33zy commented 1 year ago

That does fix the issue, except for the PolicyBasedRL strategy, which seems to be ignoring the SIGINT. Any ideas on how to avoid this while keeping the threading approach?

Also, any idea when that refactored version might be released?

ultmaster commented 1 year ago

RL strategy creates new threads internally. That's why this approach is not working.
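When internal threads swallow the interrupt, one escape hatch is to run the experiment in a separate process and terminate the whole process on timeout, which kills its internal threads along with it. A sketch, where the inline `python -c` command is a stand-in for a script that calls `exp.run(...)`:

```python
import subprocess
import sys

# Launch the experiment as a child process (stand-in command shown).
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
try:
    child.wait(timeout=2)  # give the experiment its time budget
except subprocess.TimeoutExpired:
    child.terminate()      # kills the whole process, internal threads included
    child.wait()

print("child exit code:", child.returncode)
```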

I think the new release will be available sometime between April and May.

sw33zy commented 1 year ago

I'm closing the issue since it is related to #5202, and there was a workaround for the most part.

I appreciate the help and the info!