microsoft / nni

An open source AutoML toolkit to automate the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License

Could not view any information on Trial detail. #5354

Open Nafees-060 opened 1 year ago

Nafees-060 commented 1 year ago

Describe the issue: I am using NNI for hyperparameter optimization. In the Overview tab, I can see all the information, such as Duration. Strangely, though, the Trial numbers section has not updated at all, even though my experiment has been running for the last sixteen hours. Second, the Trial detail tab is still blank. Moreover, in dispatcher.log I can see the following error:

```
[2023-02-14 19:57:14] INFO (nni.tuner.tpe/MainThread) Using random seed 2064954602
[2023-02-14 19:57:14] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher started
[2023-02-14 19:57:14] ERROR (nni.runtime.msg_dispatcher_base/Thread-1) '_type'
Traceback (most recent call last):
  File "/home/anafees/.local/lib/python3.9/site-packages/nni/runtime/msg_dispatcher_base.py", line 108, in command_queue_worker
    self.process_command(command, data)
  File "/home/anafees/.local/lib/python3.9/site-packages/nni/runtime/msg_dispatcher_base.py", line 154, in process_command
    command_handlers[command](data)
  File "/home/anafees/.local/lib/python3.9/site-packages/nni/runtime/msg_dispatcher.py", line 90, in handle_initialize
    self.tuner.update_search_space(data)
  File "/home/anafees/.local/lib/python3.9/site-packages/nni/algorithms/hpo/tpe_tuner.py", line 169, in update_search_space
    self.space = format_search_space(space)
  File "/home/anafees/.local/lib/python3.9/site-packages/nni/common/hpo_utils/formatting.py", line 99, in format_search_space
    formatted = _format_search_space(tuple(), search_space)
  File "/home/anafees/.local/lib/python3.9/site-packages/nni/common/hpo_utils/formatting.py", line 177, in _format_search_space
    formatted.append(_format_parameter(key, spec['_type'], spec['_value']))
KeyError: '_type'
```

I am very new to using NNI, and I am not sure whether I should ask these questions or not. Thanks for your help.

Environment:

experimentName: Abc_D # An optional name to distinguish the experiments
searchSpaceFile: search_space.yaml # Specify the Search Space file path
useAnnotation: false # If it is true, searchSpaceFile will be ignore. default: false
trialCommand: python3.9 main.py # NOTE: change "python3" to "python" if you are using Windows
trialCodeDirectory: . # Specify the Trial file path
trialGpuNumber: 1 # Each trial needs 1 gpu
trialConcurrency: 30 # Run 30 trials concurrently
maxExperimentDuration: 24h # Stop generating all trials after 24 hour
maxTrialNumber: 1000 # Generate at most 1000 trials
tuner: # Configure the tuning algorithm
  name: TPE
  classArgs: # Algorithm specific arguments
    optimize_mode: maximize # maximize or minimize the needed metrics
trainingService: # Configure the training platform
  platform: local # Include local, remote, pai, etc.
  gpuIndices: 0, 1, 2 # The gpu-id 2 and 3 will be used
  useActiveGpu: True # Whether to use the gpu that has been used by other processes.
  maxTrialNumberPerGpu: 10 # Default: 1. Specify how many trials can share one GPU.

**Search space:**

searchSpace:
  batch_size:
    _type: choice
    _value: [20, 40, 60]
  lr:
    _type: choice
    _value: [0.001, 0.000001]
  first_dim:
    _type: choice
    _value: [64, 128, 256]
  last_dim:
    _type: choice
    _value: [16, 32, 64, 128]
  epochs:
    _type: choice
    _value: [80]
  dropout_prob:
    _type: uniform
    _value: [0.5, 0.7]

Nafees-060 commented 1 year ago

@liuzhe-lz can you please check this issue posted above? Waiting for your response. Thanks

liuzhe-lz commented 1 year ago

If the search space is in a separate file, the first line `searchSpace:` should not be there. Please remove it and try again.
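To see why the extra first line produces `KeyError: '_type'`, here is a minimal sketch. It assumes, based on the traceback above, that the formatter reads `_type` and `_value` from every top-level entry of the search space. The helper `find_malformed_entries` is hypothetical and not part of NNI, and the dictionaries stand in for the parsed YAML file (shortened to two parameters).

```python
# Hypothetical helper (not an NNI API): report top-level entries that a
# formatter expecting spec['_type'] and spec['_value'] would choke on.
def find_malformed_entries(search_space):
    """Return the names of top-level entries lacking '_type' or '_value'."""
    bad = []
    for name, spec in search_space.items():
        if not isinstance(spec, dict) or "_type" not in spec or "_value" not in spec:
            bad.append(name)
    return bad


# What search_space.yaml parses to when it starts with "searchSpace:".
# The only top-level key is "searchSpace", and its value has no "_type".
wrapped = {
    "searchSpace": {
        "batch_size": {"_type": "choice", "_value": [20, 40, 60]},
        "lr": {"_type": "choice", "_value": [0.001, 0.000001]},
    }
}

# The same content with the wrapper line removed.
flat = wrapped["searchSpace"]

print(find_malformed_entries(wrapped))  # ['searchSpace']
print(find_malformed_entries(flat))     # []
```

With the wrapper removed, every top-level key is a parameter name whose value carries both `_type` and `_value`, which is the shape the traceback suggests the tuner expects.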

Nafees-060 commented 1 year ago

@liuzhe-lz Oh, thanks a lot. It seems to be working now. However, in the NNIManager log I can see some warnings:

[2023-02-16 13:56:08] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:56:13] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:56:18] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:56:23] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:56:28] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:56:33] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:56:38] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:56:43] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:56:48] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:56:53] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:56:58] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:57:03] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:57:08] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:57:13] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:57:18] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:57:23] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:57:28] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:57:33] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:57:38] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:57:43] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:57:48] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:57:53] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:57:58] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:58:03] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:58:08] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:58:13] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:58:18] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:58:23] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:58:28] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:58:33] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:58:38] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:58:43] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:58:48] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:58:53] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:58:58] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:59:03] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:59:08] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:59:13] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:59:18] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:59:23] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:59:28] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:59:33] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:59:38] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:59:43] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:59:48] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-02-16 13:59:53] WARNING (GPUScheduler) gpu_metrics file does not exist!

liuzhe-lz commented 1 year ago

Please set trialGpuNumber to 0 if you don't use GPU scheduling.

Nafees-060 commented 1 year ago

@liuzhe-lz If scheduling means running my program on GPUs, then yes, I do want it: I have a machine with three GPUs and would like to run NNI HPO on all three. As you can see in the experiment configuration, I specified three GPU indices (gpuIndices: 0, 1, 2), which led me to believe that my program would be executed in parallel on the three GPUs.
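As a back-of-envelope check of that expectation, the sketch below works out how many trials the posted configuration would allow at once. It rests on an assumption about how the settings combine (each GPU offers `maxTrialNumberPerGpu` slots and each trial takes `trialGpuNumber` of them); it is not a description of NNI's actual scheduler. The variable names are illustrative Python spellings of the config keys.

```python
# Values copied from the experiment configuration posted in this issue.
gpu_indices = [0, 1, 2]          # gpuIndices
max_trial_number_per_gpu = 10    # maxTrialNumberPerGpu
trial_gpu_number = 1             # trialGpuNumber
trial_concurrency = 30           # trialConcurrency

# Assumption: every listed GPU exposes max_trial_number_per_gpu slots,
# and one trial consumes trial_gpu_number slots.
total_slots = len(gpu_indices) * max_trial_number_per_gpu
max_parallel_trials = total_slots // trial_gpu_number

# The number of trials actually running is capped by both limits.
effective_parallel_trials = min(trial_concurrency, max_parallel_trials)

print(max_parallel_trials)        # 30
print(effective_parallel_trials)  # 30
```

Under that assumption the configuration is internally consistent: 30 concurrent trials fit on three GPUs shared ten ways, so capacity alone would not explain trials failing to start.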

Nafees-060 commented 1 year ago

@liuzhe-lz Second, all my trial jobs have been in WAITING status for the last 20 hours. I am not sure what the problem is or why I have not received results for even a single job. I know you must be busy, but could you please reply soon?

liuzhe-lz commented 1 year ago

For an unknown reason, the GPU scheduler is misbehaving. It seems to be a bug. The scheduler will be rewritten in an upcoming release.

Trials are WAITING because the scheduler is not working and cannot provide an idle GPU.