microsoft / nni

An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License
14.04k stars 1.81k forks

WARNING [ 'gpu_metrics file does not exist!' ] #2587

Closed: kaijieshi7 closed this issue 3 years ago

kaijieshi7 commented 4 years ago

Environment: Ubuntu 16.04

Log message:

What issue did you meet, and what did you expect?:

How to reproduce it?:

Additional information:

SparkSnail commented 4 years ago

Hi @kaijieshi7, could you please check whether there is a metric file under /tmp/{.username}/nni/script? Also run ps -ef | grep nni to check whether an nni_gpu_metric backend process is running. Kill all NNI-related backend processes, then restart the experiment.
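
The two checks suggested above can also be scripted. The following is a minimal sketch, assuming a Linux machine and the directory layout named in this comment; metrics_dir and filter_nni_processes are hypothetical helper names, not part of NNI.

```python
import getpass
from pathlib import Path


def metrics_dir(username=None, tmp_root="/tmp"):
    """Build the directory named in this thread: <tmp_root>/<username>/nni/script.

    Assumption: the layout is taken from the comment above, not from NNI's source.
    """
    return Path(tmp_root) / (username or getpass.getuser()) / "nni" / "script"


def filter_nni_processes(ps_output):
    """Keep the lines of `ps -ef` output that mention nni (like `grep nni`)."""
    return [line for line in ps_output.splitlines() if "nni" in line]
```

To use it, list the folder returned by metrics_dir() if it exists, and pass the text printed by ps -ef (for example, captured with subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout) to filter_nni_processes.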

kaijieshi7 commented 4 years ago

Where is '/tmp/{.username}/nni/script'? I can't find it.

SparkSnail commented 4 years ago

{username} means your own username on the system, so the path is, for example, /tmp/matrix/nni/script

kaijieshi7 commented 4 years ago

Do I need to run some code first? I haven't run any NNI-related code, and I can't find '/tmp/matrix/nni/script'.

SparkSnail commented 4 years ago

Hi @kaijieshi7, when NNI starts an experiment, it writes GPU metrics to the file /tmp/matrix/nni/scripts/gpu_metrics. Since your log shows a "gpu_metrics file does not exist" warning, I wonder whether this file is being created correctly. You can use

METRIC_OUTPUT_DIR=./ python3 -m nni_gpu_tool.gpu_metrics_collector

script to test generating the gpu_metrics file on your machine.

kaijieshi7 commented 4 years ago

I ran 'METRIC_OUTPUT_DIR=./ python3 -m nni_gpu_tool.gpu_metrics_collector' [Screenshot from 2020-06-26 13-54-32]. When I run other people's code, I can find the '/tmp/matrix/nni/scripts/gpu_metrics' file, but the GPU still doesn't work, just as in my first post.

SparkSnail commented 4 years ago

Hi @kaijieshi7, NNI will not use GPUs that already have another process running on them. You can set the useActiveGpu option to let NNI use these busy GPUs, and set maxTrialNumPerGpu to specify how many trial jobs may run on one GPU. See https://github.com/microsoft/nni/blob/master/docs/en_US/Tutorial/ExperimentConfig.md#useactivegpu and https://github.com/microsoft/nni/blob/master/examples/trials/cifar10_pytorch/config.yml#L22
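
The rule described in this comment can be illustrated with a small sketch. This is not NNI's scheduler code: usable_gpus and trial_slots are hypothetical functions that only mirror the behaviour described here, and treating a GPU as busy when activeProcessNum is greater than 0 is an assumption. The index and activeProcessNum fields are the ones that appear in the gpu_metrics output posted later in this thread.

```python
def usable_gpus(gpu_infos, use_active_gpu=False):
    """Return the indices of GPUs a trial may be placed on.

    Assumption: a GPU is "busy" when its activeProcessNum is greater than 0.
    With use_active_gpu=False busy GPUs are skipped; with True they are kept.
    """
    return [
        gpu["index"]
        for gpu in gpu_infos
        if use_active_gpu or gpu["activeProcessNum"] == 0
    ]


def trial_slots(gpu_infos, use_active_gpu=False, max_trial_num_per_gpu=1):
    """Count how many trials fit if each usable GPU takes max_trial_num_per_gpu."""
    if max_trial_num_per_gpu < 1:
        raise ValueError("max_trial_num_per_gpu must be at least 1")
    return len(usable_gpus(gpu_infos, use_active_gpu)) * max_trial_num_per_gpu
```

With every GPU already running some process and use_active_gpu left at False, usable_gpus returns an empty list, which matches the symptom of trials never being assigned a GPU.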

kaijieshi7 commented 4 years ago

666, it's useful, thank you guys.

Davidxswang commented 3 years ago

I ran into the same problem on Windows 10, but I fixed it by:

  1. running the command: $env:METRIC_OUTPUT_DIR = 'd:/'
  2. copying the generated gpu_metrics file into the temp folder

How to reproduce it? Open nni/examples/trials/mnist-pytorch/config_windows.yml, change gpuNum to 1, add localConfig: useActiveGpu: true, then run the experiment. [image]

Hopefully my solution can help some of the people who may run into the same problem in the future.
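
Step 2 of this workaround can be sketched in Python. This is only an illustration: copy_gpu_metrics is a hypothetical helper, and the destination folder is passed in as an argument because the comment does not say which temp folder NNI reads from on Windows.

```python
import shutil
from pathlib import Path


def copy_gpu_metrics(output_dir, nni_temp_dir):
    """Copy <output_dir>/gpu_metrics into nni_temp_dir and return the new path.

    output_dir is the folder that METRIC_OUTPUT_DIR pointed to;
    nni_temp_dir is an assumption supplied by the caller.
    """
    source = Path(output_dir) / "gpu_metrics"
    if not source.is_file():
        raise FileNotFoundError(f"no gpu_metrics file in {output_dir}")
    target_dir = Path(nni_temp_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(source, target_dir / "gpu_metrics"))
```

A call such as copy_gpu_metrics("d:/", some_temp_folder) would mirror the manual copy, with some_temp_folder standing for whichever folder the experiment actually reads.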

EroData commented 3 years ago

I ran into the same warning, "gpu_metrics file does not exist!"

When I ran the suggested command directly: METRIC_OUTPUT_DIR=./ python3 -m nni_gpu_tool.gpu_metrics_collector

it replied that 'nni_gpu_tool' was not found. After running 'ps -ef | grep nni', I found that the module had been renamed to nni.tools.gpu_tool.gpu_metrics_collector,

so I changed the command as follows and ran it again: METRIC_OUTPUT_DIR=./ python3 -m nni.tools.gpu_tool.gpu_metrics_collector

but I got the following error and don't know what to do next. [image]
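
Since the collector module appears under two different names in this thread, a small check can tell which one the installed NNI version provides. This is a minimal sketch using only the standard library; find_collector_module is a hypothetical helper, and the two candidate names are simply the ones mentioned in this thread.

```python
import importlib.util

CANDIDATE_MODULES = (
    "nni.tools.gpu_tool.gpu_metrics_collector",  # name seen via `ps -ef | grep nni`
    "nni_gpu_tool.gpu_metrics_collector",        # name used earlier in the thread
)


def find_collector_module(candidates=CANDIDATE_MODULES):
    """Return the first candidate module that can be located, or None."""
    for name in candidates:
        try:
            if importlib.util.find_spec(name) is not None:
                return name
        except ImportError:
            # find_spec imports parent packages; a missing parent such as
            # `nni` raises ModuleNotFoundError, a subclass of ImportError.
            continue
    return None
```

If the function returns a name, that is the one to pass to python3 -m together with METRIC_OUTPUT_DIR; if it returns None, neither module is importable in the current Python environment.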

YCccc-git commented 2 years ago

@Davidxswang, I am glad to see your suggestion, but could you please explain how the command "$env:METRIC_OUTPUT_DIR = 'd:/'" works on Windows?

coco11563 commented 2 years ago

Hi all. I ran into this warning recently, and my solution is quite easy (on Ubuntu):

  1. Export the environment variable: export METRIC_OUTPUT_DIR="/home/username/tmp"
  2. Create the file by running: python3 -m nni_gpu_tool.gpu_metrics_collector

After JSON output like the following appears:

{"Timestamp": "Thu Sep 8 13:34:10 2022", "gpuCount": 8, "gpuInfos": [{"activeProcessNum": 12, "gpuMemUtil": "0", "gpuUtil": "0", "index": 0}, {"activeProcessNum": 10, "gpuMemUtil": "0", "gpuUtil": "0", "index": 1}, {"activeProcessNum": 1, "gpuMemUtil": "0", "gpuUtil": "0", "index": 2}, {"activeProcessNum": 1, "gpuMemUtil": "0", "gpuUtil": "0", "index": 3}, {"activeProcessNum": 1, "gpuMemUtil": "0", "gpuUtil": "0", "index": 4}, {"activeProcessNum": 1, "gpuMemUtil": "0", "gpuUtil": "0", "index": 5}, {"activeProcessNum": 1, "gpuMemUtil": "0", "gpuUtil": "0", "index": 6}, {"activeProcessNum": 1, "gpuMemUtil": "0", "gpuUtil": "0", "index": 7}]}

run nnictl create --config xxx. No more warnings pop up.
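
The JSON record printed by the collector can be inspected with a few lines of Python. This is a minimal sketch: busy_gpus is a hypothetical helper, and SAMPLE is an adapted, shortened version of the output in this comment (only the first two GPUs, with gpuCount changed to match).

```python
import json

# Adapted from the output posted above: first two GPUs only, gpuCount adjusted.
SAMPLE = (
    '{"Timestamp": "Thu Sep 8 13:34:10 2022", "gpuCount": 2, "gpuInfos": ['
    '{"activeProcessNum": 12, "gpuMemUtil": "0", "gpuUtil": "0", "index": 0}, '
    '{"activeProcessNum": 10, "gpuMemUtil": "0", "gpuUtil": "0", "index": 1}]}'
)


def busy_gpus(record_text):
    """Map GPU index to activeProcessNum for every GPU with a running process."""
    record = json.loads(record_text)
    return {
        gpu["index"]: gpu["activeProcessNum"]
        for gpu in record["gpuInfos"]
        if gpu["activeProcessNum"] > 0
    }
```

In the full output above, every GPU reports at least one active process, so all eight would be listed as busy by this helper.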

WeigangLu commented 1 year ago

@EroData, I also have this issue. How did you fix it?

redLinmumu commented 1 year ago

@coco11563, how does this work? What I got is: python3: Error while finding module specification for 'nni_gpu_tool.gpu_metrics_collector' (ModuleNotFoundError: No module named 'nni_gpu_tool')