kaijieshi7 closed this issue 3 years ago
Hi @kaijieshi7, could you please check whether there is a metric file under /tmp/{.username}/nni/script? Also run ps -ef | grep nni to check whether there is an nni_gpu_metric backend process. Kill all NNI-related backend processes, then restart the experiment.
Where is '/tmp/{.username}/nni/script'? I can't find it.
{username} means your username on your system, for example /tmp/matrix/nni/script.
Do I need to run some code? I'm not running any code related to 'nni', and I can't find '/tmp/matrix/nni/script'.
Hi @kaijieshi7, when NNI starts an experiment, it writes a GPU metric file to /tmp/matrix/nni/scripts/gpu_metrics. Since your log has a "gpu metric file does not exist" error, I wonder whether this file was created correctly. To test generating the gpu_metrics file on your machine, you can run:
METRIC_OUTPUT_DIR=./ python3 -m nni_gpu_tool.gpu_metrics_collector
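A minimal Python sketch of checking whether the metric file exists, assuming it lives at /tmp/<username>/nni/scripts/gpu_metrics as described above. The helper name find_gpu_metrics and its tmp_root parameter are hypothetical and not part of NNI; both directory spellings used in this thread ("scripts" and "script") are tried.

```python
import getpass
import os


def find_gpu_metrics(username=None, tmp_root="/tmp"):
    """Return the path of an existing gpu_metrics file, or None.

    Hypothetical helper, not part of NNI. It tries both directory
    spellings mentioned in this thread ("scripts" and "script").
    """
    user = username or getpass.getuser()
    for dirname in ("scripts", "script"):
        candidate = os.path.join(tmp_root, user, "nni", dirname, "gpu_metrics")
        if os.path.isfile(candidate):
            return candidate
    return None
```

Calling find_gpu_metrics() with no arguments checks the directory of the current user; a None result suggests the collector never wrote the file.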
I ran 'METRIC_OUTPUT_DIR=./ python3 -m nni_gpu_tool.gpu_metrics_collector', and when I run other people's code I can find the '/tmp/matrix/nni/scripts/gpu_metrics' file; but the GPU still doesn't work, just as in my first post.
Hi @kaijieshi7, NNI will not use GPUs that already have another process running on them. You can set the useActiveGpu configuration option to use these busy GPUs, and set maxTrialNumPerGpu to specify how many trial jobs may run on one GPU. Refer to https://github.com/microsoft/nni/blob/master/docs/en_US/Tutorial/ExperimentConfig.md#useactivegpu and to this example:
https://github.com/microsoft/nni/blob/master/examples/trials/cifar10_pytorch/config.yml#L22
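To illustrate the rule above, here is a rough Python sketch that reads a gpu_metrics-style record and lists the GPU indices a scheduler could pick. It assumes the record carries a gpuInfos list with index and activeProcessNum fields; the function name and the selection logic are illustrative assumptions, not NNI's actual scheduler code.

```python
import json


def selectable_gpus(metrics_json, use_active_gpu=False):
    """Return indices of GPUs that may be used for trials.

    Illustrative only: with use_active_gpu=False, a GPU that already has
    a running process is skipped; with True, busy GPUs are kept as well.
    """
    metrics = json.loads(metrics_json)
    return [
        info["index"]
        for info in metrics["gpuInfos"]
        if use_active_gpu or info["activeProcessNum"] == 0
    ]


# Made-up sample: GPU 0 is busy, GPU 1 is idle.
sample = json.dumps({
    "gpuCount": 2,
    "gpuInfos": [
        {"index": 0, "activeProcessNum": 3},
        {"index": 1, "activeProcessNum": 0},
    ],
})
print(selectable_gpus(sample))                       # [1]
print(selectable_gpus(sample, use_active_gpu=True))  # [0, 1]
```

If every GPU reports at least one active process, the first call returns an empty list, which would match the symptom of trials never being assigned a GPU until useActiveGpu is enabled.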
666, it's useful, thank you guys.
I hit the same problem: "gpu_metrics file does not exist!"
When I ran the following command directly: METRIC_OUTPUT_DIR=./ python3 -m nni_gpu_tool.gpu_metrics_collector
it replied that 'nni_gpu_tool' was not found. After running 'ps -ef | grep nni', I found that the module had changed to nni.tools.gpu_tool.gpu_metrics_collector,
so I changed the command to the following and ran it again: METRIC_OUTPUT_DIR=./ python3 -m nni.tools.gpu_tool.gpu_metrics_collector
but I got the following error, and I don't know what to do next.
I ran into the same problem on Windows 10, but I fixed it by:
- running the command: $env:METRIC_OUTPUT_DIR = 'd:/'
- copying the generated gpu_metrics file into the temp folder
How to reproduce it? Open nni/examples/trials/mnist-pytorch/config_windows.yml, change gpuNum to 1, add localConfig: useActiveGpu: true, then run the experiment.
Hopefully my solution helps people who run into the same problem in the future.
I am glad to see your suggestion, but could you please explain how the command "$env:METRIC_OUTPUT_DIR = 'd:/'" works on Windows?
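For context: in PowerShell, $env:NAME = 'value' sets an environment variable for the current session, and programs launched from that session inherit it, much like export on Linux. A small Python sketch of the same inheritance, using only the standard library; the child process here just echoes the variable and stands in for the collector.

```python
import os
import subprocess
import sys

# Set the variable for this process only (analogous to $env:... or export).
os.environ["METRIC_OUTPUT_DIR"] = "d:/"

# A child process started afterwards inherits the variable.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['METRIC_OUTPUT_DIR'])"],
    capture_output=True,
    text=True,
    check=True,
)
inherited = child.stdout.strip()
print(inherited)  # d:/
```

The variable disappears when the session ends, so it has to be set again in each new shell before the collector is started.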
Hi all. I ran into this warning recently, and my solution is quite easy (on Ubuntu).
Step 1: export the environment variable:
export METRIC_OUTPUT_DIR="/home/username/tmp"
Step 2: create the file by running:
python3 -m nni_gpu_tool.gpu_metrics_collector
Once JSON output like the following appears:
{"Timestamp": "Thu Sep 8 13:34:10 2022", "gpuCount": 8, "gpuInfos": [{"activeProcessNum": 12, "gpuMemUtil": "0", "gpuUtil": "0", "index": 0}, {"activeProcessNum": 10, "gpuMemUtil": "0", "gpuUtil": "0", "index": 1}, {"activeProcessNum": 1, "gpuMemUtil": "0", "gpuUtil": "0", "index": 2}, {"activeProcessNum": 1, "gpuMemUtil": "0", "gpuUtil": "0", "index": 3}, {"activeProcessNum": 1, "gpuMemUtil": "0", "gpuUtil": "0", "index": 4}, {"activeProcessNum": 1, "gpuMemUtil": "0", "gpuUtil": "0", "index": 5}, {"activeProcessNum": 1, "gpuMemUtil": "0", "gpuUtil": "0", "index": 6}, {"activeProcessNum": 1, "gpuMemUtil": "0", "gpuUtil": "0", "index": 7}]}
then run:
nnictl --create xxx
No more warnings pop up.
I also have this issue. How did you fix it?
How does it work? What I got is: python3: Error while finding module specification for 'nni_gpu_tool.gpu_metrics_collector' (ModuleNotFoundError: No module named 'nni_gpu_tool')
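The ModuleNotFoundError above is consistent with the rename reported earlier in this thread, where the collector moved from nni_gpu_tool.gpu_metrics_collector to nni.tools.gpu_tool.gpu_metrics_collector. A hedged sketch for finding which of the two names is importable in the current environment; first_importable is a hypothetical helper, and which name applies depends on the installed NNI version.

```python
import importlib.util

# Both module paths are quoted from this thread.
CANDIDATES = (
    "nni.tools.gpu_tool.gpu_metrics_collector",
    "nni_gpu_tool.gpu_metrics_collector",
)


def first_importable(names=CANDIDATES):
    """Return the first module name that can be located, or None."""
    for name in names:
        try:
            if importlib.util.find_spec(name) is not None:
                return name
        except (ImportError, ValueError):
            # ImportError is raised when a parent package (e.g. "nni")
            # is not installed; ValueError when a spec is missing.
            continue
    return None
```

The returned name can then be passed to python3 -m, with METRIC_OUTPUT_DIR set first. A None result suggests NNI is not installed in the Python environment being used.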
Environment: Ubuntu 16.04