microsoft / nni

An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License

Infinite "collect_gpu_info" and gpu not found #5574

Open chachus opened 1 year ago

chachus commented 1 year ago

Describe the issue: During the Hello NAS tutorial for v3.0rc1, when I launch the experiment as described, a number of issues arise:

  1. Even though I set exp.config.trial_gpu_number = 1 and exp.config.training_service.use_active_gpu = True, the logger prints "no gpu found, edit exp.config.trial_gpu_number" (see the launch sketch after this list).
  2. The computer slows down considerably, probably because training is running on the CPU and not on the GPU as requested. Also, checking the system monitor I can see lots of instances of "collect_gpu_info"; here is the screenshot: screenshot. I don't know if this is the intended behavior.
  3. Killing the experiment doesn't stop all these processes.
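
A minimal sketch of the launch code in question, following the Hello NAS tutorial; model_space, evaluator, and search_strategy are assumed to be defined earlier in the tutorial, and the trial budget shown here is illustrative:

```python
# Sketch of the Hello NAS launch step (v3.0rc1) with the GPU settings
# described above. `model_space`, `evaluator`, and `search_strategy`
# come from earlier tutorial steps and are assumed to exist.
from nni.nas.experiment import NasExperiment

exp = NasExperiment(model_space, evaluator, search_strategy)
exp.config.max_trial_number = 3                    # small illustrative budget
exp.config.trial_concurrency = 1
exp.config.trial_gpu_number = 1                    # settings from this report
exp.config.training_service.use_active_gpu = True
exp.run(port=8081)                                 # port matches the error log below
```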

Environment:

Log message:

How to reproduce it?: Hello NAS! tutorial for v3.0rc1

ultmaster commented 1 year ago

Looks like a bug in the GPU metric collector. @liuzhe-lz could you take a look?

liuzhe-lz commented 1 year ago

Please try the script directly (python -m nni.tools.nni_manager_scripts.collect_gpu_info) and tell us the output.

chachus commented 1 year ago

Sure, this is the output:

{
  "gpuNumber": 1,
  "gpus": [
    {"index": 0, "gpuCoreUtilization": 0.01, "gpuMemoryUtilization": 0.05}
  ],
  "processes": [
    {"pid": 2520, "gpuIndex": 0, "type": "graphics", "usedGpuMemory": 315121664},
    {"pid": 2664, "gpuIndex": 0, "type": "graphics", "usedGpuMemory": 160632832},
    {"pid": 3158, "gpuIndex": 0, "type": "graphics", "usedGpuMemory": 55185408},
    {"pid": 6158, "gpuIndex": 0, "type": "graphics", "usedGpuMemory": 383524864},
    {"pid": 6495, "gpuIndex": 0, "type": "graphics", "usedGpuMemory": 4526080},
    {"pid": 10007, "gpuIndex": 0, "type": "graphics", "usedGpuMemory": 93855744}
  ],
  "success": true
}

liuzhe-lz commented 1 year ago

What's the content of ~/nni-experiments/<EXP-ID>/log/nnimanager.log? Seems collect_gpu_info is working well.

chachus commented 1 year ago

Sorry, I probably forgot to link the files in the issue. Here are the logs: experiment.log nnimanager.log

If I run the tutorial, what happens is:

[2023-06-02 10:47:01] Config is not provided. Will try to infer.
[2023-06-02 10:47:01] Using execution engine based on training service. Trial concurrency is set to 1.
[2023-06-02 10:47:01] Using simplified model format.
[2023-06-02 10:47:01] Using local training service.
[2023-06-02 10:47:01] WARNING: GPU found but will not be used. Please set experiment.config.trial_gpu_number to the number of GPUs you want to use for each trial.
[2023-06-02 10:47:01] Creating experiment, Experiment ID: 1v85b07z
[2023-06-02 10:47:02] Starting web server...
[2023-06-02 10:47:05] ERROR: Create experiment failed: HTTPConnectionPool(host='localhost', port=8081): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fbe3cb8a860>: Failed to establish a new connection: [Errno 111] Connection refused'))

The warning "GPU found but will not be used" appears even though trial_gpu_number is set to 1. After this, the PC slows down because the process is not killed even after the error (the connection port is still occupied), and it leaves a ton of zombie collect_gpu_info processes, as reported.
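
As a stopgap, the orphaned collectors can be killed by hand (e.g. pkill -f collect_gpu_info on Linux). A small sketch of the same cleanup, assuming psutil (which NNI already depends on) and that the collector command line contains "collect_gpu_info":

```python
# Stopgap cleanup sketch: kill the orphaned GPU-metric collector processes
# left behind after the experiment fails. Assumes the collector's command
# line contains "collect_gpu_info", as shown in the system monitor above.
import psutil

me = psutil.Process().pid
for proc in psutil.process_iter(["pid", "cmdline"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    if "collect_gpu_info" in cmdline and proc.info["pid"] != me:
        try:
            proc.kill()
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass  # process already gone or not ours to kill
```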

ferqui commented 1 year ago

Hi, I have the same issue: when I launch NNI v3.0rc1, an infinite number of collect_gpu_info processes appear. I tried it on different PCs, Linux and Windows, and all have the same problem. Is there a solution for it? Thanks.

chachus commented 1 year ago

Not that I know of at the moment. I'm waiting for the next release, hoping for a fix.

chachus commented 1 year ago

> Please try the script directly (python -m nni.tools.nni_manager_scripts.collect_gpu_info) and tell us the output.

I saw that version 3.0 has been published, but I still encounter this bug. Any help?

levisocool commented 11 months ago

Also encountered the same issue on the Hello NAS tutorial for v3.0 (21/8/2023, latest as of Sep 14). But if I run on CPU, everything works well, with a reachable web URL and 3 succeeded trials. The log is on a company computer with no internet access, so I typed out the ERROR lines I saw in the log:

...
INFO (nni.nas.experiment.experiment) Experiment initialized successfully. Starting exploration strategy...
ERROR (nni.nas.strategy.base) Strategy failed to execute.
ERROR (Thread-5 (listen):nni.runtime.command_channel.websocket.channel) Failed to receive command. Retry in 0s
...

kiramt commented 8 months ago

I have this problem too with NNI version 3.0. In my case NNI looks like it is running, as I get no errors:

$ nnictl create --config nni_config.yaml --port 8001
[2024-02-15 13:38:12] Creating experiment, Experiment ID: 20efndaz
[2024-02-15 13:38:12] Starting web server...
[2024-02-15 13:38:13] Setting up...
[2024-02-15 13:38:13] Web portal URLs: http://127.0.0.1:8001 http://172.17.0.4:8001
[2024-02-15 13:38:13] To stop experiment run "nnictl stop 20efndaz" or "nnictl stop --all"
[2024-02-15 13:38:13] Reference: https://nni.readthedocs.io/en/stable/reference/nnictl.html

but all that happens is that my server fills up with nni.tools.nni_manager_scripts.collect_gpu_info processes, which I have to kill.

If I run using CPUs it seems to be fine.

I'm using TensorFlow on Debian, but I've also tried an Ubuntu Docker image and get the same result.
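
If it helps with diagnosis, a small sketch (again assuming psutil) that periodically counts the collector processes, so you can watch whether the number keeps growing while the experiment is up:

```python
# Diagnostic sketch: count collect_gpu_info processes every few seconds
# to confirm the runaway spawn. Stop with Ctrl-C. Assumes psutil.
import time
import psutil

def count_collectors() -> int:
    return sum(
        "collect_gpu_info" in " ".join(p.info["cmdline"] or [])
        for p in psutil.process_iter(["cmdline"])
    )

while True:
    print(f"collect_gpu_info processes: {count_collectors()}")
    time.sleep(5)
```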