An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
Describe the issue:
When I start an experiment, it stops after one trial and reports the error below. My code runs on the GPU on its own without error, and the GPU is available at the time.
Environment:
NNI version: 2.7
Training service (local|remote|pai|aml|etc): local
nnimanager.log:
[2023-12-28 02:31:12] INFO (main) Start NNI manager
[2023-12-28 02:31:12] INFO (NNIDataStore) Datastore initialization done
[2023-12-28 02:31:12] INFO (RestServer) Starting REST server at port 8088, URL prefix: "/"
[2023-12-28 02:31:12] INFO (RestServer) REST server started.
[2023-12-28 02:31:13] INFO (NNIManager) Starting experiment: wsjofmed
[2023-12-28 02:31:13] INFO (NNIManager) Setup training service...
[2023-12-28 02:31:13] INFO (LocalTrainingService) Construct local machine training service.
[2023-12-28 02:31:13] INFO (NNIManager) Setup tuner...
[2023-12-28 02:31:13] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING
[2023-12-28 02:31:13] INFO (NNIManager) Add event listeners
[2023-12-28 02:31:13] INFO (LocalTrainingService) Run local machine training service.
[2023-12-28 02:31:13] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-12-28 02:31:13] INFO (NNIManager) NNIManager received command from dispatcher: ID,
[2023-12-28 02:31:13] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"w_adv": 0.25, "N_steps_D": 5}, "parameter_index": 0}
[2023-12-28 02:31:18] INFO (NNIManager) submitTrialJob: form: {
sequenceId: 0,
hyperParameters: {
value: '{"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"w_adv": 0.25, "N_steps_D": 5}, "parameter_index": 0}',
index: 0
},
placementConstraint: { type: 'None', gpus: [] }
}
[2023-12-28 02:31:28] INFO (NNIManager) Trial job IzTYm status changed from WAITING to RUNNING
[2023-12-28 02:52:33] WARNING (IpcInterface) Commands jammed in buffer!
[2023-12-28 02:52:38] WARNING (IpcInterface) Commands jammed in buffer!
[2023-12-28 02:52:43] WARNING (IpcInterface) Commands jammed in buffer!
This warning repeats many times.
dispatcher.log:
[2023-12-28 02:31:12] INFO (nni.experiment/MainThread) Creating experiment, Experiment ID: wsjofmed
[2023-12-28 02:31:12] INFO (nni.experiment/MainThread) Starting web server...
[2023-12-28 02:31:13] INFO (nni.experiment/MainThread) Setting up...
[2023-12-28 02:31:13] INFO (nni.experiment/MainThread) Web portal URLs: http://127.0.0.1:8088 http://10.214.163.164:8088
[2023-12-28 02:31:13] INFO (nni.tools.nnictl.launcher/MainThread) To stop experiment run "nnictl stop wsjofmed" or "nnictl stop --all"
[2023-12-28 02:31:13] INFO (nni.tools.nnictl.launcher/MainThread) Reference: https://nni.readthedocs.io/en/stable/Tutorial/Nnictl.html
[2023-12-28 02:31:13] INFO (numexpr.utils/MainThread) Note: NumExpr detected 24 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
[2023-12-28 02:31:13] INFO (numexpr.utils/MainThread) NumExpr defaulting to 8 threads.
[2023-12-28 02:31:13] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher started
[2023-12-28 02:31:13] INFO (nni.tuner.gridsearch/Thread-1) Grid initialized, size: (4×3) = 12
[2023-12-28 02:52:27] ERROR (nni.runtime.msg_dispatcher_base/Thread-2) Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
Traceback (most recent call last):
File "/home/rox/anaconda3/lib/python3.8/site-packages/nni/runtime/msg_dispatcher_base.py", line 88, in command_queue_worker
self.process_command(command, data)
File "/home/rox/anaconda3/lib/python3.8/site-packages/nni/runtime/msg_dispatcher_base.py", line 147, in process_command
command_handlers[command](data)
File "/home/rox/anaconda3/lib/python3.8/site-packages/nni/runtime/msg_dispatcher.py", line 136, in handle_report_metric_data
data['value'] = load(data['value'])
File "/home/rox/anaconda3/lib/python3.8/site-packages/nni/common/serializer.py", line 401, in load
return json_tricks.loads(string, obj_pairs_hooks=hooks, **json_tricks_kwargs)
File "/home/rox/anaconda3/lib/python3.8/site-packages/json_tricks/nonp.py", line 236, in loads
return json_loads(string, object_pairs_hook=hook, **jsonkwargs)
File "/home/rox/anaconda3/lib/python3.8/json/__init__.py", line 370, in loads
return cls(*kw).decode(s)
File "/home/rox/anaconda3/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/home/rox/anaconda3/lib/python3.8/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
File "/home/rox/anaconda3/lib/python3.8/site-packages/json_tricks/decoders.py", line 44, in __call__
map = hook(map, properties=self.properties)
File "/home/rox/anaconda3/lib/python3.8/site-packages/json_tricks/utils.py", line 66, in wrapper
return encoder(*args, **{k: v for k, v in kwargs.items() if k in names})
File "/home/rox/anaconda3/lib/python3.8/site-packages/nni/common/serializer.py", line 820, in _json_tricks_any_object_decode
return _wrapped_cloudpickle_loads(b)
File "/home/rox/anaconda3/lib/python3.8/site-packages/nni/common/serializer.py", line 826, in _wrapped_cloudpickle_loads
return cloudpickle.loads(b)
File "/home/rox/anaconda3/lib/python3.8/site-packages/torch/storage.py", line 161, in _load_from_bytes
return torch.load(io.BytesIO(b))
File "/home/rox/anaconda3/lib/python3.8/site-packages/torch/serialization.py", line 608, in load
return _legacy_load(opened_file, map_location, pickle_module, pickle_load_args)
File "/home/rox/anaconda3/lib/python3.8/site-packages/torch/serialization.py", line 787, in _legacy_load
result = unpickler.load()
File "/home/rox/anaconda3/lib/python3.8/site-packages/torch/serialization.py", line 743, in persistent_load
deserialized_objects[root_key] = restore_location(obj, location)
File "/home/rox/anaconda3/lib/python3.8/site-packages/torch/serialization.py", line 175, in default_restore_location
result = fn(storage, location)
File "/home/rox/anaconda3/lib/python3.8/site-packages/torch/serialization.py", line 151, in _cuda_deserialize
device = validate_cuda_device(location)
File "/home/rox/anaconda3/lib/python3.8/site-packages/torch/serialization.py", line 135, in validate_cuda_device
raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
[2023-12-28 02:52:28] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher exiting...
[2023-12-28 02:52:31] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher terminiated
nnictl stdout and stderr:
These two show no errors.
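The traceback suggests the trial reported a metric that still held a CUDA tensor (e.g. passing a GPU loss tensor to `nni.report_final_result`), so the dispatcher process, which has no CUDA context, crashes while unpickling it. A minimal sketch of a likely workaround, assuming the metric comes from PyTorch (the helper name `to_metric` is hypothetical, not an NNI API):

```python
def to_metric(value):
    """Convert a reported value to a plain Python float so the NNI
    dispatcher never has to unpickle a CUDA storage."""
    item = getattr(value, "item", None)  # torch.Tensor exposes .item()
    return float(item()) if callable(item) else float(value)

# In the trial code, report the converted value instead of the tensor:
#   nni.report_final_result(to_metric(loss))
```

This keeps only a float crossing the IPC boundary, which both the tuner and the dispatcher can deserialize without torch or CUDA.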