Describe the issue:
I created the experiment with nnictl create --config xx --p xxxx.
After a while I ran nnictl experiment --all to check on it and found that it had stopped. The dispatcher.log shows the error below,
but the corresponding trial process is still running on the GPU.
By the way, the last time I used NNI this error did not occur; I don't know what caused it.
Environment:
NNI version: 2.10.1
Training service (local|remote|pai|aml|etc): local
nnimanager.log:
[2024-04-12 18:48:34] INFO (main) Start NNI manager
[2024-04-12 18:48:34] INFO (NNIDataStore) Datastore initialization done
[2024-04-12 18:48:34] INFO (RestServer) Starting REST server at port 8080, URL prefix: "/"
[2024-04-12 18:48:34] INFO (RestServer) REST server started.
[2024-04-12 18:48:35] INFO (NNIManager) Starting experiment: b7edpl94
[2024-04-12 18:48:35] INFO (NNIManager) Setup training service...
[2024-04-12 18:48:35] INFO (LocalTrainingService) Construct local machine training service.
[2024-04-12 18:48:35] INFO (NNIManager) Setup tuner...
[2024-04-12 18:48:35] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING
[2024-04-12 18:48:36] INFO (NNIManager) Add event listeners
[2024-04-12 18:48:36] INFO (LocalTrainingService) Run local machine training service.
[2024-04-12 18:48:36] INFO (NNIManager) NNIManager received command from dispatcher: ID,
[2024-04-12 18:48:36] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"lr": 0.0002, "beta1": 0.0001, "beta2": 0.999, "lambda_e": 5e-05}, "parameter_index": 0}
[2024-04-12 18:48:36] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 1, "parameter_source": "algorithm", "parameters": {"lr": 0.001, "beta1": 1e-05, "beta2": 0.9, "lambda_e": 5e-05}, "parameter_index": 0}
[2024-04-12 18:48:41] INFO (NNIManager) submitTrialJob: form: {
sequenceId: 0,
hyperParameters: {
value: '{"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"lr": 0.0002, "beta1": 0.0001, "beta2": 0.999, "lambda_e": 5e-05}, "parameter_index": 0}',
index: 0
},
placementConstraint: { type: 'None', gpus: [] }
}
[2024-04-12 18:48:41] INFO (NNIManager) submitTrialJob: form: {
sequenceId: 1,
hyperParameters: {
value: '{"parameter_id": 1, "parameter_source": "algorithm", "parameters": {"lr": 0.001, "beta1": 1e-05, "beta2": 0.9, "lambda_e": 5e-05}, "parameter_index": 0}',
index: 0
},
placementConstraint: { type: 'None', gpus: [] }
}
[2024-04-12 18:48:51] INFO (NNIManager) Trial job ZlXeN status changed from WAITING to RUNNING
[2024-04-12 18:48:51] INFO (NNIManager) Trial job Rh0Pn status changed from WAITING to RUNNING
[2024-04-12 18:49:42] ERROR (tuner_command_channel.WebSocketChannel) Error: Error: tuner_command_channel: Tuner closed connection
at WebSocket.handleWsClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:83:26)
at WebSocket.emit (node:events:538:35)
at WebSocket.emitClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:246:10)
at Socket.socketOnClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:1127:15)
at Socket.emit (node:events:526:28)
at TCP.<anonymous> (node:net:687:12)
dispatcher.log:
[2024-04-12 18:48:35] INFO (numexpr.utils/MainThread) Note: NumExpr detected 64 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
[2024-04-12 18:48:35] INFO (numexpr.utils/MainThread) NumExpr defaulting to 8 threads.
[2024-04-12 18:48:36] INFO (nni.tuner.tpe/MainThread) Using random seed 1314744945
[2024-04-12 18:48:36] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher started
[2024-04-12 18:49:19] ERROR (nni.runtime.msg_dispatcher_base/Thread-2) Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
Traceback (most recent call last):
File "/home/yiran/.local/lib/python3.8/site-packages/nni/runtime/msg_dispatcher_base.py", line 108, in command_queue_worker
self.process_command(command, data)
File "/home/yiran/.local/lib/python3.8/site-packages/nni/runtime/msg_dispatcher_base.py", line 154, in process_command
command_handlers[command](data)
File "/home/yiran/.local/lib/python3.8/site-packages/nni/runtime/msg_dispatcher.py", line 144, in handle_report_metric_data
data['value'] = load(data['value'])
File "/home/yiran/.local/lib/python3.8/site-packages/nni/common/serializer.py", line 443, in load
return json_tricks.loads(string, obj_pairs_hooks=hooks, **json_tricks_kwargs)
File "/home/yiran/.local/lib/python3.8/site-packages/json_tricks/nonp.py", line 259, in loads
return _strip_loads(string, hook, True, **jsonkwargs)
File "/home/yiran/.local/lib/python3.8/site-packages/json_tricks/nonp.py", line 266, in _strip_loads
return json_loads(string, object_pairs_hook=object_pairs_hook, **jsonkwargs)
File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/json/init.py", line 370, in loads
return cls(kw).decode(s)
File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
File "/home/yiran/.local/lib/python3.8/site-packages/json_tricks/decoders.py", line 46, in call
map = hook(map, properties=self.properties)
File "/home/yiran/.local/lib/python3.8/site-packages/json_tricks/utils.py", line 66, in wrapper
return encoder(*args, **{k: v for k, v in kwargs.items() if k in names})
File "/home/yiran/.local/lib/python3.8/site-packages/nni/common/serializer.py", line 877, in _json_tricks_any_object_decode
return _wrapped_cloudpickle_loads(b)
File "/home/yiran/.local/lib/python3.8/site-packages/nni/common/serializer.py", line 883, in _wrapped_cloudpickle_loads
return cloudpickle.loads(b)
File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/storage.py", line 161, in _load_from_bytes
return torch.load(io.BytesIO(b))
File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/serialization.py", line 608, in load
return _legacy_load(opened_file, map_location, pickle_module, pickle_load_args)
File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/serialization.py", line 787, in _legacy_load
result = unpickler.load()
File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/serialization.py", line 743, in persistent_load
deserialized_objects[root_key] = restore_location(obj, location)
File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/serialization.py", line 175, in default_restore_location
result = fn(storage, location)
File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/serialization.py", line 151, in _cuda_deserialize
device = validate_cuda_device(location)
File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/serialization.py", line 135, in validate_cuda_device
raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
[2024-04-12 18:49:40] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher exiting...
[2024-04-12 18:49:42] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher terminiated
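
From the traceback, the crash seems to happen while the dispatcher deserializes a reported metric that contains a CUDA tensor, and torch.cuda.is_available() is False inside the dispatcher's Python process. I have not confirmed the exact line in my trial yet, but I suspect the pattern is roughly the following (a minimal sketch; names are illustrative, not my actual code):

# Suspected pattern: reporting a metric that is still a CUDA tensor.
# NNI cloudpickles the tensor, and the dispatcher then fails to unpickle it
# because CUDA is not available in the dispatcher's environment.
import nni
import torch

score = torch.tensor(0.87, device="cuda")     # metric still lives on the GPU

# nni.report_final_result(score)              # this is what seems to trigger the crash

# Converting to a plain Python float before reporting should avoid the
# CUDA deserialization path entirely:
nni.report_final_result(float(score.item()))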
Configuration:
Experiment config (remember to remove secrets!):
trialCommand: CUDA_VISIBLE_DEVICES=0 python k+1_gan.py
trialConcurrency: 2
maxTrialNumber: 1000
maxExperimentDuration: 200h
experimentWorkingDirectory: "/home/yiran/codes/Knowledge-Enriched-DMI/nni-experiment"
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
trainingService:
  platform: local
Search space:
{
  "lr": {"_type": "choice", "_value": [0.00005, 0.0001, 0.0002, 0.0005, 0.001]},
  "beta1": {"_type": "choice", "_value": [0.001, 0.0001, 0.00001]},
  "beta2": {"_type": "choice", "_value": [0.9, 0.999]},
  "lambda_e": {"_type": "choice", "_value": [0.00005]}
}
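
For context, the trial consumes this search space in the standard way; here is a simplified, self-contained sketch (illustrative only; the real k+1_gan.py trains a GAN in the loop and its metric is different):

# Simplified sketch of how the trial reads the search space above.
import nni

params = nni.get_next_parameter()   # e.g. {"lr": 0.0002, "beta1": 0.0001, "beta2": 0.999, "lambda_e": 5e-05}
lr = params["lr"]
betas = (params["beta1"], params["beta2"])   # would configure the optimizer in the real script
lambda_e = params["lambda_e"]

for epoch in range(3):
    # stand-in for one training epoch
    metric = 1.0 - lr * (epoch + 1)
    nni.report_intermediate_result(metric)

nni.report_final_result(metric)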
nnictl stdout and stderr:
Experiment b7edpl94 start: 2024-04-12 18:48:34.614673
node:events:504
throw er; // Unhandled 'error' event
^
Error: tuner_command_channel: Tuner closed connection
at WebSocket.handleWsClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:83:26)
at WebSocket.emit (node:events:538:35)
at WebSocket.emitClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:246:10)
at Socket.socketOnClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:1127:15)
at Socket.emit (node:events:526:28)
at TCP.<anonymous> (node:net:687:12)
Emitted 'error' event at:
at WebSocketChannelImpl.handleError (/home/yiran/.local/lib/python3.8/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:135:22)
at WebSocket.handleWsClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:83:14)
at WebSocket.emit (node:events:538:35)
[... lines matching original stack trace ...]
at TCP.<anonymous> (node:net:687:12)
Thrown at:
at handleWsClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:83:26)
at emit (node:events:538:35)
at emitClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:246:10)
at socketOnClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:1127:15)
at emit (node:events:526:28)
at node:net:687:12
How to reproduce it?:
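I have not isolated a minimal case yet, but I believe something like the following should reproduce it, assuming (as in my setup) that the trial runs with CUDA_VISIBLE_DEVICES=0 while torch.cuda.is_available() is False in the Python environment that runs the NNI dispatcher. Use the experiment config above with trialCommand pointing at this script (repro_trial.py is a hypothetical name):

# repro_trial.py -- minimal trial that I believe triggers the same crash.
# The metric is reported as a CUDA tensor, so NNI has to cloudpickle the
# tensor storage; the dispatcher then fails to unpickle it without CUDA.
import nni
import torch

nni.get_next_parameter()                    # consume one set of hyper-parameters
metric = torch.tensor(0.5, device="cuda")   # any CUDA-backed tensor
nni.report_final_result(metric)             # crash then happens on the dispatcher side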