An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
Describe the issue:
When I start an experiment, it stops after one trial and reports the error below. My code runs on the GPU on its own without error, and the GPU is available at the time.
Environment:
NNI version: 2.7
Training service (local|remote|pai|aml|etc): local
nnimanager.log:
[2023-12-28 02:31:12] INFO (main) Start NNI manager
[2023-12-28 02:31:12] INFO (NNIDataStore) Datastore initialization done
[2023-12-28 02:31:12] INFO (RestServer) Starting REST server at port 8088, URL prefix: "/"
[2023-12-28 02:31:12] INFO (RestServer) REST server started.
[2023-12-28 02:31:13] INFO (NNIManager) Starting experiment: wsjofmed
[2023-12-28 02:31:13] INFO (NNIManager) Setup training service...
[2023-12-28 02:31:13] INFO (LocalTrainingService) Construct local machine training service.
[2023-12-28 02:31:13] INFO (NNIManager) Setup tuner...
[2023-12-28 02:31:13] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING
[2023-12-28 02:31:13] INFO (NNIManager) Add event listeners
[2023-12-28 02:31:13] INFO (LocalTrainingService) Run local machine training service.
[2023-12-28 02:31:13] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-12-28 02:31:13] INFO (NNIManager) NNIManager received command from dispatcher: ID,
[2023-12-28 02:31:13] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"w_adv": 0.25, "N_steps_D": 5}, "parameter_index": 0}
[2023-12-28 02:31:18] INFO (NNIManager) submitTrialJob: form: {
sequenceId: 0,
hyperParameters: {
value: '{"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"w_adv": 0.25, "N_steps_D": 5}, "parameter_index": 0}',
index: 0
},
placementConstraint: { type: 'None', gpus: [] }
}
[2023-12-28 02:31:28] INFO (NNIManager) Trial job IzTYm status changed from WAITING to RUNNING
[2023-12-28 02:52:33] WARNING (IpcInterface) Commands jammed in buffer!
[2023-12-28 02:52:38] WARNING (IpcInterface) Commands jammed in buffer!
[2023-12-28 02:52:43] WARNING (IpcInterface) Commands jammed in buffer!
This warning repeats many times.
dispatcher.log:
[2023-12-28 02:31:12] INFO (nni.experiment/MainThread) Creating experiment, Experiment ID: wsjofmed
[2023-12-28 02:31:12] INFO (nni.experiment/MainThread) Starting web server...
[2023-12-28 02:31:13] INFO (nni.experiment/MainThread) Setting up...
[2023-12-28 02:31:13] INFO (nni.experiment/MainThread) Web portal URLs: http://127.0.0.1:8088 http://10.214.163.164:8088
[2023-12-28 02:31:13] INFO (nni.tools.nnictl.launcher/MainThread) To stop experiment run "nnictl stop wsjofmed" or "nnictl stop --all"
[2023-12-28 02:31:13] INFO (nni.tools.nnictl.launcher/MainThread) Reference: https://nni.readthedocs.io/en/stable/Tutorial/Nnictl.html
[2023-12-28 02:31:13] INFO (numexpr.utils/MainThread) Note: NumExpr detected 24 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
[2023-12-28 02:31:13] INFO (numexpr.utils/MainThread) NumExpr defaulting to 8 threads.
[2023-12-28 02:31:13] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher started
[2023-12-28 02:31:13] INFO (nni.tuner.gridsearch/Thread-1) Grid initialized, size: (4×3) = 12
[2023-12-28 02:52:27] ERROR (nni.runtime.msg_dispatcher_base/Thread-2) Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
Traceback (most recent call last):
File "/home/rox/anaconda3/lib/python3.8/site-packages/nni/runtime/msg_dispatcher_base.py", line 88, in command_queue_worker
self.process_command(command, data)
File "/home/rox/anaconda3/lib/python3.8/site-packages/nni/runtime/msg_dispatcher_base.py", line 147, in process_command
command_handlers[command](data)
File "/home/rox/anaconda3/lib/python3.8/site-packages/nni/runtime/msg_dispatcher.py", line 136, in handle_report_metric_data
data['value'] = load(data['value'])
File "/home/rox/anaconda3/lib/python3.8/site-packages/nni/common/serializer.py", line 401, in load
return json_tricks.loads(string, obj_pairs_hooks=hooks, **json_tricks_kwargs)
File "/home/rox/anaconda3/lib/python3.8/site-packages/json_tricks/nonp.py", line 236, in loads
return json_loads(string, object_pairs_hook=hook, **jsonkwargs)
File "/home/rox/anaconda3/lib/python3.8/json/__init__.py", line 370, in loads
return cls(*kw).decode(s)
File "/home/rox/anaconda3/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/home/rox/anaconda3/lib/python3.8/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
File "/home/rox/anaconda3/lib/python3.8/site-packages/json_tricks/decoders.py", line 44, in __call__
map = hook(map, properties=self.properties)
File "/home/rox/anaconda3/lib/python3.8/site-packages/json_tricks/utils.py", line 66, in wrapper
return encoder(*args, **{k: v for k, v in kwargs.items() if k in names})
File "/home/rox/anaconda3/lib/python3.8/site-packages/nni/common/serializer.py", line 820, in _json_tricks_any_object_decode
return _wrapped_cloudpickle_loads(b)
File "/home/rox/anaconda3/lib/python3.8/site-packages/nni/common/serializer.py", line 826, in _wrapped_cloudpickle_loads
return cloudpickle.loads(b)
File "/home/rox/anaconda3/lib/python3.8/site-packages/torch/storage.py", line 161, in _load_from_bytes
return torch.load(io.BytesIO(b))
File "/home/rox/anaconda3/lib/python3.8/site-packages/torch/serialization.py", line 608, in load
return _legacy_load(opened_file, map_location, pickle_module, pickle_load_args)
File "/home/rox/anaconda3/lib/python3.8/site-packages/torch/serialization.py", line 787, in _legacy_load
result = unpickler.load()
File "/home/rox/anaconda3/lib/python3.8/site-packages/torch/serialization.py", line 743, in persistent_load
deserialized_objects[root_key] = restore_location(obj, location)
File "/home/rox/anaconda3/lib/python3.8/site-packages/torch/serialization.py", line 175, in default_restore_location
result = fn(storage, location)
File "/home/rox/anaconda3/lib/python3.8/site-packages/torch/serialization.py", line 151, in _cuda_deserialize
device = validate_cuda_device(location)
File "/home/rox/anaconda3/lib/python3.8/site-packages/torch/serialization.py", line 135, in validate_cuda_device
raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
[2023-12-28 02:52:28] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher exiting...
[2023-12-28 02:52:31] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher terminiated
nnictl stdout and stderr:
These two show no errors.
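The traceback suggests the trial reported a metric that still held a CUDA tensor (e.g. passing a GPU loss tensor to `nni.report_final_result`), so the dispatcher process, which has no CUDA context, crashes while unpickling it. A minimal sketch of a likely workaround, assuming the metric comes from PyTorch (the helper name `to_metric` is hypothetical, not an NNI API):

```python
def to_metric(value):
    """Convert a reported value to a plain Python float so the NNI
    dispatcher never has to unpickle a CUDA storage."""
    item = getattr(value, "item", None)  # torch.Tensor exposes .item()
    return float(item()) if callable(item) else float(value)

# In the trial code, report the converted value instead of the tensor:
#   nni.report_final_result(to_metric(loss))
```

This keeps only a float crossing the IPC boundary, which both the tuner and the dispatcher can deserialize without torch or CUDA.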