microsoft / nni

An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression, and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License

Getting cuda_cores: Function Not Found #5540

Open TayyabaZainab0807 opened 1 year ago

TayyabaZainab0807 commented 1 year ago

Describe the issue: The NNI process does not run with nni 3.0b1. I also tried more stable NNI versions (2.10 and 2.8). I get the following error:

Traceback (most recent call last):
  File "/usr/local/bin/nnictl", line 8, in <module>
    sys.exit(parse_args())
  File "/usr/local/lib/python3.10/dist-packages/nni/tools/nnictl/nnictl.py", line 497, in parse_args
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/nni/tools/nnictl/launcher.py", line 91, in create_experiment
    exp.start(port, debug, RunMode.Detach)
  File "/usr/local/lib/python3.10/dist-packages/nni/experiment/experiment.py", line 135, in start
    self._start_impl(port, debug, run_mode, None, [])
  File "/usr/local/lib/python3.10/dist-packages/nni/experiment/experiment.py", line 94, in _start_impl
    config = self.config.canonical_copy()
  File "/usr/local/lib/python3.10/dist-packages/nni/experiment/config/base.py", line 166, in canonical_copy
    canon._canonicalize([])
  File "/usr/local/lib/python3.10/dist-packages/nni/experiment/config/experiment_config.py", line 121, in _canonicalize
    if algo is not None and algo.name == '_none_':  # type: ignore
AttributeError: 'dict' object has no attribute 'name'
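
The last frame fails because `algo` is a plain `dict` at the point where the code reads `algo.name`. The sketch below reproduces that class of error in isolation. `AlgoConfig` and `as_algo_config` are hypothetical names used only for illustration; this is not NNI's implementation and not a confirmed fix for this issue.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Union


@dataclass
class AlgoConfig:
    """Hypothetical stand-in for a typed tuner config (not NNI's class)."""
    name: str
    class_args: Dict[str, Any] = field(default_factory=dict)


def as_algo_config(algo: Union[AlgoConfig, Dict[str, Any]]) -> AlgoConfig:
    """Convert a raw mapping into AlgoConfig so attribute access is safe."""
    if isinstance(algo, AlgoConfig):
        return algo
    return AlgoConfig(name=algo["name"], class_args=dict(algo.get("classArgs", {})))


# Tuner section of the reported configuration, held as a raw dict.
raw_tuner = {"name": "TPE", "classArgs": {"optimize_mode": "maximize"}}

try:
    raw_tuner.name  # same failure mode as `algo.name` in the traceback
except AttributeError as err:
    message = str(err)

tuner = as_algo_config(raw_tuner)
print(message)
print(tuner.name, tuner.class_args)
```

The point of the sketch is only that attribute access works once the mapping has been converted to a typed object; whether the conversion was skipped here because of the config contents or because of mixed package versions cannot be told from the traceback alone.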

Environment:

Configuration:

maxExperimentDuration: 156h
maxTrialNumber: 200
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
trainingService:
  platform: local
  useActiveGpu: True


 - Search space:

{
"en_decoder": { "_type": "choice", "_value": [7,8,9] },
"k1": { "_type": "choice", "_value": [3,5,7,9,11] },
"k2": { "_type": "choice", "_value": [3,5,7,9,11] },
"k3": { "_type": "choice", "_value": [3,5,7,9,11] },
"k4": { "_type": "choice", "_value": [3,5,7,9,11] },
"k5": { "_type": "choice", "_value": [3,5,7,9,11] },
"k6": { "_type": "choice", "_value": [3,5,7,9,11] },
"k7": { "_type": "choice", "_value": [3,5,7,9,11] },
"k8": { "_type": "choice", "_value": [3,5,7,9,11] },
"k9": { "_type": "choice", "_value": [3,5,7,9,11] },
"f1": { "_type": "choice", "_value": [8,16,32] },
"f2": { "_type": "choice", "_value": [8,16,32] },
"f3": { "_type": "choice", "_value": [8,16,32] },
"f4": { "_type": "choice", "_value": [8,16,32] },
"f5": { "_type": "choice", "_value": [8,16,32] },
"f6": { "_type": "choice", "_value": [8,16,32] },
"f7": { "_type": "choice", "_value": [8,16,32] },
"f8": { "_type": "choice", "_value": [8,16,32] },
"f9": { "_type": "choice", "_value": [8,16,32] },

"res_cnn": { "_type": "choice", "_value": [1,2,3] },
"res_f1": { "_type": "choice", "_value": [8,16,32] },
"res_f2": { "_type": "choice", "_value": [8,16,32] },
"res_f3": { "_type": "choice", "_value": [8,16,32] },
"res_k1": { "_type": "choice", "_value": [3,5] },
"res_k2": { "_type": "choice", "_value": [3,5] },
"res_k3": { "_type": "choice", "_value": [3,5] },
"res_drop1": {"_type": "uniform", "_value": [0.1,0.3]},
"res_drop2": {"_type": "uniform", "_value": [0.1,0.3]},
"res_drop3": {"_type": "uniform", "_value": [0.1,0.3]},

"bilstm": { "_type": "choice", "_value": [1,2]},
"u1": { "_type": "choice", "_value": [8,16] },
"u2": { "_type": "choice", "_value": [8,16] },
"drop": {"_type": "uniform", "_value": [0.1,0.3]},

"pu": { "_type": "choice", "_value": [8,16] },
"su": { "_type": "choice", "_value": [8,16] },

"batch_size": { "_type": "choice", "_value": [50,80,100]},
"epochs":{ "_type": "choice", "_value": [10,15,20,25,30] }

}
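
As a quick sanity check on a search space like the one above, the sketch below verifies that each entry has `_type` and `_value` keys and counts the combinations spanned by the `choice` parameters. The helper name `count_choice_combinations` and the three-entry sample are illustrative assumptions. Every non-`choice` entry is treated as continuous, which holds for the `uniform` entries used here but is a simplification in general.

```python
import json
from math import prod
from typing import Dict, Tuple

# Three entries copied from the search space above, as a small sample.
SAMPLE = """
{
  "en_decoder": { "_type": "choice", "_value": [7, 8, 9] },
  "res_k1": { "_type": "choice", "_value": [3, 5] },
  "drop": { "_type": "uniform", "_value": [0.1, 0.3] }
}
"""


def count_choice_combinations(space: Dict[str, dict]) -> Tuple[int, int]:
    """Return (number of choice combinations, number of continuous parameters)."""
    choice_sizes = []
    continuous = 0
    for name, spec in space.items():
        if "_type" not in spec or "_value" not in spec:
            raise ValueError(f"{name}: expected '_type' and '_value' keys")
        if spec["_type"] == "choice":
            choice_sizes.append(len(spec["_value"]))
        else:
            continuous += 1  # assumption: anything else is a continuous range
    return prod(choice_sizes), continuous


space = json.loads(SAMPLE)
combinations, continuous = count_choice_combinations(space)
print(combinations, continuous)
```

Running the same function on the full file gives a sense of how large the discrete part of the space is relative to `maxTrialNumber: 200`.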


**Log message**:
 - nnimanager.log:

[2023-05-04 12:04:13] INFO (main) Start NNI manager
[2023-05-04 12:04:13] INFO (RestServer) Starting REST server at port 8080, URL prefix: "/"
[2023-05-04 12:04:13] INFO (RestServer) REST server started.
[2023-05-04 12:04:13] INFO (NNIDataStore) Datastore initialization done
[2023-05-04 12:04:14] INFO (NNIManager) Starting experiment: yajeqwud
[2023-05-04 12:04:14] INFO (NNIManager) Setup training service...
[2023-05-04 12:04:14] INFO (NNIManager) Setup tuner...
[2023-05-04 12:04:14] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING
[2023-05-04 12:04:14] INFO (NNIManager) Add event listeners
[2023-05-04 12:04:14] INFO (LocalV3.local) Start
[2023-05-04 12:04:14] INFO (NNIManager) NNIManager received command from dispatcher: ID,
[2023-05-04 12:04:14] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"en_decoder": 8, "k1": 9, "k2": 11, "k3": 5, "k4": 11, "k5": 3, "k6": 5, "k7": 9, "k8": 5, "k9": 7, "f1": 32, "f2": 16, "f3": 16, "f4": 32, "f5": 16, "f6": 16, "f7": 8, "f8": 16, "f9": 16, "res_cnn": 3, "res_f1": 32, "res_f2": 32, "res_f3": 16, "res_k1": 5, "res_k2": 5, "res_k3": 3, "res_drop1": 0.15125745390112305, "res_drop2": 0.21885863079171017, "res_drop3": 0.19313110293876518, "bilstm": 2, "u1": 16, "u2": 8, "drop": 0.2758735965780924, "pu": 8, "su": 16, "batch_size": 80, "epochs": 15}, "parameter_index": 0}
[2023-05-04 12:04:15] INFO (NNIManager) submitTrialJob: form: { sequenceId: 0, hyperParameters: { value: '{"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"en_decoder": 8, "k1": 9, "k2": 11, "k3": 5, "k4": 11, "k5": 3, "k6": 5, "k7": 9, "k8": 5, "k9": 7, "f1": 32, "f2": 16, "f3": 16, "f4": 32, "f5": 16, "f6": 16, "f7": 8, "f8": 16, "f9": 16, "res_cnn": 3, "res_f1": 32, "res_f2": 32, "res_f3": 16, "res_k1": 5, "res_k2": 5, "res_k3": 3, "res_drop1": 0.15125745390112305, "res_drop2": 0.21885863079171017, "res_drop3": 0.19313110293876518, "bilstm": 2, "u1": 16, "u2": 8, "drop": 0.2758735965780924, "pu": 8, "su": 16, "batch_size": 80, "epochs": 15}, "parameter_index": 0}', index: 0 }, placementConstraint: { type: 'None', gpus: [] } }
[2023-05-04 12:04:15] INFO (GpuInfoCollector) Forced update: { gpuNumber: 1, driverVersion: '470.182.03', cudaVersion: 11060, gpus: [ { index: 0, model: 'NVIDIA A100-SXM4-80GB', gpuMemory: 85198045184, freeGpuMemory: 85197914112, gpuCoreUtilization: 0, gpuMemoryUtilization: 0 } ], processes: [], success: true, failures: [ 'cuda_cores: Function Not Found', 'process: Function Not Found' ] }
[2023-05-04 12:04:17] INFO (LocalV3.local) Register directory trial_code = /app

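To pull individual entries (for example the `GpuInfoCollector` line that carries the `cuda_cores: Function Not Found` message) out of log text like the above, whether or not its line breaks survived pasting, a small splitter can help. This is a minimal sketch that assumes every entry starts with a `[YYYY-MM-DD HH:MM:SS] LEVEL (component)` prefix, as in the excerpt. `split_log_entries` and `LogEntry` are hypothetical helpers, not part of NNI.

```python
import re
from typing import List, NamedTuple

# Prefix that opens every entry in the excerpt, e.g. "[2023-05-04 12:04:13] INFO (".
ENTRY_START = re.compile(
    r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (INFO|WARNING|ERROR|DEBUG) \("
)


class LogEntry(NamedTuple):
    timestamp: str
    level: str
    text: str  # "(component) message"


def split_log_entries(raw: str) -> List[LogEntry]:
    """Split run-together log text into entries at each timestamp prefix."""
    matches = list(ENTRY_START.finditer(raw))
    entries = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(raw)
        text = raw[m.end() - 1:end].strip()  # start at the "(" of the component
        entries.append(LogEntry(m.group(1), m.group(2), text))
    return entries


# Abridged copy of three entries from the log above, joined on one line.
raw = (
    "[2023-05-04 12:04:13] INFO (main) Start NNI manager "
    "[2023-05-04 12:04:14] INFO (LocalV3.local) Start "
    "[2023-05-04 12:04:15] INFO (GpuInfoCollector) Forced update: { success: true, "
    "failures: [ 'cuda_cores: Function Not Found', 'process: Function Not Found' ] }"
)
entries = split_log_entries(raw)
gpu = [e for e in entries if e.text.startswith("(GpuInfoCollector)")]
print(len(entries), gpu[0].level)
```

Note that the `GpuInfoCollector` entry is logged at INFO level and reports `success: true` alongside the `failures` list.
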
 - dispatcher.log:

[2023-05-04 13:04:14] INFO (nni.tuner.tpe/MainThread) Using random seed 2140802229
[2023-05-04 13:04:14] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher started
[2023-05-04 13:04:14] INFO (nni.runtime.msg_dispatcher/Thread-1 (command_queue_worker)) Initial search space: {'en_decoder': {'_type': 'choice', '_value': [7, 8, 9]}, 'k1': {'_type': 'choice', '_value': [3, 5, 7, 9, 11]}, 'k2': {'_type': 'choice', '_value': [3, 5, 7, 9, 11]}, 'k3': {'_type': 'choice', '_value': [3, 5, 7, 9, 11]}, 'k4': {'_type': 'choice', '_value': [3, 5, 7, 9, 11]}, 'k5': {'_type': 'choice', '_value': [3, 5, 7, 9, 11]}, 'k6': {'_type': 'choice', '_value': [3, 5, 7, 9, 11]}, 'k7': {'_type': 'choice', '_value': [3, 5, 7, 9, 11]}, 'k8': {'_type': 'choice', '_value': [3, 5, 7, 9, 11]}, 'k9': {'_type': 'choice', '_value': [3, 5, 7, 9, 11]}, 'f1': {'_type': 'choice', '_value': [8, 16, 32]}, 'f2': {'_type': 'choice', '_value': [8, 16, 32]}, 'f3': {'_type': 'choice', '_value': [8, 16, 32]}, 'f4': {'_type': 'choice', '_value': [8, 16, 32]}, 'f5': {'_type': 'choice', '_value': [8, 16, 32]}, 'f6': {'_type': 'choice', '_value': [8, 16, 32]}, 'f7': {'_type': 'choice', '_value': [8, 16, 32]}, 'f8': {'_type': 'choice', '_value': [8, 16, 32]}, 'f9': {'_type': 'choice', '_value': [8, 16, 32]}, 'res_cnn': {'_type': 'choice', '_value': [1, 2, 3]}, 'res_f1': {'_type': 'choice', '_value': [8, 16, 32]}, 'res_f2': {'_type': 'choice', '_value': [8, 16, 32]}, 'res_f3': {'_type': 'choice', '_value': [8, 16, 32]}, 'res_k1': {'_type': 'choice', '_value': [3, 5]}, 'res_k2': {'_type': 'choice', '_value': [3, 5]}, 'res_k3': {'_type': 'choice', '_value': [3, 5]}, 'res_drop1': {'_type': 'uniform', '_value': [0.1, 0.3]}, 'res_drop2': {'_type': 'uniform', '_value': [0.1, 0.3]}, 'res_drop3': {'_type': 'uniform', '_value': [0.1, 0.3]}, 'bilstm': {'_type': 'choice', '_value': [1, 2]}, 'u1': {'_type': 'choice', '_value': [8, 16]}, 'u2': {'_type': 'choice', '_value': [8, 16]}, 'drop': {'_type': 'uniform', '_value': [0.1, 0.3]}, 'pu': {'_type': 'choice', '_value': [8, 16]}, 'su': {'_type': 'choice', '_value': [8, 16]}, 'batch_size': {'_type': 'choice', '_value': [50, 80, 100]}, 'epochs': {'_type': 'choice', '_value': [10, 15, 20, 25, 30]}}
[2023-05-04 13:05:14] ERROR (nni.runtime.command_channel.websocket.channel/MainThread) Failed to receive command. Retry in 0s
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/protocol.py", line 968, in transfer_data
    message = await self.read_message()
  File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/protocol.py", line 1038, in read_message
    frame = await self.read_data_frame(max_size=self.max_size)
  File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/protocol.py", line 1113, in read_data_frame
    frame = await self.read_frame(max_size)
  File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/protocol.py", line 1170, in read_frame
    frame = await Frame.read(
  File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/framing.py", line 69, in read
    data = await reader(2)
  File "/usr/lib/python3.10/asyncio/streams.py", line 708, in readexactly
    await self._wait_for_data('readexactly')
  File "/usr/lib/python3.10/asyncio/streams.py", line 501, in _wait_for_data
    await self._waiter
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 99, in _receive_command command = conn.receive() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 103, in receive msg = _wait(self._ws.recv()) File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 121, in _wait return future.result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result return self.get_result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in get_result raise self._exception File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/protocol.py", line 568, in recv await self.ensure_open() File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/protocol.py", line 953, in ensure_open raise self.connection_closed_exc() websockets.exceptions.ConnectionClosedError: sent 1011 (unexpected error) keepalive ping timeout; no close frame received [2023-05-04 13:05:34] ERROR (nni.runtime.command_channel.websocket.channel/MainThread) Failed to receive command. 
Retry in 1s Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 666, in __await_impl__ await protocol.handshake( File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 326, in handshake status_code, response_headers = await self.read_http_response() File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 138, in read_http_response status_code, reason, headers = await read_response(self.reader) File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/http.py", line 120, in read_response status_line = await read_line(stream) File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/http.py", line 194, in read_line line = await stream.readline() File "/usr/lib/python3.10/asyncio/streams.py", line 524, in readline line = await self.readuntil(sep) File "/usr/lib/python3.10/asyncio/streams.py", line 616, in readuntil await self._wait_for_data('readuntil') File "/usr/lib/python3.10/asyncio/streams.py", line 501, in _wait_for_data await self._waiter asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for return fut.result() asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 98, in _receive_command conn = self._ensure_conn() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 75, in _ensure_conn self._conn.connect() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 65, in connect self._ws = _wait(_connect_async(self._url)) File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 121, in _wait return future.result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result return self.get_result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in get_result raise self._exception File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 135, in _connect_async return await websockets.connect(url, max_size=None) # type: ignore File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 659, in await_impl_timeout return await asyncio.wait_for(self.await_impl__(), self.open_timeout) File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for raise exceptions.TimeoutError() from exc asyncio.exceptions.TimeoutError [2023-05-04 13:05:55] ERROR (nni.runtime.command_channel.websocket.channel/MainThread) Failed to receive command. 
Retry in 2s Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 666, in await_impl__ await protocol.handshake( File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 326, in handshake status_code, response_headers = await self.read_http_response() File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 138, in read_http_response status_code, reason, headers = await read_response(self.reader) File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/http.py", line 120, in read_response status_line = await read_line(stream) File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/http.py", line 194, in read_line line = await stream.readline() File "/usr/lib/python3.10/asyncio/streams.py", line 524, in readline line = await self.readuntil(sep) File "/usr/lib/python3.10/asyncio/streams.py", line 616, in readuntil await self._wait_for_data('readuntil') File "/usr/lib/python3.10/asyncio/streams.py", line 501, in _wait_for_data await self._waiter asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for return fut.result() asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 98, in _receive_command conn = self._ensure_conn() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 75, in _ensure_conn self._conn.connect() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 65, in connect self._ws = _wait(_connect_async(self._url)) File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 121, in _wait return future.result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result return self.get_result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in get_result raise self._exception File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 135, in _connect_async return await websockets.connect(url, max_size=None) # type: ignore File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 659, in await_impl_timeout return await asyncio.wait_for(self.await_impl__(), self.open_timeout) File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for raise exceptions.TimeoutError() from exc asyncio.exceptions.TimeoutError [2023-05-04 13:06:17] ERROR (nni.runtime.command_channel.websocket.channel/MainThread) Failed to receive command. 
Retry in 3s Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 666, in await_impl__ await protocol.handshake( File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 326, in handshake status_code, response_headers = await self.read_http_response() File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 138, in read_http_response status_code, reason, headers = await read_response(self.reader) File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/http.py", line 120, in read_response status_line = await read_line(stream) File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/http.py", line 194, in read_line line = await stream.readline() File "/usr/lib/python3.10/asyncio/streams.py", line 524, in readline line = await self.readuntil(sep) File "/usr/lib/python3.10/asyncio/streams.py", line 616, in readuntil await self._wait_for_data('readuntil') File "/usr/lib/python3.10/asyncio/streams.py", line 501, in _wait_for_data await self._waiter asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for return fut.result() asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 98, in _receive_command conn = self._ensure_conn() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 75, in _ensure_conn self._conn.connect() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 65, in connect self._ws = _wait(_connect_async(self._url)) File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 121, in _wait return future.result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result return self.get_result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in get_result raise self._exception File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 135, in _connect_async return await websockets.connect(url, max_size=None) # type: ignore File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 659, in await_impl_timeout return await asyncio.wait_for(self.await_impl__(), self.open_timeout) File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for raise exceptions.TimeoutError() from exc asyncio.exceptions.TimeoutError [2023-05-04 13:06:40] ERROR (nni.runtime.command_channel.websocket.channel/MainThread) Failed to receive command. 
Retry in 4s Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 666, in await_impl__ await protocol.handshake( File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 326, in handshake status_code, response_headers = await self.read_http_response() File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 138, in read_http_response status_code, reason, headers = await read_response(self.reader) File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/http.py", line 120, in read_response status_line = await read_line(stream) File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/http.py", line 194, in read_line line = await stream.readline() File "/usr/lib/python3.10/asyncio/streams.py", line 524, in readline line = await self.readuntil(sep) File "/usr/lib/python3.10/asyncio/streams.py", line 616, in readuntil await self._wait_for_data('readuntil') File "/usr/lib/python3.10/asyncio/streams.py", line 501, in _wait_for_data await self._waiter asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for return fut.result() asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 98, in _receive_command conn = self._ensure_conn() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 75, in _ensure_conn self._conn.connect() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 65, in connect self._ws = _wait(_connect_async(self._url)) File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 121, in _wait return future.result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result return self.get_result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in get_result raise self._exception File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 135, in _connect_async return await websockets.connect(url, max_size=None) # type: ignore File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 659, in await_impl_timeout return await asyncio.wait_for(self.await_impl__(), self.open_timeout) File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for raise exceptions.TimeoutError() from exc asyncio.exceptions.TimeoutError [2023-05-04 13:06:44] WARNING (nni.runtime.command_channel.websocket.channel/MainThread) Failed to receive command. 
Last retry [2023-05-04 13:07:04] INFO (nni.runtime.msg_dispatcher_base/MainThread) Report error to NNI manager: Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 666, in await_impl__ await protocol.handshake( File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 326, in handshake status_code, response_headers = await self.read_http_response() File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 138, in read_http_response status_code, reason, headers = await read_response(self.reader) File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/http.py", line 120, in read_response status_line = await read_line(stream) File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/http.py", line 194, in read_line line = await stream.readline() File "/usr/lib/python3.10/asyncio/streams.py", line 524, in readline line = await self.readuntil(sep) File "/usr/lib/python3.10/asyncio/streams.py", line 616, in readuntil await self._wait_for_data('readuntil') File "/usr/lib/python3.10/asyncio/streams.py", line 501, in _wait_for_data await self._waiter asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for return fut.result() asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/nni/main.py", line 61, in main dispatcher.run() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/msg_dispatcher_base.py", line 69, in run command, data = self._channel._receive() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/tuner_command_channel/channel.py", line 270, in _receive command = self._channel.receive() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 59, in receive command = self._receive_command() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 108, in _receive_command conn = self._ensure_conn() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 75, in _ensure_conn self._conn.connect() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 65, in connect self._ws = _wait(_connect_async(self._url)) File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 121, in _wait return future.result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result return self.get_result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in get_result raise self._exception File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 135, in _connect_async return await websockets.connect(url, max_size=None) # type: ignore File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 659, in await_impl_timeout return await asyncio.wait_for(self.__await_impl__(), self.open_timeout) File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for raise exceptions.TimeoutError() from exc asyncio.exceptions.TimeoutError
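
The receive attempts above are logged with delays of 0s, 1s, 2s, 3s, and 4s, followed by a last retry and an error report to the NNI manager. The sketch below shows a generic retry loop with that kind of increasing delay. It illustrates the pattern only and is not NNI's channel code; `retry_with_delays`, its parameters, and the `flaky` stand-in are all hypothetical.

```python
import time
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def retry_with_delays(
    action: Callable[[], T],
    delays: Iterable[float],
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `action`; after each failure wait the next delay and try again.

    Once the delays are used up, one last attempt is made and any
    exception it raises propagates to the caller.
    """
    for delay in delays:
        try:
            return action()
        except Exception:
            print(f"Failed. Retry in {delay}s")
            sleep(delay)
    return action()  # last retry


attempts = []


def flaky() -> str:
    """Stand-in for a connection attempt that succeeds on the fourth call."""
    attempts.append(1)
    if len(attempts) < 4:
        raise TimeoutError("connection timed out")
    return "connected"


waited = []  # record delays instead of actually sleeping
result = retry_with_delays(flaky, delays=[0, 1, 2, 3, 4], sleep=waited.append)
print(result, waited)
```

In the log above every attempt times out, so a loop of this shape would exhaust its delays and surface the final `TimeoutError`, which matches the error that is then reported.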

[2023-05-04 13:07:04] ERROR (nni.runtime.command_channel.websocket.channel/MainThread) Failed to send command. Retry in 0s Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 666, in __await_impl__ await protocol.handshake( File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 326, in handshake status_code, response_headers = await self.read_http_response() File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 138, in read_http_response status_code, reason, headers = await read_response(self.reader) File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/http.py", line 120, in read_response status_line = await read_line(stream) File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/http.py", line 194, in read_line line = await stream.readline() File "/usr/lib/python3.10/asyncio/streams.py", line 524, in readline line = await self.readuntil(sep) File "/usr/lib/python3.10/asyncio/streams.py", line 616, in readuntil await self._wait_for_data('readuntil') File "/usr/lib/python3.10/asyncio/streams.py", line 501, in _wait_for_data await self._waiter asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for return fut.result() asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/nni/main.py", line 61, in main dispatcher.run() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/msg_dispatcher_base.py", line 69, in run command, data = self._channel._receive() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/tuner_command_channel/channel.py", line 270, in _receive command = self._channel.receive() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 59, in receive command = self._receive_command() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 108, in _receive_command conn = self._ensure_conn() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 75, in _ensure_conn self._conn.connect() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 65, in connect self._ws = _wait(_connect_async(self._url)) File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 121, in _wait return future.result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result return self.get_result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in get_result raise self._exception File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 135, in _connect_async return await websockets.connect(url, max_size=None) # type: ignore File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 659, in await_impl_timeout return await asyncio.wait_for(self.__await_impl__(), self.open_timeout) File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for raise exceptions.TimeoutError() from exc asyncio.exceptions.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 45, in send conn.send(command) File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 90, in send _wait(self._ws.send(nni.dump(message))) AttributeError: 'NoneType' object has no attribute 'send' [2023-05-04 13:07:24] ERROR (nni.runtime.command_channel.websocket.channel/MainThread) Failed to send command. Retry in 1s Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 666, in __await_impl__ await protocol.handshake( File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 326, in handshake status_code, response_headers = await self.read_http_response() File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 138, in read_http_response status_code, reason, headers = await read_response(self.reader) File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/http.py", line 120, in read_response status_line = await read_line(stream) File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/http.py", line 194, in read_line line = await stream.readline() File "/usr/lib/python3.10/asyncio/streams.py", line 524, in readline line = await self.readuntil(sep) File "/usr/lib/python3.10/asyncio/streams.py", line 616, in readuntil await self._wait_for_data('readuntil') File "/usr/lib/python3.10/asyncio/streams.py", line 501, in _wait_for_data await self._waiter asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for return fut.result() asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 44, in send conn = self._ensure_conn() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 75, in _ensure_conn self._conn.connect() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 65, in connect self._ws = _wait(_connect_async(self._url)) File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 121, in _wait return future.result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result return self.get_result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in get_result raise self._exception File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 135, in _connect_async return await websockets.connect(url, max_size=None) # type: ignore File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 659, in await_impl_timeout return await asyncio.wait_for(self.await_impl__(), self.open_timeout) File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for raise exceptions.TimeoutError() from exc asyncio.exceptions.TimeoutError [2023-05-04 13:07:46] ERROR (nni.runtime.command_channel.websocket.channel/MainThread) Failed to send command. 
Retry in 2s Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 666, in await_impl__ await protocol.handshake( File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 326, in handshake status_code, response_headers = await self.read_http_response() File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 138, in read_http_response status_code, reason, headers = await read_response(self.reader) File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/http.py", line 120, in read_response status_line = await read_line(stream) File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/http.py", line 194, in read_line line = await stream.readline() File "/usr/lib/python3.10/asyncio/streams.py", line 524, in readline line = await self.readuntil(sep) File "/usr/lib/python3.10/asyncio/streams.py", line 616, in readuntil await self._wait_for_data('readuntil') File "/usr/lib/python3.10/asyncio/streams.py", line 501, in _wait_for_data await self._waiter asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for return fut.result() asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 44, in send conn = self._ensure_conn() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 75, in _ensure_conn self._conn.connect() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 65, in connect self._ws = _wait(_connect_async(self._url)) File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 121, in _wait return future.result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result return self.get_result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in get_result raise self._exception File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 135, in _connect_async return await websockets.connect(url, max_size=None) # type: ignore File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 659, in await_impl_timeout return await asyncio.wait_for(self.await_impl__(), self.open_timeout) File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for raise exceptions.TimeoutError() from exc asyncio.exceptions.TimeoutError [2023-05-04 13:08:08] ERROR (nni.runtime.command_channel.websocket.channel/MainThread) Failed to send command. 
Retry in 3s Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 666, in await_impl__ await protocol.handshake( File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 326, in handshake status_code, response_headers = await self.read_http_response() File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 138, in read_http_response status_code, reason, headers = await read_response(self.reader) File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/http.py", line 120, in read_response status_line = await read_line(stream) File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/http.py", line 194, in read_line line = await stream.readline() File "/usr/lib/python3.10/asyncio/streams.py", line 524, in readline line = await self.readuntil(sep) File "/usr/lib/python3.10/asyncio/streams.py", line 616, in readuntil await self._wait_for_data('readuntil') File "/usr/lib/python3.10/asyncio/streams.py", line 501, in _wait_for_data await self._waiter asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for return fut.result() asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 44, in send conn = self._ensure_conn() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 75, in _ensure_conn self._conn.connect() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 65, in connect self._ws = _wait(_connect_async(self._url)) File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 121, in _wait return future.result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result return self.get_result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in get_result raise self._exception File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 135, in _connect_async return await websockets.connect(url, max_size=None) # type: ignore File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 659, in await_impl_timeout return await asyncio.wait_for(self.await_impl__(), self.open_timeout) File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for raise exceptions.TimeoutError() from exc asyncio.exceptions.TimeoutError [2023-05-04 13:08:31] ERROR (nni.runtime.command_channel.websocket.channel/MainThread) Failed to send command. 
Retry in 4s Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 666, in await_impl__ await protocol.handshake( File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 326, in handshake status_code, response_headers = await self.read_http_response() File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 138, in read_http_response status_code, reason, headers = await read_response(self.reader) File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/http.py", line 120, in read_response status_line = await read_line(stream) File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/http.py", line 194, in read_line line = await stream.readline() File "/usr/lib/python3.10/asyncio/streams.py", line 524, in readline line = await self.readuntil(sep) File "/usr/lib/python3.10/asyncio/streams.py", line 616, in readuntil await self._wait_for_data('readuntil') File "/usr/lib/python3.10/asyncio/streams.py", line 501, in _wait_for_data await self._waiter asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for return fut.result() asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 44, in send conn = self._ensure_conn() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 75, in _ensure_conn self._conn.connect() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 65, in connect self._ws = _wait(_connect_async(self._url)) File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 121, in _wait return future.result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result return self.get_result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in get_result raise self._exception File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 135, in _connect_async return await websockets.connect(url, max_size=None) # type: ignore File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 659, in await_impl_timeout return await asyncio.wait_for(self.await_impl__(), self.open_timeout) File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for raise exceptions.TimeoutError() from exc asyncio.exceptions.TimeoutError [2023-05-04 13:08:35] WARNING (nni.runtime.command_channel.websocket.channel/MainThread) Failed to send command {'type': 'ER', 'content': 'Traceback (most recent call last):\n File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 666, in await_impl\n await protocol.handshake(\n File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 326, in handshake\n status_code, response_headers = await self.read_http_response()\n File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 138, in read_http_response\n status_code, reason, headers = await read_response(self.reader)\n File 
"/usr/local/lib/python3.10/dist-packages/websockets/legacy/http.py", line 120, in read_response\n status_line = await read_line(stream)\n File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/http.py", line 194, in read_line\n line = await stream.readline()\n File "/usr/lib/python3.10/asyncio/streams.py", line 524, in readline\n line = await self.readuntil(sep)\n File "/usr/lib/python3.10/asyncio/streams.py", line 616, in readuntil\n await self._wait_for_data(\'readuntil\')\n File "/usr/lib/python3.10/asyncio/streams.py", line 501, in _wait_for_data\n await self._waiter\nasyncio.exceptions.CancelledError\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for\n return fut.result()\nasyncio.exceptions.CancelledError\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n File "/usr/local/lib/python3.10/dist-packages/nni/main.py", line 61, in main\n dispatcher.run()\n File "/usr/local/lib/python3.10/dist-packages/nni/runtime/msg_dispatcher_base.py", line 69, in run\n command, data = self._channel._receive()\n File "/usr/local/lib/python3.10/dist-packages/nni/runtime/tuner_command_channel/channel.py", line 270, in _receive\n command = self._channel.receive()\n File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 59, in receive\n command = self._receive_command()\n File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 108, in _receive_command\n conn = self._ensure_conn()\n File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 75, in _ensure_conn\n self._conn.connect()\n File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 65, in connect\n self._ws = _wait(_connect_async(self._url))\n File 
"/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 121, in _wait\n return future.result()\n File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result\n return self.get_result()\n File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in get_result\n raise self._exception\n File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 135, in _connect_async\n return await websockets.connect(url, max_size=None) # type: ignore\n File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 659, in await_impl_timeout\n return await asyncio.wait_for(self.await_impl(), self.open_timeout)\n File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for\n raise exceptions.TimeoutError() from exc\nasyncio.exceptions.TimeoutError\n'}. Last retry [2023-05-04 13:08:55] ERROR (nni.runtime.msg_dispatcher_base/MainThread) Connection to NNI manager is broken. Failed to report error. 
[2023-05-04 13:08:55] ERROR (nni.main/MainThread) Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 666, in await_impl__ await protocol.handshake( File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 326, in handshake status_code, response_headers = await self.read_http_response() File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 138, in read_http_response status_code, reason, headers = await read_response(self.reader) File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/http.py", line 120, in read_response status_line = await read_line(stream) File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/http.py", line 194, in read_line line = await stream.readline() File "/usr/lib/python3.10/asyncio/streams.py", line 524, in readline line = await self.readuntil(sep) File "/usr/lib/python3.10/asyncio/streams.py", line 616, in readuntil await self._wait_for_data('readuntil') File "/usr/lib/python3.10/asyncio/streams.py", line 501, in _wait_for_data await self._waiter asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for return fut.result() asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/nni/main.py", line 85, in main() File "/usr/local/lib/python3.10/dist-packages/nni/main.py", line 61, in main dispatcher.run() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/msg_dispatcher_base.py", line 69, in run command, data = self._channel._receive() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/tuner_command_channel/channel.py", line 270, in _receive command = self._channel.receive() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 59, in receive command = self._receive_command() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 108, in _receive_command conn = self._ensure_conn() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 75, in _ensure_conn self._conn.connect() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 65, in connect self._ws = _wait(_connect_async(self._url)) File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 121, in _wait return future.result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result return self.get_result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in get_result raise self._exception File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 135, in _connect_async return await websockets.connect(url, max_size=None) # type: ignore File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 659, in await_impl_timeout return await asyncio.wait_for(self.__await_impl__(), self.open_timeout) File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for raise exceptions.TimeoutError() from exc asyncio.exceptions.TimeoutError

 - nnictl stdout and stderr:

Experiment yajeqwud start: 2023-05-04 13:04:12.999020

Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 666, in __await_impl__ await protocol.handshake( File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 326, in handshake status_code, response_headers = await self.read_http_response() File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 138, in read_http_response status_code, reason, headers = await read_response(self.reader) File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/http.py", line 120, in read_response status_line = await read_line(stream) File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/http.py", line 194, in read_line line = await stream.readline() File "/usr/lib/python3.10/asyncio/streams.py", line 524, in readline line = await self.readuntil(sep) File "/usr/lib/python3.10/asyncio/streams.py", line 616, in readuntil await self._wait_for_data('readuntil') File "/usr/lib/python3.10/asyncio/streams.py", line 501, in _wait_for_data await self._waiter asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for return fut.result() asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "/usr/lib/python3.10/runpy.py", line 86, in _run_code exec(code, run_globals) File "/usr/local/lib/python3.10/dist-packages/nni/main.py", line 85, in main() File "/usr/local/lib/python3.10/dist-packages/nni/main.py", line 61, in main dispatcher.run() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/msg_dispatcher_base.py", line 69, in run command, data = self._channel._receive() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/tuner_command_channel/channel.py", line 270, in _receive command = self._channel.receive() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 59, in receive command = self._receive_command() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 108, in _receive_command conn = self._ensure_conn() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 75, in _ensure_conn self._conn.connect() File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 65, in connect self._ws = _wait(_connect_async(self._url)) File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 121, in _wait return future.result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result return self.get_result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in get_result raise self._exception File "/usr/local/lib/python3.10/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 135, in _connect_async return await websockets.connect(url, max_size=None) # type: ignore File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/client.py", line 659, in await_impl_timeout return await 
asyncio.wait_for(self.__await_impl__(), self.open_timeout) File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for raise exceptions.TimeoutError() from exc asyncio.exceptions.TimeoutError


--------------------------------------------------------------------------------
Experiment yajeqwud start: 2023-05-04 13:04:12.999020
--------------------------------------------------------------------------------

How to reproduce it?:

liuzhe-lz commented 1 year ago

Please pin typeguard to v2.x with pip install 'typeguard<3', or upgrade NNI to the v3.0 test version with pip install --extra-index-url https://test.pypi.org/simple/ nni==3.0b1.
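Before pinning, it may help to confirm which typeguard version is actually installed. The snippet below is only a minimal sketch using the standard library; the helper names (`installed_version`, `major_of`, `typeguard_needs_pin`) are made up for illustration and are not part of NNI or typeguard.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def major_of(version: str) -> int:
    """Extract the leading major number from a version such as '2.13.3'."""
    return int(version.split(".")[0])


def typeguard_needs_pin(version: Optional[str]) -> bool:
    """True when typeguard is installed at v3 or newer, the range the advice above avoids."""
    return version is not None and major_of(version) >= 3


if __name__ == "__main__":
    found = installed_version("typeguard")
    print("typeguard:", found, "| nni:", installed_version("nni"))
    if typeguard_needs_pin(found):
        print("Consider: pip install 'typeguard<3'")
```

If the check reports typeguard 3 or newer alongside NNI 2.x, the pin suggested above is the thing to try first.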

TayyabaZainab0807 commented 1 year ago

Please pin typeguard to v2.x with pip install 'typeguard<3',

NNI 2.10 with typeguard<3 works, but NNI 2.x has this issue: https://github.com/microsoft/nni/issues/5531. If I move to NNI 3.0b1 with the latest typeguard, it still fails with: [ 'cuda_cores: Function Not Found', 'process: Function Not Found' ]

cruiseliu commented 1 year ago

Please provide the version of nvidia-ml-py (pip list), nvidia driver, and cuda.

The error should be reproducible with the following script. Please check its output.

from pynvml import *
nvmlInit()
device = nvmlDeviceGetHandleByIndex(0)
cuda_cores = nvmlDeviceGetNumGpuCores(device)
print(cuda_cores)
nvmlShutdown()
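
The version details requested above can also be gathered from Python. This is a rough sketch rather than an official NNI diagnostic: `as_text`, `format_cuda_version`, and `report` are hypothetical helper names, and it assumes the installed `nvidia-ml-py` exposes `nvmlSystemGetDriverVersion` and `nvmlSystemGetCudaDriverVersion`.

```python
from importlib import metadata


def as_text(value) -> str:
    """Some pynvml builds return bytes, others str; normalise to str."""
    return value.decode() if isinstance(value, bytes) else str(value)


def format_cuda_version(raw: int) -> str:
    """NVML encodes the CUDA version as major * 1000 + minor * 10."""
    return f"{raw // 1000}.{(raw % 1000) // 10}"


def report() -> dict:
    """Collect the nvidia-ml-py, driver, and CUDA versions where available."""
    info = {}
    try:
        info["nvidia-ml-py"] = metadata.version("nvidia-ml-py")
    except metadata.PackageNotFoundError:
        info["nvidia-ml-py"] = "not installed"
    try:
        import pynvml
        pynvml.nvmlInit()
    except Exception as exc:  # missing module, missing driver, and so on
        info["nvml"] = f"unavailable: {exc}"
        return info
    try:
        info["driver"] = as_text(pynvml.nvmlSystemGetDriverVersion())
        info["cuda"] = format_cuda_version(pynvml.nvmlSystemGetCudaDriverVersion())
    except pynvml.NVMLError as exc:
        info["nvml"] = f"query failed: {exc}"
    finally:
        pynvml.nvmlShutdown()
    return info


if __name__ == "__main__":
    for key, value in report().items():
        print(f"{key}: {value}")
```

The CUDA value here is the version supported by the driver, which may differ from the toolkit version installed alongside a framework.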

cruiseliu commented 1 year ago

This seems related to this issue: https://github.com/NVIDIA/k8s-device-plugin/issues/331. They suggest upgrading the NVIDIA driver.

cruiseliu commented 1 year ago

After some investigation I found the real error is another one. I will push a fix later today.

TayyabaZainab0807 commented 1 year ago
cuda_cores = nvmlDeviceGetNumGpuCores(device)

My nvidia-ml-py version is 11.525.112.

Running this script fails with pynvml.NVMLError_FunctionNotFound: Function Not Found

cruiseliu commented 1 year ago

Please try out 3.0b2. The NVML error is non-critical and can be ignored.
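
For scripts that call NVML directly, one way to treat an unsupported query as non-critical is to catch the error and record the field as missing. The sketch below illustrates that idea only and is not the fix shipped in NNI; `optional_query` and `gpu_summary` are hypothetical names, and it assumes the installed `nvidia-ml-py` is recent enough to define `nvmlDeviceGetNumGpuCores`.

```python
from typing import Any, Callable, Optional, Tuple, Type


def optional_query(
    query: Callable[..., Any],
    *args: Any,
    ignore: Tuple[Type[BaseException], ...] = (Exception,),
) -> Optional[Any]:
    """Run a query and return None instead of raising when it is unsupported."""
    try:
        return query(*args)
    except ignore as exc:
        name = getattr(query, "__name__", repr(query))
        print(f"{name}: {exc}")
        return None


def gpu_summary(index: int = 0) -> dict:
    """Collect GPU details, leaving unsupported fields as None."""
    try:
        import pynvml
        pynvml.nvmlInit()
    except Exception as exc:  # pynvml missing or driver not loaded
        return {"error": str(exc)}
    nvml_errors = (pynvml.NVMLError,)
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        return {
            "name": optional_query(pynvml.nvmlDeviceGetName, handle, ignore=nvml_errors),
            # Assumed to exist in the installed nvidia-ml-py; older drivers may
            # raise NVMLError_FunctionNotFound here, which is tolerated.
            "cuda_cores": optional_query(
                pynvml.nvmlDeviceGetNumGpuCores, handle, ignore=nvml_errors
            ),
        }
    except pynvml.NVMLError as exc:
        return {"error": str(exc)}
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    print(gpu_summary())
```

Restricting `ignore` to `pynvml.NVMLError` keeps unrelated bugs visible while letting a driver that lacks one function still report the rest.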

Lijiaoa commented 1 year ago

Any updates on this, @TayyabaZainab0807?