microsoft / nni

An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License
13.88k stars 1.81k forks

My trial keeps on failing. #5792

Closed karmad84 closed 4 weeks ago

karmad84 commented 4 weeks ago

I am trying to run a NAS implementation, and every trial job ends up marked as failed. The dispatcher sends hyperparameter configurations for different neural network architectures to the NNI manager one by one. The manager creates trials (M0YoL, gNQ3c, etc.) with these configurations and submits them to the LocalV3.local service for execution. Each trial ends with a "FAILED" status within a couple of seconds of being created. Unfortunately, the logs don't reveal the specific reason for the failures. Even though the previous trials failed, the dispatcher keeps sending new hyperparameter configurations for the NNI manager to try.
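Because the manager log only records the status change, the reason for a failure usually has to come from the trial's own output. Below is a minimal sketch that prints the tail of every non-empty stderr-like file under an experiment directory. Two things are assumptions rather than facts from this issue: that the experiment's files live under a directory such as `~/nni-experiments/ldkwaep5`, and that each trial writes a file whose name contains `stderr`. The helper name `collect_stderr_tails` is made up for illustration.

```python
from pathlib import Path


def collect_stderr_tails(root, max_lines=20):
    """Return {path: last lines} for non-empty files under `root` whose
    name contains 'stderr'. The file naming is an assumption, not a
    documented NNI guarantee."""
    tails = {}
    root = Path(root).expanduser()
    if not root.is_dir():
        # Nothing to scan; the caller can decide how to report this.
        return tails
    for path in sorted(root.rglob("*stderr*")):
        if not path.is_file() or path.stat().st_size == 0:
            continue
        lines = path.read_text(errors="replace").splitlines()
        tails[str(path)] = lines[-max_lines:]
    return tails


if __name__ == "__main__":
    import sys

    # Hypothetical default path; the experiment ID comes from the log in this issue.
    default_dir = "~/nni-experiments/ldkwaep5"
    experiment_dir = sys.argv[1] if len(sys.argv) > 1 else default_dir
    for path, lines in collect_stderr_tails(experiment_dir).items():
        print(f"== {path}")
        print("\n".join(lines))
```

If this finds a Python traceback, it is likely a more direct pointer to the cause than anything in the manager or dispatcher logs.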

Here is a snippet from the NNI manager log:
[2024-06-06 12:31:17] INFO (main) Start NNI manager
[2024-06-06 12:31:17] INFO (RestServer) Starting REST server at port 8080, URL prefix: "/"
[2024-06-06 12:31:17] INFO (RestServer) REST server started.
[2024-06-06 12:31:17] INFO (NNIDataStore) Datastore initialization done
[2024-06-06 12:31:17] INFO (NNIManager) Starting experiment: ldkwaep5
[2024-06-06 12:31:17] INFO (NNIManager) Setup training service...
[2024-06-06 12:31:17] INFO (NNIManager) Setup tuner...
[2024-06-06 12:31:17] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING
[2024-06-06 12:31:18] INFO (NNIManager) Add event listeners
[2024-06-06 12:31:18] INFO (LocalV3.local) Start
[2024-06-06 12:31:18] INFO (NNIManager) NNIManager received command from dispatcher: ID,
[2024-06-06 12:31:18] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"num_layers": 3, "layer1_type": "LSTM", "layer1_units": 128, "layer2_type": "GRU", "layer2_units": 128, "layer3_type": "LSTM", "layer3_units": 32, "dropout_rate": 0.5}, "parameter_index": 0}
[2024-06-06 12:31:19] INFO (NNIManager) submitTrialJob: form: { sequenceId: 0, hyperParameters: { value: '{"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"num_layers": 3, "layer1_type": "LSTM", "layer1_units": 128, "layer2_type": "GRU", "layer2_units": 128, "layer3_type": "LSTM", "layer3_units": 32, "dropout_rate": 0.5}, "parameter_index": 0}', index: 0 }, placementConstraint: { type: 'None', gpus: [] } }
[2024-06-06 12:31:19] INFO (LocalV3.local) Register directory trial_code = /home/kd/nni/second_nas
[2024-06-06 12:31:19] INFO (LocalV3.local) Created trial M0YoL
[2024-06-06 12:31:20] INFO (LocalV3.local) Trial parameter: M0YoL {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"num_layers": 3, "layer1_type": "LSTM", "layer1_units": 128, "layer2_type": "GRU", "layer2_units": 128, "layer3_type": "LSTM", "layer3_units": 32, "dropout_rate": 0.5}, "parameter_index": 0}
[2024-06-06 12:31:21] INFO (NNIManager) Trial job M0YoL status changed from RUNNING to FAILED
[2024-06-06 12:31:21] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 1, "parameter_source": "algorithm", "parameters": {"num_layers": 3, "layer1_type": "GRU", "layer1_units": 64, "layer2_type": "GRU", "layer2_units": 64, "layer3_type": "LSTM", "layer3_units": 32, "dropout_rate": 0.5}, "parameter_index": 0}
[2024-06-06 12:31:21] INFO (NNIManager) submitTrialJob: form: { sequenceId: 1, hyperParameters: { value: '{"parameter_id": 1, "parameter_source": "algorithm", "parameters": {"num_layers": 3, "layer1_type": "GRU", "layer1_units": 64, "layer2_type": "GRU", "layer2_units": 64, "layer3_type": "LSTM", "layer3_units": 32, "dropout_rate": 0.5}, "parameter_index": 0}', index: 0 }, placementConstraint: { type: 'None', gpus: [] } }
[2024-06-06 12:31:21] INFO (LocalV3.local) Created trial gNQ3c
[2024-06-06 12:31:22] INFO (LocalV3.local) Trial parameter: gNQ3c {"parameter_id": 1, "parameter_source": "algorithm", "parameters": {"num_layers": 3, "layer1_type": "GRU", "layer1_units": 64, "layer2_type": "GRU", "layer2_units": 64, "layer3_type": "LSTM", "layer3_units": 32, "dropout_rate": 0.5}, "parameter_index": 0}
[2024-06-06 12:31:23] INFO (NNIManager) Trial job gNQ3c status changed from RUNNING to FAILED
[2024-06-06 12:31:23] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 2, "parameter_source": "algorithm", "parameters": {"num_layers": 3, "layer1_type": "LSTM", "layer1_units": 128, "layer2_type": "LSTM", "layer2_units": 128, "layer3_type": "GRU", "layer3_units": 128, "dropout_rate": 0.5}, "parameter_index": 0}
[2024-06-06 12:31:24] INFO (NNIManager) submitTrialJob: form: { sequenceId: 2, hyperParameters: { value: '{"parameter_id": 2, "parameter_source": "algorithm", "parameters": {"num_layers": 3, "layer1_type": "LSTM", "layer1_units": 128, "layer2_type": "LSTM", "layer2_units": 128, "layer3_type": "GRU", "layer3_units": 128, "dropout_rate": 0.5}, "parameter_index": 0}', index: 0 }, placementConstraint: { type: 'None', gpus: [] } }
[2024-06-06 12:31:24] INFO (LocalV3.local) Created trial MBlBa
[2024-06-06 12:31:25] INFO (LocalV3.local) Trial parameter: MBlBa {"parameter_id": 2, "parameter_source": "algorithm", "parameters": {"num_layers": 3, "layer1_type": "LSTM", "layer1_units": 128, "layer2_type": "LSTM", "layer2_units": 128, "layer3_type": "GRU", "layer3_units": 128, "dropout_rate": 0.5}, "parameter_index": 0}
[2024-06-06 12:31:26] INFO (NNIManager) Trial job MBlBa status changed from RUNNING to FAILED
[2024-06-06 12:31:26] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 3, "parameter_source": "algorithm", "parameters": {"num_layers": 2, "layer1_type": "GRU", "layer1_units": 64, "layer2_type": "LSTM", "layer2_units": 32, "layer3_type": "LSTM", "layer3_units": 128, "dropout_rate": 0.8}, "parameter_index": 0}

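The trial lifecycle in the manager log above can be summarized mechanically, which helps when the log is long. Below is a rough sketch that relies only on the two message shapes visible in the snippet ("Created trial ..." and "Trial job ... status changed from ... to ..."). The function name `summarize_trials` is illustrative, and the regular expressions are assumptions about the log format rather than a documented NNI interface.

```python
import re

# Patterns derived from the log lines quoted in this issue.
CREATED = re.compile(r"INFO \(LocalV3\.local\) Created trial (\w+)")
STATUS = re.compile(
    r"INFO \(NNIManager\) Trial job (\w+) status changed from (\w+) to (\w+)"
)


def summarize_trials(log_text):
    """Return {trial_id: last known status}, in order of trial creation."""
    final = {}
    for match in CREATED.finditer(log_text):
        final.setdefault(match.group(1), "CREATED")
    for match in STATUS.finditer(log_text):
        final[match.group(1)] = match.group(3)
    return final


if __name__ == "__main__":
    sample = (
        "[2024-06-06 12:31:19] INFO (LocalV3.local) Created trial M0YoL\n"
        "[2024-06-06 12:31:21] INFO (NNIManager) Trial job M0YoL status "
        "changed from RUNNING to FAILED\n"
        "[2024-06-06 12:31:21] INFO (LocalV3.local) Created trial gNQ3c\n"
    )
    print(summarize_trials(sample))
```

Because it uses `finditer` over the whole text, the sketch works whether or not the log entries are separated by newlines.
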
And the contents of the Dispatcher log:
[2024-06-06 12:31:18] INFO (numexpr.utils/MainThread) Note: NumExpr detected 20 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
[2024-06-06 12:31:18] INFO (numexpr.utils/MainThread) NumExpr defaulting to 8 threads.
[2024-06-06 12:31:18] INFO (nni.tuner.tpe/MainThread) Using random seed 1280300711
[2024-06-06 12:31:18] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher started
[2024-06-06 12:31:18] INFO (nni.runtime.msg_dispatcher/Thread-1 (command_queue_worker)) Initial search space: {'num_layers': {'_type': 'choice', '_value': [1, 2, 3]}, 'layer1_type': {'_type': 'choice', '_value': ['LSTM', 'GRU']}, 'layer1_units': {'_type': 'choice', '_value': [32, 64, 128]}, 'layer2_type': {'_type': 'choice', '_value': ['LSTM', 'GRU']}, 'layer2_units': {'_type': 'choice', '_value': [32, 64, 128]}, 'layer3_type': {'_type': 'choice', '_value': ['LSTM', 'GRU']}, 'layer3_units': {'_type': 'choice', '_value': [32, 64, 128]}, 'dropout_rate': {'_type': 'choice', '_value': [0.2, 0.5, 0.8]}}
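
The search space above is flat: `layer2_*` and `layer3_*` are always sampled, even when `num_layers` is smaller. Parameter set 3 in the manager log, for example, has `num_layers: 2` but still carries `layer3_type` and `layer3_units`. Below is a minimal sketch of how a trial script might turn one such parameter dictionary into a per-layer specification while ignoring the unused keys. The name `build_layer_specs` and the validation are illustrative and not taken from the poster's code.

```python
ALLOWED_TYPES = {"LSTM", "GRU"}  # the choices listed in the search space


def build_layer_specs(params):
    """Return ([(layer_type, units), ...], dropout_rate) for the first
    `num_layers` layers; keys for deeper layers are ignored."""
    num_layers = int(params["num_layers"])
    specs = []
    for i in range(1, num_layers + 1):
        layer_type = params[f"layer{i}_type"]
        units = int(params[f"layer{i}_units"])
        if layer_type not in ALLOWED_TYPES:
            raise ValueError(f"unsupported layer type: {layer_type!r}")
        specs.append((layer_type, units))
    return specs, float(params["dropout_rate"])


if __name__ == "__main__":
    # Parameter set 3 as it appears in the manager log.
    params = {
        "num_layers": 2,
        "layer1_type": "GRU", "layer1_units": 64,
        "layer2_type": "LSTM", "layer2_units": 32,
        "layer3_type": "LSTM", "layer3_units": 128,
        "dropout_rate": 0.8,
    }
    print(build_layer_specs(params))
    # prints ([('GRU', 64), ('LSTM', 32)], 0.8)
```

Running a function like this on a hard-coded dictionary, outside of NNI, is one way to check that the model-building code handles every combination the tuner can emit.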