microsoft / nni

An open-source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License

New trials are not being initiated after one ends #5531

Closed TayyabaZainab0807 closed 1 year ago

TayyabaZainab0807 commented 1 year ago

Describe the issue: New trials are not being created or started after one ends. The experiment started with two processes, and after one or both of them ended, no new process started.

Environment:

Configuration:

maxExperimentDuration: 156h
maxTrialNumber: 200
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
trainingService:
  platform: local
  useActiveGpu: True
  GpuIndices: 0,2

liuzhe-lz commented 1 year ago

Could you try out the v3.0 test version? It can be installed with pip install --extra-index-url https://test.pypi.org/simple/ nni==3.0b1
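After installing, it can help to confirm which NNI build the environment actually picks up. The following is a minimal sketch using only the Python standard library; the helper name `installed_version` is hypothetical and is not part of NNI.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the version string if nni is installed in this environment, otherwise None.
    print("nni version:", installed_version("nni"))
```

Run it with the same interpreter that launches the experiment, since a different virtual environment may hold a different NNI version.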

And please upload the nnimanager.log. It can be viewed from the web portal or found in nni-experiments/<experiment-id>/log.
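A small sketch for locating that file from a script, assuming the experiment directory is the default `~/nni-experiments` (pass `base_dir` if a different directory was configured). The helper names `nnimanager_log_path` and `tail` are hypothetical and are not part of NNI.

```python
from pathlib import Path
from typing import List, Optional


def nnimanager_log_path(experiment_id: str, base_dir: Optional[Path] = None) -> Path:
    """Build the path nni-experiments/<experiment-id>/log/nnimanager.log."""
    # Assumption: experiments live under ~/nni-experiments unless base_dir says otherwise.
    root = base_dir if base_dir is not None else Path.home() / "nni-experiments"
    return root / experiment_id / "log" / "nnimanager.log"


def tail(path: Path, n: int = 20) -> List[str]:
    """Return the last n lines of a text file (n should be positive)."""
    lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
    return lines[-n:]


if __name__ == "__main__":
    log_path = nnimanager_log_path("mxnhse5p")  # experiment ID taken from the log in this thread
    if log_path.exists():
        print("\n".join(tail(log_path)))
    else:
        print("No log found at", log_path)
```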

TayyabaZainab0807 commented 1 year ago

And please upload the nnimanager.log. It can be viewed from web portal or be found in nni-experiments/<experiment-id>/log.

For version 2.10 (I stopped the trial and tried to resume it, but it still did not start new trials):

[2023-04-19 10:15:31] INFO (main) Start NNI manager
[2023-04-19 10:15:31] INFO (NNIDataStore) Datastore initialization done
[2023-04-19 10:15:31] INFO (RestServer) Starting REST server at port 8080, URL prefix: "/"
[2023-04-19 10:15:31] INFO (RestServer) REST server started.
[2023-04-19 10:15:32] INFO (NNIManager) Starting experiment: mxnhse5p
[2023-04-19 10:15:32] INFO (NNIManager) Setup training service...
[2023-04-19 10:15:32] INFO (LocalTrainingService) Construct local machine training service.
[2023-04-19 10:15:32] INFO (NNIManager) Setup tuner...
[2023-04-19 10:15:32] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING
[2023-04-19 10:15:32] INFO (NNIManager) Add event listeners
[2023-04-19 10:15:32] INFO (LocalTrainingService) Run local machine training service.
[2023-04-19 10:15:32] INFO (NNIManager) NNIManager received command from dispatcher: ID,
[2023-04-19 10:15:32] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-04-19 10:15:32] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"en_decoder": 7, "k1": 3, "k2": 7, "k3": 11, "k4": 11, "k5": 7, "k6": 9, "k7": 3, "k8": 5, "k9": 5, "f1": 16, "f2": 32, "f3": 16, "f4": 8, "f5": 16, "f6": 32, "f7": 32, "f8": 16, "f9": 16, "res_cnn": 2, "res_f1": 16, "res_f2": 32, "res_f3": 8, "res_k1": 3, "res_k2": 3, "res_k3": 3, "res_drop1": 0.2525122934118338, "res_drop2": 0.13917639281890776, "res_drop3": 0.19210184998292734, "bilstm": 2, "u1": 16, "u2": 8, "drop": 0.22140827745640773, "pu": 16, "su": 16, "batch_size": 80, "epochs": 30}, "parameter_index": 0}
[2023-04-19 10:15:32] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 1, "parameter_source": "algorithm", "parameters": {"en_decoder": 9, "k1": 11, "k2": 7, "k3": 3, "k4": 7, "k5": 11, "k6": 11, "k7": 3, "k8": 9, "k9": 5, "f1": 8, "f2": 16, "f3": 16, "f4": 16, "f5": 16, "f6": 16, "f7": 8, "f8": 8, "f9": 32, "res_cnn": 1, "res_f1": 16, "res_f2": 32, "res_f3": 8, "res_k1": 5, "res_k2": 3, "res_k3": 3, "res_drop1": 0.2516317871306879, "res_drop2": 0.1584086594988994, "res_drop3": 0.1867130439612617, "bilstm": 2, "u1": 16, "u2": 16, "drop": 0.14852204390631057, "pu": 8, "su": 16, "batch_size": 50, "epochs": 30}, "parameter_index": 0}
[2023-04-19 10:15:37] INFO (NNIManager) submitTrialJob: form: { sequenceId: 0, hyperParameters: { value: '{"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"en_decoder": 7, "k1": 3, "k2": 7, "k3": 11, "k4": 11, "k5": 7, "k6": 9, "k7": 3, "k8": 5, "k9": 5, "f1": 16, "f2": 32, "f3": 16, "f4": 8, "f5": 16, "f6": 32, "f7": 32, "f8": 16, "f9": 16, "res_cnn": 2, "res_f1": 16, "res_f2": 32, "res_f3": 8, "res_k1": 3, "res_k2": 3, "res_k3": 3, "res_drop1": 0.2525122934118338, "res_drop2": 0.13917639281890776, "res_drop3": 0.19210184998292734, "bilstm": 2, "u1": 16, "u2": 8, "drop": 0.22140827745640773, "pu": 16, "su": 16, "batch_size": 80, "epochs": 30}, "parameter_index": 0}', index: 0 }, placementConstraint: { type: 'None', gpus: [] } }
[2023-04-19 10:15:37] INFO (NNIManager) submitTrialJob: form: { sequenceId: 1, hyperParameters: { value: '{"parameter_id": 1, "parameter_source": "algorithm", "parameters": {"en_decoder": 9, "k1": 11, "k2": 7, "k3": 3, "k4": 7, "k5": 11, "k6": 11, "k7": 3, "k8": 9, "k9": 5, "f1": 8, "f2": 16, "f3": 16, "f4": 16, "f5": 16, "f6": 16, "f7": 8, "f8": 8, "f9": 32, "res_cnn": 1, "res_f1": 16, "res_f2": 32, "res_f3": 8, "res_k1": 5, "res_k2": 3, "res_k3": 3, "res_drop1": 0.2516317871306879, "res_drop2": 0.1584086594988994, "res_drop3": 0.1867130439612617, "bilstm": 2, "u1": 16, "u2": 16, "drop": 0.14852204390631057, "pu": 8, "su": 16, "batch_size": 50, "epochs": 30}, "parameter_index": 0}', index: 0 }, placementConstraint: { type: 'None', gpus: [] } }
[2023-04-19 10:15:47] INFO (NNIManager) Trial job cYxQ1 status changed from WAITING to RUNNING
[2023-04-19 10:15:47] INFO (NNIManager) Trial job NEkVg status changed from WAITING to RUNNING
[2023-04-25 22:28:28] INFO (NNIManager) Change NNIManager status from: RUNNING to: NO_MORE_TRIAL
[2023-04-26 02:29:52] INFO (NNIManager) Trial job NEkVg status changed from RUNNING to SUCCEEDED
[2023-04-29 08:06:38] INFO (NNIManager) Trial job cYxQ1 status changed from RUNNING to SUCCEEDED
[2023-04-29 08:06:38] INFO (NNIManager) Change NNIManager status from: NO_MORE_TRIAL to: DONE
[2023-04-29 08:06:38] INFO (NNIManager) Experiment done.
[2023-04-29 11:09:37] INFO (ShutdownManager) Initiate shutdown: SIGTERM
[2023-04-29 11:09:37] INFO (RestServer) Stopping REST server.
[2023-04-29 11:09:37] INFO (NNIManager) Change NNIManager status from: DONE to: STOPPING
[2023-04-29 11:09:37] INFO (NNIManager) Stopping experiment, cleaning up ...
[2023-04-29 11:09:39] INFO (RestServer) REST server stopped.
[2023-04-29 11:09:39] INFO (LocalTrainingService) Stopping local machine training service...
[2023-04-29 11:09:39] INFO (NNIManager) Change NNIManager status from: STOPPING to: STOPPED
[2023-04-29 11:09:39] INFO (NNIManager) Experiment stopped.
[2023-04-29 11:09:39] INFO (NNITensorboardManager) Forced stopping all tensorboard task.
[2023-04-29 11:09:39] INFO (NNITensorboardManager) All tensorboard task stopped.
[2023-04-29 11:09:39] INFO (NNITensorboardManager) Tensorboard manager stopped.
[2023-04-29 11:09:39] INFO (ShutdownManager) Shutdown complete.
[2023-04-29 11:10:17] INFO (main) Start NNI manager
[2023-04-29 11:10:17] INFO (NNIDataStore) Datastore initialization done
[2023-04-29 11:10:17] INFO (RestServer) Starting REST server at port 8080, URL prefix: "/"
[2023-04-29 11:10:17] INFO (RestServer) REST server started.
[2023-04-29 11:10:18] INFO (NNIManager) Resuming experiment: mxnhse5p
[2023-04-29 11:10:23] INFO (NNIManager) Setup training service...
[2023-04-29 11:10:23] INFO (LocalTrainingService) Construct local machine training service.
[2023-04-29 11:10:23] INFO (NNIManager) Setup tuner...
[2023-04-29 11:10:23] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING
[2023-04-29 11:10:24] INFO (NNIManager) Add event listeners
[2023-04-29 11:10:24] INFO (LocalTrainingService) Run local machine training service.
[2023-04-29 11:10:24] INFO (NNIManager) Change NNIManager status from: RUNNING to: NO_MORE_TRIAL
[2023-04-29 11:10:24] INFO (NNIManager) Change NNIManager status from: NO_MORE_TRIAL to: DONE
[2023-04-29 11:10:24] INFO (NNIManager) NNIManager received command from dispatcher: ID,
[2023-04-29 11:10:24] WARNING (GPUScheduler) gpu_metrics file does not exist!
[2023-04-29 11:10:24] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 1, "parameter_source": "algorithm", "parameters": {"en_decoder": 8, "k1": 11, "k2": 3, "k3": 9, "k4": 3, "k5": 3, "k6": 3, "k7": 3, "k8": 3, "k9": 3, "f1": 16, "f2": 32, "f3": 8, "f4": 32, "f5": 32, "f6": 16, "f7": 32, "f8": 32, "f9": 8, "res_cnn": 1, "res_f1": 16, "res_f2": 8, "res_f3": 32, "res_k1": 5, "res_k2": 3, "res_k3": 3, "res_drop1": 0.15977192966058312, "res_drop2": 0.2840966589269582, "res_drop3": 0.22745375033819032, "bilstm": 1, "u1": 8, "u2": 8, "drop": 0.10778863307545457, "pu": 16, "su": 16, "batch_size": 80, "epochs": 25}, "parameter_index": 0}
[2023-04-29 11:10:24] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 2, "parameter_source": "algorithm", "parameters": {"en_decoder": 9, "k1": 3, "k2": 11, "k3": 5, "k4": 9, "k5": 9, "k6": 9, "k7": 11, "k8": 7, "k9": 11, "f1": 8, "f2": 32, "f3": 16, "f4": 32, "f5": 8, "f6": 8, "f7": 32, "f8": 8, "f9": 16, "res_cnn": 1, "res_f1": 16, "res_f2": 16, "res_f3": 32, "res_k1": 5, "res_k2": 3, "res_k3": 5, "res_drop1": 0.17883635203038534, "res_drop2": 0.260807767493224, "res_drop3": 0.12370037989344801, "bilstm": 1, "u1": 16, "u2": 8, "drop": 0.14551183581312965, "pu": 8, "su": 8, "batch_size": 50, "epochs": 30}, "parameter_index": 0}
[2023-04-29 11:10:24] INFO (NNIManager) Experiment done.
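In the 2.10 log above, the manager moves from RUNNING to NO_MORE_TRIAL while both trials are still running, and after the resume it goes to DONE within the same second. To make such transitions easier to spot in a long dump, here is a rough sketch that pulls out only the manager status changes. The regular expressions are inferred from the log lines in this thread, not from any NNI specification, and the function name is hypothetical.

```python
import re
from typing import List, Tuple

_TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"

# One entry looks like: [2023-04-19 10:15:32] INFO (NNIManager) message...
# The lookahead ends a message at the next timestamp, so flattened dumps also work.
ENTRY_RE = re.compile(
    r"\[(" + _TS + r")\] (\w+) \(([^)]+)\) (.*?)(?=\s*\[" + _TS + r"\] |\s*\Z)",
    re.DOTALL,
)
STATUS_RE = re.compile(r"Change NNIManager status from: (\w+) to: (\w+)")


def manager_status_changes(log_text: str) -> List[Tuple[str, str, str]]:
    """Return (timestamp, old_status, new_status) for each manager status change."""
    changes = []
    for timestamp, _level, component, message in ENTRY_RE.findall(log_text):
        match = STATUS_RE.search(message)
        if component == "NNIManager" and match:
            changes.append((timestamp, match.group(1), match.group(2)))
    return changes


if __name__ == "__main__":
    sample = (
        "[2023-04-19 10:15:32] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING "
        "[2023-04-25 22:28:28] INFO (NNIManager) Change NNIManager status from: RUNNING to: NO_MORE_TRIAL "
        "[2023-04-26 02:29:52] INFO (NNIManager) Trial job NEkVg status changed from RUNNING to SUCCEEDED"
    )
    for change in manager_status_changes(sample):
        print(change)
```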

TayyabaZainab0807 commented 1 year ago

It can be installed with pip install --extra-index-url https://test.pypi.org/simple/ nni==3.0b1

My trials take 5 to 6 days to complete, but as a test I started two processes and canceled one of them; a new process started after the cancellation.

[2023-04-29 11:27:11] INFO (RestServer) Starting REST server at port 8080, URL prefix: "/"
[2023-04-29 11:27:11] INFO (RestServer) REST server started.
[2023-04-29 11:27:11] INFO (NNIDataStore) Datastore initialization done
[2023-04-29 11:27:11] INFO (NNIManager) Starting experiment: 6a78jsrd
[2023-04-29 11:27:12] INFO (NNIManager) Setup training service...
[2023-04-29 11:27:12] INFO (NNIManager) Setup tuner...
[2023-04-29 11:27:12] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING
[2023-04-29 11:27:12] INFO (NNIManager) Add event listeners
[2023-04-29 11:27:12] INFO (LocalV3.local) Start
[2023-04-29 11:27:12] INFO (NNIManager) NNIManager received command from dispatcher: ID,
[2023-04-29 11:27:12] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"en_decoder": 7, "k1": 3, "k2": 11, "k3": 7, "k4": 3, "k5": 5, "k6": 9, "k7": 11, "k8": 9, "k9": 9, "f1": 16, "f2": 8, "f3": 16, "f4": 16, "f5": 8, "f6": 8, "f7": 16, "f8": 32, "f9": 32, "res_cnn": 2, "res_f1": 32, "res_f2": 16, "res_f3": 16, "res_k1": 3, "res_k2": 5, "res_k3": 5, "res_drop1": 0.2499935274247463, "res_drop2": 0.250394492016827, "res_drop3": 0.2203186542998259, "bilstm": 2, "u1": 8, "u2": 8, "drop": 0.23949113932108115, "pu": 16, "su": 16, "batch_size": 80, "epochs": 25}, "parameter_index": 0}
[2023-04-29 11:27:12] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 1, "parameter_source": "algorithm", "parameters": {"en_decoder": 8, "k1": 3, "k2": 7, "k3": 3, "k4": 7, "k5": 7, "k6": 9, "k7": 7, "k8": 9, "k9": 11, "f1": 32, "f2": 8, "f3": 8, "f4": 16, "f5": 8, "f6": 32, "f7": 32, "f8": 16, "f9": 16, "res_cnn": 2, "res_f1": 16, "res_f2": 8, "res_f3": 32, "res_k1": 3, "res_k2": 3, "res_k3": 5, "res_drop1": 0.21285547621424344, "res_drop2": 0.2414724447753369, "res_drop3": 0.15161235302701628, "bilstm": 1, "u1": 16, "u2": 16, "drop": 0.1257707483889307, "pu": 8, "su": 8, "batch_size": 80, "epochs": 30}, "parameter_index": 0}
[2023-04-29 11:27:13] INFO (NNIManager) submitTrialJob: form: { sequenceId: 0, hyperParameters: { value: '{"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"en_decoder": 7, "k1": 3, "k2": 11, "k3": 7, "k4": 3, "k5": 5, "k6": 9, "k7": 11, "k8": 9, "k9": 9, "f1": 16, "f2": 8, "f3": 16, "f4": 16, "f5": 8, "f6": 8, "f7": 16, "f8": 32, "f9": 32, "res_cnn": 2, "res_f1": 32, "res_f2": 16, "res_f3": 16, "res_k1": 3, "res_k2": 5, "res_k3": 5, "res_drop1": 0.2499935274247463, "res_drop2": 0.250394492016827, "res_drop3": 0.2203186542998259, "bilstm": 2, "u1": 8, "u2": 8, "drop": 0.23949113932108115, "pu": 16, "su": 16, "batch_size": 80, "epochs": 25}, "parameter_index": 0}', index: 0 }, placementConstraint: { type: 'None', gpus: [] } }
[2023-04-29 11:27:13] INFO (GpuInfoCollector) Forced update: { gpuNumber: 4, driverVersion: '530.30.02', cudaVersion: 12010, gpus: [ { index: 0, model: 'NVIDIA TITAN X (Pascal)', cudaCores: 3584, gpuMemory: 12884901888, freeGpuMemory: 12779847680, gpuCoreUtilization: 0, gpuMemoryUtilization: 0 }, { index: 1, model: 'NVIDIA TITAN X (Pascal)', cudaCores: 3584, gpuMemory: 12884901888, freeGpuMemory: 12779847680, gpuCoreUtilization: 0, gpuMemoryUtilization: 0 }, { index: 2, model: 'NVIDIA TITAN Xp', cudaCores: 3840, gpuMemory: 12884901888, freeGpuMemory: 12779847680, gpuCoreUtilization: 0, gpuMemoryUtilization: 0 }, { index: 3, model: 'NVIDIA TITAN Xp', cudaCores: 3840, gpuMemory: 12884901888, freeGpuMemory: 12779847680, gpuCoreUtilization: 0, gpuMemoryUtilization: 0 } ], processes: [], success: true }
[2023-04-29 11:27:15] INFO (LocalV3.local) Register directory trial_code = /home/tza/EQmodel-nni
[2023-04-29 11:27:15] INFO (LocalV3.local) Created trial fuooO
[2023-04-29 11:27:15] INFO (NNIManager) submitTrialJob: form: { sequenceId: 1, hyperParameters: { value: '{"parameter_id": 1, "parameter_source": "algorithm", "parameters": {"en_decoder": 8, "k1": 3, "k2": 7, "k3": 3, "k4": 7, "k5": 7, "k6": 9, "k7": 7, "k8": 9, "k9": 11, "f1": 32, "f2": 8, "f3": 8, "f4": 16, "f5": 8, "f6": 32, "f7": 32, "f8": 16, "f9": 16, "res_cnn": 2, "res_f1": 16, "res_f2": 8, "res_f3": 32, "res_k1": 3, "res_k2": 3, "res_k3": 5, "res_drop1": 0.21285547621424344, "res_drop2": 0.2414724447753369, "res_drop3": 0.15161235302701628, "bilstm": 1, "u1": 16, "u2": 16, "drop": 0.1257707483889307, "pu": 8, "su": 8, "batch_size": 80, "epochs": 30}, "parameter_index": 0}', index: 0 }, placementConstraint: { type: 'None', gpus: [] } }
[2023-04-29 11:27:15] INFO (LocalV3.local) Created trial VhJIe
[2023-04-29 11:27:18] INFO (LocalV3.local) Trial parameter: VhJIe {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"en_decoder": 7, "k1": 3, "k2": 11, "k3": 7, "k4": 3, "k5": 5, "k6": 9, "k7": 11, "k8": 9, "k9": 9, "f1": 16, "f2": 8, "f3": 16, "f4": 16, "f5": 8, "f6": 8, "f7": 16, "f8": 32, "f9": 32, "res_cnn": 2, "res_f1": 32, "res_f2": 16, "res_f3": 16, "res_k1": 3, "res_k2": 5, "res_k3": 5, "res_drop1": 0.2499935274247463, "res_drop2": 0.250394492016827, "res_drop3": 0.2203186542998259, "bilstm": 2, "u1": 8, "u2": 8, "drop": 0.23949113932108115, "pu": 16, "su": 16, "batch_size": 80, "epochs": 25}, "parameter_index": 0}
[2023-04-29 11:27:18] INFO (LocalV3.local) Trial parameter: fuooO {"parameter_id": 1, "parameter_source": "algorithm", "parameters": {"en_decoder": 8, "k1": 3, "k2": 7, "k3": 3, "k4": 7, "k5": 7, "k6": 9, "k7": 7, "k8": 9, "k9": 11, "f1": 32, "f2": 8, "f3": 8, "f4": 16, "f5": 8, "f6": 32, "f7": 32, "f8": 16, "f9": 16, "res_cnn": 2, "res_f1": 16, "res_f2": 8, "res_f3": 32, "res_k1": 3, "res_k2": 3, "res_k3": 5, "res_drop1": 0.21285547621424344, "res_drop2": 0.2414724447753369, "res_drop3": 0.15161235302701628, "bilstm": 1, "u1": 16, "u2": 16, "drop": 0.1257707483889307, "pu": 8, "su": 8, "batch_size": 80, "epochs": 30}, "parameter_index": 0}
[2023-04-29 11:27:37] INFO (NNIManager) User cancelTrialJob: fuooO
[2023-04-29 11:27:37] INFO (LocalV3.local) Stop trial fuooO
[2023-04-29 11:27:40] INFO (NNIManager) Trial job fuooO status changed from RUNNING to USER_CANCELED
[2023-04-29 11:27:40] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 2, "parameter_source": "algorithm", "parameters": {"en_decoder": 9, "k1": 5, "k2": 11, "k3": 9, "k4": 11, "k5": 3, "k6": 3, "k7": 3, "k8": 9, "k9": 9, "f1": 8, "f2": 16, "f3": 8, "f4": 32, "f5": 8, "f6": 32, "f7": 16, "f8": 32, "f9": 16, "res_cnn": 3, "res_f1": 32, "res_f2": 8, "res_f3": 32, "res_k1": 5, "res_k2": 3, "res_k3": 5, "res_drop1": 0.15890780187656867, "res_drop2": 0.1398222691951549, "res_drop3": 0.12740633444038738, "bilstm": 2, "u1": 16, "u2": 8, "drop": 0.2601178544936399, "pu": 16, "su": 16, "batch_size": 80, "epochs": 10}, "parameter_index": 0}
[2023-04-29 11:27:41] INFO (NNIManager) submitTrialJob: form: { sequenceId: 2, hyperParameters: { value: '{"parameter_id": 2, "parameter_source": "algorithm", "parameters": {"en_decoder": 9, "k1": 5, "k2": 11, "k3": 9, "k4": 11, "k5": 3, "k6": 3, "k7": 3, "k8": 9, "k9": 9, "f1": 8, "f2": 16, "f3": 8, "f4": 32, "f5": 8, "f6": 32, "f7": 16, "f8": 32, "f9": 16, "res_cnn": 3, "res_f1": 32, "res_f2": 8, "res_f3": 32, "res_k1": 5, "res_k2": 3, "res_k3": 5, "res_drop1": 0.15890780187656867, "res_drop2": 0.1398222691951549, "res_drop3": 0.12740633444038738, "bilstm": 2, "u1": 16, "u2": 8, "drop": 0.2601178544936399, "pu": 16, "su": 16, "batch_size": 80, "epochs": 10}, "parameter_index": 0}', index: 0 }, placementConstraint: { type: 'None', gpus: [] } }
[2023-04-29 11:27:41] INFO (LocalV3.local) Created trial rmrL4
[2023-04-29 11:27:44] INFO (LocalV3.local) Trial parameter: rmrL4 {"parameter_id": 2, "parameter_source": "algorithm", "parameters": {"en_decoder": 9, "k1": 5, "k2": 11, "k3": 9, "k4": 11, "k5": 3, "k6": 3, "k7": 3, "k8": 9, "k9": 9, "f1": 8, "f2": 16, "f3": 8, "f4": 32, "f5": 8, "f6": 32, "f7": 16, "f8": 32, "f9": 16, "res_cnn": 3, "res_f1": 32, "res_f2": 8, "res_f3": 32, "res_k1": 5, "res_k2": 3, "res_k3": 5, "res_drop1": 0.15890780187656867, "res_drop2": 0.1398222691951549, "res_drop3": 0.12740633444038738, "bilstm": 2, "u1": 16, "u2": 8, "drop": 0.2601178544936399, "pu": 16, "su": 16, "batch_size": 80, "epochs": 10}, "parameter_index": 0}
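One way to check whether a replacement trial was submitted after a trial ended is to list the `submitTrialJob` sequence IDs next to the trial status changes. The sketch below does that with two regular expressions inferred from the log lines in this thread; the function names are hypothetical, and the patterns may need adjusting for other NNI versions.

```python
import re
from typing import List, Tuple

# Patterns inferred from the log lines in this thread (an assumption, not an NNI spec).
SUBMIT_RE = re.compile(r"submitTrialJob: form: \{ sequenceId: (\d+)")
TRIAL_STATUS_RE = re.compile(r"Trial job (\w+) status changed\s+from (\w+) to (\w+)")


def submitted_sequence_ids(log_text: str) -> List[int]:
    """Return the sequenceId of every submitTrialJob entry, in log order."""
    return [int(seq) for seq in SUBMIT_RE.findall(log_text)]


def trial_status_changes(log_text: str) -> List[Tuple[str, str, str]]:
    """Return (trial_id, old_status, new_status) for every trial status change."""
    return TRIAL_STATUS_RE.findall(log_text)


if __name__ == "__main__":
    sample = (
        "[2023-04-29 11:27:15] INFO (NNIManager) submitTrialJob: form: { sequenceId: 1, hyperParameters: {} } "
        "[2023-04-29 11:27:40] INFO (NNIManager) Trial job fuooO status changed from RUNNING to USER_CANCELED "
        "[2023-04-29 11:27:41] INFO (NNIManager) submitTrialJob: form: { sequenceId: 2, hyperParameters: {} }"
    )
    print(submitted_sequence_ids(sample))
    print(trial_status_changes(sample))
```

In the 3.0b1 log, sequence ID 2 is submitted one second after trial fuooO changes to USER_CANCELED, whereas the 2.10 log shows no submission beyond sequence IDs 0 and 1.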

TayyabaZainab0807 commented 1 year ago

pip install --extra-index-url https://test.pypi.org/simple/ nni==3.0b1

It solved the issue.

Lijiaoa commented 1 year ago

I'm glad to hear that your problem has been solved. Could you close this issue? @TayyabaZainab0807