microsoft / nni

An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License

Mismatched hyperparameters between web server display and their actual values #5726

Open WenjieDu opened 6 months ago

WenjieDu commented 6 months ago

Describe the issue:

Environment:

Configuration:

 - Trial config:
```yaml
trial:
  command: enable_tuning=1 pypots-cli tuning --model pypots.imputation.MRNN --train_set ../../data/ettm1/train.h5 --val_set ../../data/ettm1/val.h5
  codeDir: .
  gpuNum: 1

localConfig:
  useActiveGpu: true
  maxTrialNumPerGpu: 20
  gpuIndices: 3
```

 - Search space:
```json
{
  "n_steps": {"_type": "choice", "_value": [60]},
  "n_features": {"_type": "choice", "_value": [7]},
  "patience": {"_type": "choice", "_value": [10]},
  "epochs": {"_type": "choice", "_value": [200]},
  "rnn_hidden_size": {"_type": "choice", "_value": [16, 32, 64, 128, 256, 512]},
  "lr": {"_type": "loguniform", "_value": [0.0001, 0.01]}
}
```

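For reference, NNI documents `loguniform` as drawing `exp(uniform(log(low), log(high)))`, so every sampled `lr` from the search space above should fall within [0.0001, 0.01]. A minimal sketch of that sampling rule in plain Python, independent of NNI itself:

```python
import math
import random

def sample_loguniform(low: float, high: float, rng: random.Random) -> float:
    """Log-uniform sample in [low, high]: exp(uniform(log(low), log(high)))."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
samples = [sample_loguniform(1e-4, 1e-2, rng) for _ in range(1000)]
# Every sample stays inside the configured bounds.
assert all(1e-4 <= s <= 1e-2 for s in samples)
```

Both `lr` values quoted below (0.00086… and 0.00544…) are inside this range, so the mismatch is not a matter of one value being out of bounds; the issue is that two different in-range samples are being attributed to the same trial.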
Log message:

How to reproduce it?:

Note that in nnimanager.log, the lr of trial XsB6F is 0.0008698020401037771, which is also the value displayed on the local web page. However, in the nnictl stdout log, the actual lr received by the model is 0.0054442307300676335, so the two do not match. This is not an isolated case: for some trials, the hyperparameters reported by nnimanager differ from the values the trials actually receive, while for other trials they match and everything is fine.
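One way to confirm the mismatch systematically is to log the parameter dict each trial actually receives (NNI's real trial API for this is `nni.get_next_parameter()`) together with the trial ID, then diff it against what nnimanager.log / the web UI reports for that same trial. A generic diff helper, sketched in plain Python (the two dicts would be parsed from the trial log and the manager log; the `rnn_hidden_size` value here is purely illustrative, while the two `lr` values are the ones quoted above):

```python
def diff_params(reported: dict, received: dict) -> dict:
    """Return {name: (reported_value, received_value)} for every
    hyperparameter whose two values disagree."""
    mismatches = {}
    for name in reported.keys() | received.keys():
        a, b = reported.get(name), received.get(name)
        if a != b:
            mismatches[name] = (a, b)
    return mismatches

# lr values from this report; rnn_hidden_size is a hypothetical matching entry.
reported = {"rnn_hidden_size": 128, "lr": 0.0008698020401037771}  # nnimanager / web UI
received = {"rnn_hidden_size": 128, "lr": 0.0054442307300676335}  # trial stdout
print(diff_params(reported, received))  # only "lr" differs
```

Running this per trial would show exactly which trials and which hyperparameters diverge, which should help narrow down whether the mismatch happens on the manager side or in parameter delivery.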

axinbme commented 6 months ago

I had the same problem.

void-echo commented 4 months ago

Plus one 🤣

WenjieDu commented 1 month ago

Seriously? Is nobody taking care of this high-risk issue?

sertreet commented 3 days ago

Plus one, me too.