microsoft / nni

An open source AutoML toolkit for automate machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License
14.04k stars 1.81k forks source link

ValueError: Out of range float values are not JSON compliant #1589

Closed martsalz closed 4 years ago

martsalz commented 5 years ago

Short summary about the issue/question:

When executing an experiment, the following error message appears for some trials:

ValueError: Out of range float values are not JSON compliant

What's the reason for this?

nni Environment:

Anything else we need to know:

stderr:

Using TensorFlow backend.
Traceback (most recent call last):
  File "test.py", line 164, in <module>
    train(ARGS, RECEIVED_PARAMS)
  File "test.py", line 136, in train
    validation_data=(x_test, y_test), callbacks=[SendMetrics(), TensorBoard(log_dir=TENSORBOARD_DIR)])
  File "/home/msalz/venv_nni_dev/lib64/python3.6/site-packages/keras/engine/training.py", line 1178, in fit
    validation_freq=validation_freq)
  File "/home/msalz/venv_nni_dev/lib64/python3.6/site-packages/keras/engine/training_arrays.py", line 224, in fit_loop
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/home/msalz/venv_nni_dev/lib64/python3.6/site-packages/keras/callbacks.py", line 152, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "test.py", line 119, in on_epoch_end
    nni.report_intermediate_result(logs["val_loss"])
  File "/home/msalz/nni/build/nni/trial.py", line 81, in report_intermediate_result
    'value': metric
  File "/home/msalz/venv_nni_dev/lib64/python3.6/site-packages/json_tricks/nonp.py", line 99, in dumps
    primitives=primitives, fallback_encoders=fallback_encoders, **jsonkwargs).encode(obj)
  File "/usr/lib64/python3.6/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib64/python3.6/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
ValueError: Out of range float values are not JSON compliant
authorName: default
experimentName: example_mnist
maxExecDuration: 2h
maxTrialNum: 1000
trialConcurrency: 2
localConfig:
    useActiveGpu: true
    maxTrialNumPerGpu: 2
#choice: local, remote, pai
trainingServicePlatform: local
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
advisor:
  #choice: Hyperband
  builtinAdvisorName: BOHB
  classArgs:
    min_budget: 4
    max_budget: 32
    #eta: proportion of discarded trials
    eta: 2
    #choice: maximize, minimize
    optimize_mode: minimize
trial:
  command: python3 test.py
  codeDir: .
  gpuNum: 1

image

chicm-ms commented 5 years ago

@martsalz Hi, can you log the value of logs["val_loss"] ? I suspect that the value of logs["val_loss"] is out of range for the failed trail. Sometimes if learning rate is too large, the loss value could be 'nan'.

martsalz commented 4 years ago

Because the experiments take a few hours/days and the error message occurs sporadically in my opinion, the reproducibility is not that easy.....

martsalz commented 4 years ago

@chicm-ms Yes, logs["val_loss"] returns the value nan:

`[11/22/2019, 12:05:09 PM] PRINT - ETA: 0s - loss: nan [11/22/2019, 12:05:09 PM] PRINT 499/500 [============================>.] [11/22/2019, 12:05:09 PM] PRINT - ETA: 0s - loss: nan [11/22/2019, 12:05:13 PM] PRINT 500/500 [==============================] [11/22/2019, 12:05:13 PM] PRINT - 118s 236ms/step - loss: nan - val_loss: nan

[11/22/2019, 12:05:13 PM] ERROR (mnist_keras/MainThread) Out of range float values are not JSON compliant Traceback (most recent call last): File "test.py", line 76, in train(ARGS, RECEIVED_PARAMS) File "test.py", line 63, in train validation_steps=val_steps) File "/home/msalz/venv_nni_dev/lib64/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper return func(*args, kwargs) File "/home/msalz/venv_nni_dev/lib64/python3.6/site-packages/keras/engine/training.py", line 1658, in fit_generator initial_epoch=initial_epoch) File "/home/msalz/venv_nni_dev/lib64/python3.6/site-packages/keras/engine/training_generator.py", line 255, in fit_generator callbacks.on_epoch_end(epoch, epoch_logs) File "/home/msalz/venv_nni_dev/lib64/python3.6/site-packages/keras/callbacks.py", line 152, in on_epoch_end callback.on_epoch_end(epoch, logs) File "test.py", line 41, in on_epoch_end nni.report_intermediate_result(logs["val_loss"]) File "/home/msalz/nni/build/nni/trial.py", line 84, in report_intermediate_result 'value': metric File "/home/msalz/venv_nni_dev/lib64/python3.6/site-packages/json_tricks/nonp.py", line 99, in dumps primitives=primitives, fallback_encoders=fallback_encoders, jsonkwargs).encode(obj) File "/usr/lib64/python3.6/json/encoder.py", line 199, in encode chunks = self.iterencode(o, _one_shot=True) File "/usr/lib64/python3.6/json/encoder.py", line 257, in iterencode return _iterencode(o, 0) ValueError: Out of range float values are not JSON compliant `

martsalz commented 4 years ago

How can this bug be fixed quickly? In my experiment with 173trainings, 65of them failed due to this error. With 8h/model very frustrating.

chicm-ms commented 4 years ago

We are trying to fix this with PR https://github.com/microsoft/nni/pull/1958

Too large learning rate can lead to nan loss value, a quick fix is to check your trial code / search space and set learning rate to a smaller value.

Since the loss value of the failed jobs are nan, the hyper parameter of those jobs won't be the best even if they are not failed.

chicm-ms commented 4 years ago

Closing this issue since the problem is fixed in nni v1.4. @martsalz , you can check our latest nni version.