Closed martsalz closed 4 years ago
@martsalz Hi, can you log the value of logs["val_loss"]? I suspect that it is out of range for the failed trial. If the learning rate is too large, the loss value can become 'nan'.
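A minimal sketch of such a check, assuming `logs` is the plain dict that Keras passes to a callback's `on_epoch_end`. The helper name `describe_metric` is hypothetical and is not part of NNI or Keras.

```python
import math


def describe_metric(logs, key="val_loss"):
    """Return a short, log-friendly description of logs[key].

    Hypothetical helper for debugging: it only inspects the value and
    does not report anything to NNI.
    """
    value = logs.get(key)
    if value is None:
        return f"{key} is missing from logs"
    value = float(value)  # also handles NumPy scalar types
    if math.isnan(value):
        return f"{key} is nan"
    if math.isinf(value):
        return f"{key} is infinite"
    return f"{key} = {value:.6g}"
```

Printing `describe_metric(logs)` at the end of each epoch should make it visible in the trial log whether the loss diverged before the error is raised.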
Because the experiments take a few hours to days and the error seems to occur only sporadically, it is not easy to reproduce.
@chicm-ms Yes, logs["val_loss"] returns the value nan:
[11/22/2019, 12:05:09 PM] PRINT - ETA: 0s - loss: nan
[11/22/2019, 12:05:09 PM] PRINT 499/500 [============================>.]
[11/22/2019, 12:05:09 PM] PRINT - ETA: 0s - loss: nan
[11/22/2019, 12:05:13 PM] PRINT 500/500 [==============================]
[11/22/2019, 12:05:13 PM] PRINT - 118s 236ms/step - loss: nan - val_loss: nan
[11/22/2019, 12:05:13 PM] ERROR (mnist_keras/MainThread) Out of range float values are not JSON compliant
Traceback (most recent call last):
File "test.py", line 76, in
How can this bug be fixed quickly? In my experiment with 173 trainings, 65 of them failed due to this error. At 8h per model, that is very frustrating.
We are trying to fix this with PR https://github.com/microsoft/nni/pull/1958
A learning rate that is too large can lead to a nan loss value. A quick fix is to check your trial code and search space and set the learning rate to a smaller value.
Since the loss value of the failed jobs is nan, the hyperparameters of those jobs would not have been the best even if the jobs had not failed.
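As an additional workaround on the trial side, the metric can be checked before it is reported. The sketch below is only an illustration: the helper name `sanitize_metric` and the fallback value `1e9` are assumptions, and the fallback only makes sense when the tuner is minimizing the metric.

```python
import math


def sanitize_metric(value, fallback=1e9):
    """Return (metric, diverged) for a loss value.

    A finite value is passed through unchanged. nan or inf is replaced
    by `fallback`, a large finite number (an assumption that suits a
    tuner set to minimize), and `diverged` is set to True.
    """
    value = float(value)
    if math.isfinite(value):
        return value, False
    return float(fallback), True


# Possible use inside a Keras callback's on_epoch_end (not run here):
#   metric, diverged = sanitize_metric(logs["val_loss"])
#   nni.report_intermediate_result(metric)
#   if diverged:
#       self.model.stop_training = True  # no point in training further
```

Stopping early on divergence also avoids spending the remaining hours on a trial whose loss is already nan.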
Closing this issue since the problem is fixed in nni v1.4. @martsalz, you can check our latest nni version.
Short summary about the issue/question:
When executing an experiment, the following error message appears for some trials:
ValueError: Out of range float values are not JSON compliant
What's the reason for this?
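For context, this is the message that Python's standard `json` module raises when it is asked to encode nan or infinity with `allow_nan=False`. That the reported metric is serialized this way is inferred from the error text, not confirmed in this thread. A small sketch with a hypothetical helper `is_json_compliant` shows the behaviour:

```python
import json


def is_json_compliant(value):
    """Return True if `value` can be encoded as strict JSON.

    With allow_nan=False, json.dumps raises ValueError for nan, inf and
    -inf, because strict JSON has no representation for them.
    """
    try:
        json.dumps(value, allow_nan=False)
    except ValueError:
        return False
    return True
```

With this helper, `is_json_compliant(float("nan"))` is False while any ordinary finite loss value passes.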
nni Environment:
Anything else we need to know:
stderr: