keras-team / autokeras

AutoML library for deep learning
http://autokeras.com/
Apache License 2.0

tuner.get_best_models crash #1261

Closed perceptualJonathan closed 3 years ago

perceptualJonathan commented 4 years ago

Bug Description

Bug Reproduction

Code for reproducing the bug:

import autokeras as ak

# Initialize the classifier.
maxTrials = 30
clf = ak.StructuredDataClassifier(max_trials=maxTrials, overwrite=True)
# x is the path to the CSV file; y is the name of the column to predict.
# 'Path/To/Train' and 'Path/To/Test' are placeholders for the real file paths.
clf.fit(x='Path/To/Train', y='survived')
# Evaluate the best model found; evaluate() returns the loss and metric values.
evald = clf.evaluate(x='Path/To/Test', y='survived')
print('Evaluation (loss, metrics): {result}'.format(result=evald))

# Try to load every model found during the search.
foundModels = clf.tuner.get_best_models(maxTrials)

# Error raised:
ValueError                                Traceback (most recent call last)
<ipython-input-2-74d6da8b7a7b> in <module>
----> 1 found_models = clf.tuner.get_best_models(30)

/usr/local/lib/python3.7/site-packages/kerastuner/engine/tuner.py in get_best_models(self, num_models)
    256         """
    257         # Method only exists in this class for the docstring override.
--> 258         return super(Tuner, self).get_best_models(num_models)
    259 
    260     def _deepcopy_callbacks(self, callbacks):

/usr/local/lib/python3.7/site-packages/kerastuner/engine/base_tuner.py in get_best_models(self, num_models)
    238         """
    239         best_trials = self.oracle.get_best_trials(num_models)
--> 240         models = [self.load_model(trial) for trial in best_trials]
    241         return models
    242 

/usr/local/lib/python3.7/site-packages/kerastuner/engine/base_tuner.py in <listcomp>(.0)
    238         """
    239         best_trials = self.oracle.get_best_trials(num_models)
--> 240         models = [self.load_model(trial) for trial in best_trials]
    241         return models
    242 

/usr/local/lib/python3.7/site-packages/kerastuner/engine/tuner.py in load_model(self, trial)
    182         with hm_module.maybe_distribute(self.distribution_strategy):
    183             model.load_weights(self._get_checkpoint_fname(
--> 184                 trial.trial_id, best_epoch))
    185         return model
    186 

/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py in load_weights(self, filepath, by_name, skip_mismatch, options)
   2174     else:
   2175       try:
-> 2176         py_checkpoint_reader.NewCheckpointReader(filepath)
   2177         save_format = 'tf'
   2178       except errors_impl.DataLossError:

/usr/local/lib/python3.7/site-packages/tensorflow/python/training/py_checkpoint_reader.py in NewCheckpointReader(filepattern)
     93   """
     94   try:
---> 95     return CheckpointReader(compat.as_bytes(filepattern))
     96   # TODO(b/143319754): Remove the RuntimeError casting logic once we resolve the
     97   # issue with throwing python exceptions from C++.

ValueError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on ./structured_data_classifier/trial_2ff8f5343d67a1cf10c42a77eb2bd081/checkpoints/epoch_33/checkpoint: Not found: ./structured_data_classifier/trial_2ff8f5343d67a1cf10c42a77eb2bd081/checkpoints/epoch_33; No such file or directory
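For anyone hitting the same traceback, one possible workaround is to load the trials one at a time and skip those whose checkpoint cannot be read. This is only a minimal sketch: the helper name `load_available_models` is hypothetical, it relies solely on the `oracle.get_best_trials` and `load_model` calls visible in the traceback above, and it catches exceptions broadly because the exact exception type appears to depend on the TensorFlow version.

```python
def load_available_models(tuner, num_models):
    """Load the best models, skipping trials whose checkpoint cannot be read.

    `tuner` is expected to expose `oracle.get_best_trials(n)` and
    `load_model(trial)`, as seen in the traceback. This helper is a
    hypothetical workaround, not part of AutoKeras or Keras Tuner.
    """
    loaded = []
    skipped = []
    for trial in tuner.oracle.get_best_trials(num_models):
        try:
            loaded.append(tuner.load_model(trial))
        except Exception as err:  # broad on purpose: the error type varies
            # Remember which trial failed and why, then keep going.
            skipped.append((trial.trial_id, err))
    return loaded, skipped
```

With the snippet above, this would be called as `models, skipped = load_available_models(clf.tuner, maxTrials)`; inspecting `skipped` shows which trial directories are missing a checkpoint.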


perceptualJonathan commented 4 years ago

After looking at a few runs of this snippet, it looks like there is an off-by-one error somewhere in the model-loading procedure. Consistently, when epoch X is supposed to be loaded, there is no checkpoint at epoch X, but there is one at epoch X+1. However, after accounting for that issue, I encounter a different error when trying to make predictions. If my test set is xTest, then calling

models = clf.tuner.get_best_models(MAXTRIALS)

for model in models:
    model.predict(xTest)

yields the error

---------------------------------------------------------------------------
UnimplementedError                        Traceback (most recent call last)
<ipython-input-5-c18752c5786e> in <module>
      2 
      3 for model in models:
----> 4     model.predict(xTest)

/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py in _method_wrapper(self, *args, **kwargs)
    128       raise ValueError('{} is not supported in multi-worker mode.'.format(
    129           method.__name__))
--> 130     return method(self, *args, **kwargs)
    131 
    132   return tf_decorator.make_decorator(

/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py in predict(self, x, batch_size, verbose, steps, callbacks, max_queue_size, workers, use_multiprocessing)
   1597           for step in data_handler.steps():
   1598             callbacks.on_predict_batch_begin(step)
-> 1599             tmp_batch_outputs = predict_function(iterator)
   1600             if data_handler.should_sync:
   1601               context.async_wait()

/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py in __call__(self, *args, **kwds)
    778       else:
    779         compiler = "nonXla"
--> 780         result = self._call(*args, **kwds)
    781 
    782       new_tracing_count = self._get_tracing_count()

/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py in _call(self, *args, **kwds)
    844               *args, **kwds)
    845       # If we did not create any variables the trace we have is good enough.
--> 846       return self._concrete_stateful_fn._filtered_call(canon_args, canon_kwds)  # pylint: disable=protected-access
    847 
    848     def fn_with_cond(*inner_args, **inner_kwds):

/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py in _filtered_call(self, args, kwargs, cancellation_manager)
   1846                            resource_variable_ops.BaseResourceVariable))],
   1847         captured_inputs=self.captured_inputs,
-> 1848         cancellation_manager=cancellation_manager)
   1849 
   1850   def _call_flat(self, args, captured_inputs, cancellation_manager=None):

/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
   1922       # No tape is watching; skip to running the function.
   1923       return self._build_call_outputs(self._inference_function.call(
-> 1924           ctx, args, cancellation_manager=cancellation_manager))
   1925     forward_backward = self._select_forward_and_backward_functions(
   1926         args,

/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py in call(self, ctx, args, cancellation_manager)
    548               inputs=args,
    549               attrs=attrs,
--> 550               ctx=ctx)
    551         else:
    552           outputs = execute.execute_with_cancellation(

/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     58     ctx.ensure_initialized()
     59     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
---> 60                                         inputs, attrs, num_outputs)
     61   except core._NotOkStatusException as e:
     62     if name is not None:

UnimplementedError:  Cast double to string is not supported
     [[node functional_1/Cast (defined at <ipython-input-5-c18752c5786e>:4) ]] [Op:__inference_predict_function_9685]

Function call stack:
predict_function

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

perceptualJonathan commented 3 years ago

@haifeng-jin any suggestions? Thanks.

haifeng-jin commented 3 years ago

I really don't know. Could you share your code in a Colab notebook so that I can reproduce the error?

perceptualJonathan commented 3 years ago

If you take the snippet I put at the beginning and use whatever dataset you want (I used the Titanic dataset in this snippet), that should be sufficient to reproduce the error.

haifeng-jin commented 3 years ago

OK, I found out why this error occurs. I have submitted a PR to Keras Tuner: https://github.com/keras-team/keras-tuner/pull/424. It fixes the "No such file or directory" error.

As for the error "UnimplementedError: Cast double to string is not supported", you can try AutoKeras 1.0.10. If the error persists, please cast your data to a NumPy array with a string dtype before passing it to the predict function.

It should solve the problem.
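The suggested cast could look roughly like the sketch below. The helper name `as_string_array` and the sample values are illustrative assumptions, and whether the cast actually clears the error will depend on the AutoKeras and TensorFlow versions in use.

```python
import numpy as np


def as_string_array(data):
    """Return a copy of `data` as a NumPy array with a string dtype."""
    return np.asarray(data).astype(str)


# Illustrative stand-in for the real test set.
x_test = np.array([[1.5, 2.0], [3.25, 4.0]])
x_test_str = as_string_array(x_test)
print(x_test_str.dtype.kind)  # 'U' marks a unicode string dtype
```

Each loaded model would then be called as `model.predict(as_string_array(xTest))` instead of `model.predict(xTest)`.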

perceptualJonathan commented 3 years ago

@haifeng-jin thanks again. Unfortunately, it looks like there's still an issue kicking around. After running


from tensorflow.keras.callbacks import Callback, EarlyStopping, ModelCheckpoint
from tensorflow.python.platform import tf_logging as logging
from sklearn import datasets, svm, metrics
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, cohen_kappa_score, matthews_corrcoef, log_loss
import matplotlib.pyplot as plt, pandas as pd, numpy as np, matplotlib as mpl
import requests, time, re, os, subprocess, json, sys, datetime

import autokeras as ak
import keras.backend as K

from IPython.display import clear_output
%matplotlib inline

x, y = datasets.load_breast_cancer(return_X_y=True)
xTrain, xTest, yTrain, yTest = train_test_split(x, y, random_state=0)

MAXTRIALS = 600
inputNode = ak.StructuredDataInput()
outputNode = ak.StructuredDataBlock(categorical_encoding=True)(inputNode)
outputNode = ak.ClassificationHead(num_classes=2)(outputNode)

clf = ak.AutoModel(
    inputNode, 
    outputNode, 
    overwrite=True,
    objective="val_loss",
    max_trials=MAXTRIALS)

clf.fit(xTrain, yTrain) #metrics=[matthewsCorrelation],

predictedClasses = clf.predict(xTest)

cm = confusion_matrix(yTest, predictedClasses)

print(cm)

matthew = matthews_corrcoef(yTest, predictedClasses)

print('MCC:', matthew)

I run

models = clf.tuner.get_best_models(num_models=66) #Or whatever the actual number of trials is
len(models)

and this yields the error

WARNING:tensorflow:Layer multi_category_encoding is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because its dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

WARNING:tensorflow:Layer multi_category_encoding is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because its dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.iter
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.beta_1
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.beta_2
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.decay
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.learning_rate
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/usr/local/lib/python3.8/site-packages/tensorflow/python/training/py_checkpoint_reader.py in NewCheckpointReader(filepattern)
     94   try:
---> 95     return CheckpointReader(compat.as_bytes(filepattern))
     96   # TODO(b/143319754): Remove the RuntimeError casting logic once we resolve the

RuntimeError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./auto_model/trial_3a5d4cfec8ed6d7bd8418be44da0b7af/checkpoints/epoch_5/checkpoint

During handling of the above exception, another exception occurred:

NotFoundError                             Traceback (most recent call last)
<ipython-input-4-36dfe1721b50> in <module>
----> 1 models = clf.tuner.get_best_models(num_models=10)
      2 len(models)

/usr/local/lib/python3.8/site-packages/kerastuner/engine/tuner.py in get_best_models(self, num_models)
    263         """
    264         # Method only exists in this class for the docstring override.
--> 265         return super(Tuner, self).get_best_models(num_models)
    266 
    267     def _deepcopy_callbacks(self, callbacks):

/usr/local/lib/python3.8/site-packages/kerastuner/engine/base_tuner.py in get_best_models(self, num_models)
    238         """
    239         best_trials = self.oracle.get_best_trials(num_models)
--> 240         models = [self.load_model(trial) for trial in best_trials]
    241         return models
    242 

/usr/local/lib/python3.8/site-packages/kerastuner/engine/base_tuner.py in <listcomp>(.0)
    238         """
    239         best_trials = self.oracle.get_best_trials(num_models)
--> 240         models = [self.load_model(trial) for trial in best_trials]
    241         return models
    242 

/usr/local/lib/python3.8/site-packages/kerastuner/engine/tuner.py in load_model(self, trial)
    188         best_epoch = trial.best_step
    189         with hm_module.maybe_distribute(self.distribution_strategy):
--> 190             model.load_weights(self._get_checkpoint_fname(
    191                 trial.trial_id, best_epoch))
    192         return model

/usr/local/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py in load_weights(self, filepath, by_name, skip_mismatch, options)
   2174     else:
   2175       try:
-> 2176         py_checkpoint_reader.NewCheckpointReader(filepath)
   2177         save_format = 'tf'
   2178       except errors_impl.DataLossError:

/usr/local/lib/python3.8/site-packages/tensorflow/python/training/py_checkpoint_reader.py in NewCheckpointReader(filepattern)
     97   # issue with throwing python exceptions from C++.
     98   except RuntimeError as e:
---> 99     error_translator(e)

/usr/local/lib/python3.8/site-packages/tensorflow/python/training/py_checkpoint_reader.py in error_translator(e)
     33       'Failed to find any '
     34       'matching files for') in error_message:
---> 35     raise errors_impl.NotFoundError(None, None, error_message)
     36   elif 'Sliced checkpoints are not supported' in error_message or (
     37       'Data type '

NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./auto_model/trial_3a5d4cfec8ed6d7bd8418be44da0b7af/checkpoints/epoch_5/checkpoint

At first I thought this was just the off-by-one error from before, but when I looked for the file, I found that the "epoch_5" folder exists and is strangely empty. The other epoch folders do contain checkpoint files.
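To check how widespread the empty folders are, the project directory can be scanned for them. This is a minimal sketch that assumes the `trial_*/checkpoints/epoch_*` layout visible in the error message; the function name `find_empty_checkpoint_dirs` is hypothetical.

```python
from pathlib import Path


def find_empty_checkpoint_dirs(project_dir):
    """Return the epoch_* checkpoint directories that contain no files.

    Assumes the layout <project_dir>/trial_<id>/checkpoints/epoch_<n>/
    seen in the error message; other layouts are not handled.
    """
    empty = []
    pattern = "trial_*/checkpoints/epoch_*"
    for epoch_dir in sorted(Path(project_dir).glob(pattern)):
        # A directory with no entries at all has no checkpoint to load.
        if epoch_dir.is_dir() and not any(epoch_dir.iterdir()):
            empty.append(epoch_dir)
    return empty
```

For the run above, this would be called as `find_empty_checkpoint_dirs("./auto_model")`; any path it returns points at an epoch that the tuner cannot load.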

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.