keras-team / autokeras

AutoML library for deep learning
http://autokeras.com/
Apache License 2.0
9.15k stars 1.4k forks source link

Example code not working - MPG example #1163

Closed KirkDCO closed 3 years ago

KirkDCO commented 4 years ago

Bug Description

Trying to get started using AutoKeras and finding that most of the example code does not work.

Bug Reproduction

Running the example here: https://autokeras.com/tutorial/structured_data_regression/

Setup Details

Include the details about the versions of:

Error


ValueError Traceback (most recent call last)

in 5 # Evaluate the accuracy of the found model. 6 print('Accuracy: {accuracy}'.format( ----> 7 accuracy=regressor.evaluate(x=test_dataset.drop(columns=['MPG']), y=test_dataset['MPG']))) ~/opt/miniconda2/envs/AutoML/lib/python3.8/site-packages/autokeras-1.0.3-py3.8.egg/autokeras/tasks/structured_data.py in evaluate(self, x, y, batch_size, **kwargs) 133 if isinstance(x, str): 134 x, y = self._read_from_csv(x, y) --> 135 return super().evaluate(x=x, 136 y=y, 137 batch_size=batch_size, ~/opt/miniconda2/envs/AutoML/lib/python3.8/site-packages/autokeras-1.0.3-py3.8.egg/autokeras/auto_model.py in evaluate(self, x, y, **kwargs) 443 """ 444 dataset = self._process_xy(x, y, False) --> 445 return self.tuner.get_best_model().evaluate(x=dataset, **kwargs) 446 447 def export_model(self): ~/opt/miniconda2/envs/AutoML/lib/python3.8/site-packages/autokeras-1.0.3-py3.8.egg/autokeras/engine/tuner.py in get_best_model(self) 43 44 def get_best_model(self): ---> 45 model = super().get_best_models()[0] 46 model.load_weights(self.best_model_path) 47 return model ~/opt/miniconda2/envs/AutoML/lib/python3.8/site-packages/kerastuner/engine/tuner.py in get_best_models(self, num_models) 229 """ 230 # Method only exists in this class for the docstring override. --> 231 return super(Tuner, self).get_best_models(num_models) 232 233 def _deepcopy_callbacks(self, callbacks): ~/opt/miniconda2/envs/AutoML/lib/python3.8/site-packages/kerastuner/engine/base_tuner.py in get_best_models(self, num_models) 236 """ 237 best_trials = self.oracle.get_best_trials(num_models) --> 238 models = [self.load_model(trial) for trial in best_trials] 239 return models 240 ~/opt/miniconda2/envs/AutoML/lib/python3.8/site-packages/kerastuner/engine/base_tuner.py in (.0) 236 """ 237 best_trials = self.oracle.get_best_trials(num_models) --> 238 models = [self.load_model(trial) for trial in best_trials] 239 return models 240 ~/opt/miniconda2/envs/AutoML/lib/python3.8/site-packages/kerastuner/engine/tuner.py in load_model(self, trial) 154 best_epoch = trial.best_step 155 with hm_module.maybe_distribute(self.distribution_strategy): --> 156 model.load_weights(self._get_checkpoint_fname( 157 trial.trial_id, best_epoch)) 158 return model ~/opt/miniconda2/envs/AutoML/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py in load_weights(self, filepath, by_name, skip_mismatch) 248 raise ValueError('Load weights is not yet supported with TPUStrategy ' 249 'with steps_per_run greater than 1.') --> 250 return super(Model, self).load_weights(filepath, by_name, skip_mismatch) 251 252 def compile(self, ~/opt/miniconda2/envs/AutoML/lib/python3.8/site-packages/tensorflow/python/keras/engine/network.py in load_weights(self, filepath, by_name, skip_mismatch) 1229 else: 1230 try: -> 1231 py_checkpoint_reader.NewCheckpointReader(filepath) 1232 save_format = 'tf' 1233 except errors_impl.DataLossError: ~/opt/miniconda2/envs/AutoML/lib/python3.8/site-packages/tensorflow/python/training/py_checkpoint_reader.py in NewCheckpointReader(filepattern) 93 """ 94 try: ---> 95 return CheckpointReader(compat.as_bytes(filepattern)) 96 # TODO(b/143319754): Remove the RuntimeError casting logic once we resolve the 97 # issue with throwing python exceptions from C++. ValueError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on ./structured_data_regressor/trial_a2b0718070dcd1d815fe093a8ebb90ab/checkpoints/epoch_52/checkpoint: Not found: ./structured_data_regressor/trial_a2b0718070dcd1d815fe093a8ebb90ab/checkpoints/epoch_52; No such file or directory
achaar commented 4 years ago

Hello @KirkDCO , Can you please try again using the method mentioned in : https://github.com/achaar/autokeras/blob/img_reg_tutorial/docs/templates/tutorial/image_regression.ipynb

Please update the AutoKeras-tuner also.

Thanks.

KirkDCO commented 4 years ago

I'm not sure I understand what method you're referring to. The code you linked uses an ImageRegressor and is very focused around that concept instead of a StructuredDataRegression.

Regarding updating AutoKeras-tuner, are you referring to keras-tuner (https://github.com/keras-team/keras-tuner)? The example pulls from a specific commit which seems very unmaintainable.

achaar commented 4 years ago

Yes, you can use the specific commit for the time being. The issue will be resolved in the next release.

I was referring to the installation procedure for AutoKeras and keras-tuner.

KirkDCO commented 4 years ago

Ah, that makes more sense.

I have installed that version of keras-tuner, am pulling from master on autokeras, and have tf 2.2.0. Still the same error.

I also tried keras-tuner/master as described in an e-mail from Haifeng Jin earlier today. Same error there, too.

achaar commented 4 years ago

Okay, then we'll try to resolve this and get back as soon as possible.

pingusix commented 4 years ago

I am seeing this as well with structured_data_classifier after trying the various combinations of code mentioned earlier. I get the error a lot of the time though not always and when it fails I can verify that the checkpoint does not exist under the trials_xx/checkpoints/epoch_yy path. The directory parameter is specified as an absolute path and grep doesn't seem to show that the file has been saved somewhere else as far as I can see.

chyt commented 4 years ago

I am seeing this as well using ImageClassifier on Google Colab connected to Google Drive. Usually everything is fine with a smaller number of trials/epochs. However when I increase this number (right now I am doing 10 trials and 200 epochs) I get the error and the epoch directory does not exist. This seems to be a bit different of an error than what @pingusix is experiencing, since he seems to indicate that the epoch directory does exist.

I was wondering if it could have something to do with getting disconnected from Google Colab, but based on the reports here it seems like it is not the case.

I'm using autokeras 1.0.3, keras tuner 1.0.2rc0, and tensorflow 2.2.0.

haifeng-jin commented 4 years ago

Originally, this error is because Keras Tuner deletes the old checkpoints saved on disk to reduce the disk usage. However, we have fixed this in a recent pull request (https://github.com/keras-team/keras-tuner/pull/318). Not sure why still exists.

@chyt Would you share a colab notebook for the reproduction? Thanks.

chyt commented 4 years ago

Originally, this error is because Keras Tuner deletes the old checkpoints saved on disk to reduce the disk usage. However, we have fixed this in a recent pull request (keras-team/keras-tuner#318). Not sure why still exists.

@chyt Would you share a colab notebook for the reproduction? Thanks.

How would I share this notebook with you? The code is pretty straightforward, but the dataset is loaded from Google Drive.

Edit: here is a Gist of the notebook: https://gist.github.com/chyt/79e2f9de030c11e990af8595e7da631b

You can see the error near the bottom of the output:

ValueError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on /content/drive/My Drive/image_classifier/trial_6a080628b2d822154a54edc6187850b6/checkpoints/epoch_67/checkpoint: Not found: /content/drive/My Drive/image_classifier/trial_6a080628b2d822154a54edc6187850b6/checkpoints/epoch_67; No such file or directory
mreismendes commented 4 years ago

I've been having the same problem quoted by @chyt! image

haifeng-jin commented 4 years ago

@chyt It seems you are facing multiple bugs while using AutoKeras. It would be great if we can schedule an one hour meeting to help you debug. We can use half of the meeting to debug and half of the meeting for you to answer some of our user study questions. You can join our slack and message me your email address. Thank you!

ricwo commented 4 years ago

@haifeng-jin I'm having the same issue. ak is trying to restore epoch 2, which was not persisted:

ValueError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on /data/auto-clf-20-epochs-4-trials/trial_479975cd710064c8120d0f95a2c0b230/checkpoints/epoch_2/checkpoint: Not found: /data/auto-clf-20-epochs-4-trials/trial_479975cd710064c8120d0f95a2c0b230/checkpoints/epoch_2; No such file or directory

However epoch 2 was not persisted during training:

ls /data/auto-clf-20-epochs-4-trials/trial_479975cd710064c8120d0f95a2c0b230/checkpoints
epoch_0  epoch_10  epoch_11  epoch_12  epoch_3  epoch_4  epoch_5  epoch_6  epoch_7  epoch_8  epoch_9

Is there anything I can do to help fix this issue?

haifeng-jin commented 4 years ago

@ricwo Are you using overwrite=False? If you let overwrite=True, I think the error would gone.

ricwo commented 4 years ago

@haifeng-jin setting overwrite=False indeed fixed the issue. Thanks for your help!

sunseaandpalms commented 4 years ago

Is overwrite=True not working yet? Any stable solution to continue the seeking/training after being stopped?

haifeng-jin commented 4 years ago

We have fixed the problem in the master branch. You can use the master branch with keras-tuner 1.0.2rc1 tag.

jacobswain commented 4 years ago

I'm on 1.0.2rc1 and I'm getting the same error

haifeng-jin commented 4 years ago

We have just had a new release 1.0.5. I expect use it with TF 2.3.0 can fix the bug. We have fixed another bug in that might cause the problem so it should work this time.

swapnil96 commented 4 years ago

Getting error ValueError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on ./structured_data_regressor/trial_869518eb879f942102cf397ccdff8288/checkpoints/epoch_2/checkpoint: Not found: ./structured_data_regressor/trial_869518eb879f942102cf397ccdff8288/checkpoints/epoch_2; No such file or directory

ls structured_data_regressor/trial_869518eb879f942102cf397ccdff8288/checkpoints/ epoch_0 epoch_10 epoch_11 epoch_12 epoch_3 epoch_4 epoch_5 epoch_6 epoch_7 epoch_8 epoch_9

import autokeras as ak ak.version '1.0.8' import tensorflow as tf tf.version '2.3.0' import kerastuner kerastuner.version '1.0.2rc1'

q-55555 commented 4 years ago

Same problem on my side : tensorflow 2.3.1, autokeras 1.0.9, keras-tuner 1.0.2rc2. Impossible to use autokeras... Any workaround ?

haifeng-jin commented 4 years ago

I cannot reproduce the error. Is there any colab examples that reproduces the error? Thanks.

q-55555 commented 4 years ago

Here is the colab to reproduce the error: https://colab.research.google.com/drive/1jOzTuL26UZaISnSZkUD273yZ1cqrX1QL?usp=sharing

The error seems to appear when patience is higher than the variable self._save_n_checkpoints (hardcoded => 10) of kerastuner/engine/tuner.py: See callbacks=[tf.keras.callbacks.EarlyStopping(training_objective, patience=50)] in the colab.

With this configuration, for an unknown reason, the best checkpoint is removed and then when autokeras is looking for it, it crashes. It may be a problem with the function "save_model(self, trial_id, model, step=0)" defined in kerastuner/engine/tuner.py

A hack would be to increase self._save_n_checkpoints to patience.

mh-hassan18 commented 4 years ago

Facing the same problem, here are the version details: tensorflow 2.3.1 autokeras 1.0.9 keras-tuner 1.0.2rc2

haifeng-jin commented 4 years ago

The fix is in a pending pull request to keras tuner. https://github.com/keras-team/keras-tuner/pull/424 We will have a new tag for kerastuner after this one is merged.

mh-hassan18 commented 4 years ago

How long will it take to fix the bug?

haifeng-jin commented 3 years ago

This bug has been fixed in Keras Tuner 1.0.2.

slimebob1975 commented 1 year ago

Encountered a similar error in tensorflow 2.12.0 using scikeras example code:

https://github.com/adriangb/scikeras/issues/291