Closed KirkDCO closed 3 years ago
Hello @KirkDCO, can you please try again using the method mentioned in: https://github.com/achaar/autokeras/blob/img_reg_tutorial/docs/templates/tutorial/image_regression.ipynb
Please update the AutoKeras-tuner also.
Thanks.
I'm not sure I understand what method you're referring to. The code you linked uses an ImageRegressor and is very focused around that concept instead of a StructuredDataRegression.
Regarding updating AutoKeras-tuner, are you referring to keras-tuner (https://github.com/keras-team/keras-tuner)? The example pulls from a specific commit which seems very unmaintainable.
Yes, you can use the specific commit for the time being. The issue will be resolved in the next release.
I was referring to the installation procedure for AutoKeras and keras-tuner.
Ah, that makes more sense.
I have installed that version of keras-tuner, am pulling from master on autokeras, and have tf 2.2.0. Still the same error.
I also tried keras-tuner/master as described in an e-mail from Haifeng Jin earlier today. Same error there, too.
Okay, then we'll try to resolve this and get back as soon as possible.
I am seeing this as well with structured_data_classifier after trying the various combinations of code mentioned earlier. I get the error a lot of the time though not always and when it fails I can verify that the checkpoint does not exist under the trials_xx/checkpoints/epoch_yy path. The directory parameter is specified as an absolute path and grep doesn't seem to show that the file has been saved somewhere else as far as I can see.
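The check described above (confirming that a checkpoint is missing under trial_xx/checkpoints/epoch_yy) can be scripted. The following is a minimal sketch that assumes the directory layout visible in the error messages in this thread; the helper name missing_epochs is made up for illustration and is not part of AutoKeras or Keras Tuner.

```python
import re
from pathlib import Path
from typing import List

# Matches checkpoint directory names such as "epoch_12".
EPOCH_DIR = re.compile(r"^epoch_(\d+)$")


def missing_epochs(trial_dir: str) -> List[int]:
    """Return epoch numbers that have no directory under <trial_dir>/checkpoints.

    Only gaps between epoch 0 and the highest epoch found are reported.
    Hypothetical helper; the layout is assumed from the error messages.
    """
    checkpoints = Path(trial_dir) / "checkpoints"
    if not checkpoints.is_dir():
        return []
    present = set()
    for entry in checkpoints.iterdir():
        match = EPOCH_DIR.match(entry.name)
        if match and entry.is_dir():
            present.add(int(match.group(1)))
    if not present:
        return []
    return [epoch for epoch in range(max(present) + 1) if epoch not in present]
```

Calling missing_epochs on each trial_* directory under the project directory shows which trials have gaps in their saved epochs.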
I am seeing this as well using ImageClassifier on Google Colab connected to Google Drive. Usually everything is fine with a smaller number of trials/epochs. However, when I increase this number (right now I am doing 10 trials and 200 epochs) I get the error and the epoch directory does not exist. This seems to be a slightly different error from the one @pingusix is experiencing, since he seems to indicate that the epoch directory does exist.
I was wondering if it could have something to do with getting disconnected from Google Colab, but based on the reports here it seems like it is not the case.
I'm using autokeras 1.0.3, keras tuner 1.0.2rc0, and tensorflow 2.2.0.
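Since the behaviour reported in this thread depends on the exact combination of package versions, it can help to collect them programmatically when reporting. This is a minimal sketch using only the standard library; the distribution names in PACKAGES are assumed to be the ones used on PyPI, and installed_versions is a hypothetical helper name.

```python
from importlib import metadata
from typing import Dict, Optional

# Distribution names as published on PyPI (assumed; adjust if your install differs,
# for example a CPU-only TensorFlow build distributed under another name).
PACKAGES = ("autokeras", "keras-tuner", "tensorflow")


def installed_versions(packages=PACKAGES) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```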
Originally, this error occurred because Keras Tuner deletes old checkpoints saved on disk to reduce disk usage. However, we fixed this in a recent pull request (https://github.com/keras-team/keras-tuner/pull/318). I am not sure why it still exists.
@chyt Would you share a colab notebook for the reproduction? Thanks.
How would I share this notebook with you? The code is pretty straightforward, but the dataset is loaded from Google Drive.
Edit: here is a Gist of the notebook: https://gist.github.com/chyt/79e2f9de030c11e990af8595e7da631b
You can see the error near the bottom of the output:
ValueError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on /content/drive/My Drive/image_classifier/trial_6a080628b2d822154a54edc6187850b6/checkpoints/epoch_67/checkpoint: Not found: /content/drive/My Drive/image_classifier/trial_6a080628b2d822154a54edc6187850b6/checkpoints/epoch_67; No such file or directory
I've been having the same problem quoted by @chyt!
@chyt It seems you are facing multiple bugs while using AutoKeras. It would be great if we could schedule a one-hour meeting to help you debug. We can use half of the meeting to debug and the other half for you to answer some of our user study questions. You can join our Slack and message me your email address. Thank you!
@haifeng-jin I'm having the same issue. ak is trying to restore epoch 2, which was not persisted:
ValueError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on /data/auto-clf-20-epochs-4-trials/trial_479975cd710064c8120d0f95a2c0b230/checkpoints/epoch_2/checkpoint: Not found: /data/auto-clf-20-epochs-4-trials/trial_479975cd710064c8120d0f95a2c0b230/checkpoints/epoch_2; No such file or directory
However epoch 2 was not persisted during training:
ls /data/auto-clf-20-epochs-4-trials/trial_479975cd710064c8120d0f95a2c0b230/checkpoints
epoch_0 epoch_10 epoch_11 epoch_12 epoch_3 epoch_4 epoch_5 epoch_6 epoch_7 epoch_8 epoch_9
Is there anything I can do to help fix this issue?
@ricwo Are you using overwrite=False? If you set overwrite=True, I think the error would go away.
@haifeng-jin setting overwrite=False indeed fixed the issue. Thanks for your help!
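For reference, overwrite is a constructor argument of the AutoKeras task classes. According to the AutoKeras documentation, overwrite=False (the default) reloads an existing project of the same name if one is found, while overwrite=True overwrites it and starts the search again. A minimal sketch, assuming a StructuredDataRegressor; the helper name make_regressor and the default values for max_trials are made up for illustration.

```python
def make_regressor(directory, max_trials=4, overwrite=True):
    """Build a StructuredDataRegressor that does not reuse an earlier project.

    Hypothetical helper: `directory` and `max_trials` are illustrative values.
    """
    # Imported inside the function so this module loads even where
    # AutoKeras is not installed.
    import autokeras as ak

    return ak.StructuredDataRegressor(
        directory=directory,    # where trial folders and checkpoints are written
        max_trials=max_trials,  # number of candidate models to try
        overwrite=overwrite,    # True: discard any existing project of the same name
    )
```

Note that overwrite=True discards earlier trials, so it does not help with resuming an interrupted search.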
Is overwrite=True not working yet? Is there any stable solution for continuing the search/training after it has been stopped?
We have fixed the problem in the master branch. You can use the master branch with the keras-tuner 1.0.2rc1 tag.
I'm on 1.0.2rc1 and I'm getting the same error
We have just released version 1.0.5. I expect that using it with TF 2.3.0 will fix the bug. We have fixed another bug that might have caused the problem, so it should work this time.
Getting error
ValueError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on ./structured_data_regressor/trial_869518eb879f942102cf397ccdff8288/checkpoints/epoch_2/checkpoint: Not found: ./structured_data_regressor/trial_869518eb879f942102cf397ccdff8288/checkpoints/epoch_2; No such file or directory
ls structured_data_regressor/trial_869518eb879f942102cf397ccdff8288/checkpoints/
epoch_0 epoch_10 epoch_11 epoch_12 epoch_3 epoch_4 epoch_5 epoch_6 epoch_7 epoch_8 epoch_9
import autokeras as ak
ak.__version__
'1.0.8'
import tensorflow as tf
tf.__version__
'2.3.0'
import kerastuner
kerastuner.__version__
'1.0.2rc1'
Same problem on my side: tensorflow 2.3.1, autokeras 1.0.9, keras-tuner 1.0.2rc2. Impossible to use autokeras... Any workaround?
I cannot reproduce the error. Is there a colab example that reproduces it? Thanks.
Here is the colab to reproduce the error: https://colab.research.google.com/drive/1jOzTuL26UZaISnSZkUD273yZ1cqrX1QL?usp=sharing
The error seems to appear when patience is higher than the variable self._save_n_checkpoints (hardcoded to 10) in kerastuner/engine/tuner.py. See callbacks=[tf.keras.callbacks.EarlyStopping(training_objective, patience=50)] in the colab.
With this configuration, for an unknown reason, the best checkpoint is removed and then when autokeras is looking for it, it crashes. It may be a problem with the function "save_model(self, trial_id, model, step=0)" defined in kerastuner/engine/tuner.py
A hack would be to increase self._save_n_checkpoints to patience.
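The interaction between patience and the number of retained checkpoints can be illustrated with a toy model. This is a sketch under stated assumptions, not Keras Tuner's actual code: it assumes one checkpoint is written per epoch, that after each epoch greater than keep_last the checkpoint from keep_last epochs earlier is deleted, and that early stopping ends training patience epochs after the best epoch. The function names are made up for illustration.

```python
from typing import List


def surviving_checkpoints(last_epoch: int, keep_last: int = 10) -> List[int]:
    """Epochs whose checkpoints remain after training through `last_epoch`.

    Toy retention policy (an assumption, not the library's code): once the
    current epoch exceeds `keep_last`, delete the checkpoint written
    `keep_last` epochs earlier.
    """
    saved = set()
    for epoch in range(last_epoch + 1):
        saved.add(epoch)
        if epoch > keep_last:
            saved.discard(epoch - keep_last)
    return sorted(saved)


def best_checkpoint_survives(best_epoch: int, patience: int, keep_last: int = 10) -> bool:
    """Whether the best epoch's checkpoint still exists when early stopping fires."""
    # Assumed: training halts exactly `patience` epochs after the best epoch.
    last_epoch = best_epoch + patience
    return best_epoch in surviving_checkpoints(last_epoch, keep_last)
```

With keep_last=10 and training that stops at epoch 12, the model leaves epochs 0 and 3 to 12, which matches the ls output posted earlier in this thread, and a best checkpoint at epoch 2 is gone. In this toy model the best checkpoint survives only when keep_last is larger than patience (or when no deletion has started yet); whether the same holds for the real library is not verified here.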
Facing the same problem, here are the version details: tensorflow 2.3.1 autokeras 1.0.9 keras-tuner 1.0.2rc2
The fix is in a pending pull request to keras tuner. https://github.com/keras-team/keras-tuner/pull/424 We will have a new tag for kerastuner after this one is merged.
How long will it take to fix the bug?
This bug has been fixed in Keras Tuner 1.0.2.
Encountered a similar error in tensorflow 2.12.0 using scikeras example code:
Bug Description
Trying to get started using AutoKeras and finding that most of the example code does not work.
Bug Reproduction
Running the example here: https://autokeras.com/tutorial/structured_data_regression/
Setup Details
Include the details about the versions of:
Error
ValueError Traceback (most recent call last)