Bug Description
When I use ImageClassifier in AutoKeras 2.0.0 without setting max_trials=1, the whole pipeline crashes with CUDA_ERROR_ILLEGAL_ADDRESS and CUDA_ERROR_INVALID_HANDLE errors. The first trial completes fine, but the run crashes as soon as the second trial starts.
Bug Reproduction
Code for reproducing the bug:
It's just boilerplate image classifier code:
import autokeras as ak

# data_dir and args come from the script's argument parsing (not shown).
batch_size = 10
img_height = 99674 #was 29303965
img_width = 6

train_data = ak.image_dataset_from_directory(
    data_dir,
    # Use 20% of the data as testing data.
    validation_split=0.2,
    subset="training",
    # Set seed to ensure the same split when loading testing data.
    seed=0,
    image_size=(img_height, img_width),
    batch_size=batch_size,
)

test_data = ak.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="validation",
    seed=0,
    image_size=(img_height, img_width),
    batch_size=batch_size,
)

clf = ak.ImageClassifier(
    num_classes=2,
    loss="auc",
    directory=args.out_model,
    seed=0,
)
clf.fit(x=train_data, validation_data=test_data)
Output from training:
2024-04-24 14:40:09.349858: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-24 14:40:12.967614: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Namespace(files='/scratch/groups/hanleeji/CREST_images/chunk_100k/chunk001/', out_model='/scratch/groups/hanleeji/CREST_images/models/chunk001_ak/')
2024-04-24 14:40:20.850001: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1928] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38367 MB memory: -> device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:8b:00.0, compute capability: 8.0
2024-04-24 14:40:24.193006: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2024-04-24 14:40:31.741614: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
Found 833 files belonging to 2 classes.
Using 667 files for training.
Found 833 files belonging to 2 classes.
Using 166 files for validation.
Search: Running Trial #1
Value |Best Value So Far |Hyperparameter
vanilla |vanilla |image_block_1/block_type
True |True |image_block_1/normalize
False |False |image_block_1/augment
3 |3 |image_block_1/conv_block_1/kernel_size
1 |1 |image_block_1/conv_block_1/num_blocks
2 |2 |image_block_1/conv_block_1/num_layers
True |True |image_block_1/conv_block_1/max_pooling
False |False |image_block_1/conv_block_1/separable
0.25 |0.25 |image_block_1/conv_block_1/dropout
32 |32 |image_block_1/conv_block_1/filters_0_0
64 |64 |image_block_1/conv_block_1/filters_0_1
flatten |flatten |classification_head_1/spatial_reduction_1/reduction_type
0.5 |0.5 |classification_head_1/dropout
adam |adam |optimizer
0.001 |0.001 |learning_rate
Epoch 1/1000
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1713994833.007195 84052 service.cc:145] XLA service 0x7f1b04003600 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1713994833.007327 84052 service.cc:153] StreamExecutor device (0): NVIDIA A100-PCIE-40GB, Compute Capability 8.0
2024-04-24 14:40:33.063821: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-04-24 14:40:33.754554: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:465] Loaded cuDNN version 8907
I0000 00:00:1713994847.238606 84052 device_compiler.h:188] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
67/67 ━━━━━━━━━━━━━━━━━━━━ 29s 209ms/step - accuracy: 0.5139 - loss: 7.1502 - val_accuracy: 0.6145 - val_loss: 0.6860
Epoch 2/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 6s 91ms/step - accuracy: 0.9425 - loss: 0.2806 - val_accuracy: 0.5964 - val_loss: 0.7732
Epoch 3/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 90ms/step - accuracy: 1.0000 - loss: 0.0206 - val_accuracy: 0.6024 - val_loss: 0.9873
Epoch 4/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 6s 88ms/step - accuracy: 1.0000 - loss: 0.0035 - val_accuracy: 0.6446 - val_loss: 1.6338
Epoch 5/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 94ms/step - accuracy: 1.0000 - loss: 0.0016 - val_accuracy: 0.6386 - val_loss: 1.4921
Epoch 6/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 92ms/step - accuracy: 1.0000 - loss: 6.3239e-04 - val_accuracy: 0.6446 - val_loss: 1.6766
Epoch 7/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 91ms/step - accuracy: 1.0000 - loss: 4.7770e-04 - val_accuracy: 0.6446 - val_loss: 1.7681
Epoch 8/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 91ms/step - accuracy: 1.0000 - loss: 2.8391e-04 - val_accuracy: 0.6446 - val_loss: 1.8877
Epoch 9/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 92ms/step - accuracy: 1.0000 - loss: 2.8112e-04 - val_accuracy: 0.6386 - val_loss: 1.9450
Epoch 10/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 93ms/step - accuracy: 1.0000 - loss: 1.9816e-04 - val_accuracy: 0.6386 - val_loss: 2.0117
Epoch 11/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 93ms/step - accuracy: 1.0000 - loss: 1.6838e-04 - val_accuracy: 0.6386 - val_loss: 2.0466
2024-04-24 14:42:12.680198: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
Trial 1 Complete [00h 01m 41s]
val_loss: 0.6860103011131287
Best val_loss So Far: 0.6860103011131287
Total elapsed time: 00h 01m 41s
Search: Running Trial #2
Value |Best Value So Far |Hyperparameter
resnet |vanilla |image_block_1/block_type
True |True |image_block_1/normalize
True |False |image_block_1/augment
True |None |image_block_1/image_augmentation_1/horizontal_flip
True |None |image_block_1/image_augmentation_1/vertical_flip
0 |None |image_block_1/image_augmentation_1/contrast_factor
0 |None |image_block_1/image_augmentation_1/rotation_factor
0.1 |None |image_block_1/image_augmentation_1/translation_factor
0 |None |image_block_1/image_augmentation_1/zoom_factor
False |None |image_block_1/res_net_block_1/pretrained
resnet50 |None |image_block_1/res_net_block_1/version
True |None |image_block_1/res_net_block_1/imagenet_size
global_avg |flatten |classification_head_1/spatial_reduction_1/reduction_type
0 |0.5 |classification_head_1/dropout
adam |adam |optimizer
0.001 |0.001 |learning_rate
Epoch 1/1000
2024-04-24 14:42:37.904335: W tensorflow/core/kernels/gpu_utils.cc:68] Failed to allocate memory for convolution redzone checking; skipping this check. This is benign and only means that we won't check cudnn for out-of-bounds reads and writes. This message will only be printed once.
2024-04-24 14:42:41.385410: W tensorflow/compiler/mlir/tools/kernel_gen/tf_gpu_runtime_wrappers.cc:40] 'cuModuleLoadData(&module, data)' failed with 'CUDA_ERROR_ILLEGAL_ADDRESS'
2024-04-24 14:42:41.385517: W tensorflow/compiler/mlir/tools/kernel_gen/tf_gpu_runtime_wrappers.cc:40] 'cuModuleGetFunction(&function, module, kernel_name)' failed with 'CUDA_ERROR_INVALID_HANDLE'
2024-04-24 14:42:41.385535: W tensorflow/core/framework/op_kernel.cc:1827] INTERNAL: 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE'
2024-04-24 14:42:41.385565: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: INTERNAL: 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE'
[[{{function_node __inference_one_step_on_data_59919}}{{node functional_1_1/resnet50_1/conv1_bn_1/moments/SquaredDifference}}]]
2024-04-24 14:42:41.385615: W tensorflow/core/framework/op_kernel.cc:1827] INTERNAL: 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_ILLEGAL_ADDRESS'
Traceback (most recent call last):
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 274, in _try_run_and_update_trial
self._run_and_update_trial(trial, *fit_args, **fit_kwargs)
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 239, in _run_and_update_trial
results = self.run_trial(trial, *fit_args, **fit_kwargs)
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/tuner.py", line 314, in run_trial
obj_value = self._build_and_fit_model(trial, *args, **copied_kwargs)
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/engine/tuner.py", line 102, in _build_and_fit_model
_, history = utils.fit_with_adaptive_batch_size(
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/utils/utils.py", line 69, in fit_with_adaptive_batch_size
history = run_with_adaptive_batch_size(
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/utils/utils.py", line 82, in run_with_adaptive_batch_size
history = func(x=x, validation_data=validation_data, **fit_kwargs)
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/utils/utils.py", line 70, in <lambda>
batch_size, lambda **kwargs: model.fit(**kwargs), **fit_kwargs
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/execute.py", line 53, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:
Detected at node functional_1_1/resnet50_1/conv1_bn_1/moments/SquaredDifference defined at (most recent call last):
File "/home/groups/hanleeji/Scripts/billylau/CREST_autokeras.py", line 67, in <module>
File "/home/groups/hanleeji/Scripts/billylau/CREST_autokeras.py", line 59, in main
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/tasks/image.py", line 168, in fit
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/auto_model.py", line 303, in fit
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/engine/tuner.py", line 202, in search
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 234, in search
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 274, in _try_run_and_update_trial
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 239, in _run_and_update_trial
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/tuner.py", line 314, in run_trial
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/engine/tuner.py", line 102, in _build_and_fit_model
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/utils/utils.py", line 69, in fit_with_adaptive_batch_size
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/utils/utils.py", line 82, in run_with_adaptive_batch_size
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/utils/utils.py", line 70, in <lambda>
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 314, in fit
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 833, in __call__
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 889, in _call
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 696, in _initialize
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 178, in trace_function
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 283, in _maybe_define_function
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 310, in _create_concrete_function
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/func_graph.py", line 1059, in func_graph_from_py_func
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 599, in wrapped_fn
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/autograph_util.py", line 41, in autograph_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 339, in converted_call
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 459, in _call_unconverted
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 117, in one_step_on_iterator
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1673, in run
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/distribute/distribute_lib.py", line 3263, in call_for_each_replica
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/distribute/distribute_lib.py", line 4061, in _call_for_each_replica
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 833, in __call__
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 906, in _call
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 132, in call_function
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 178, in trace_function
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 283, in _maybe_define_function
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 310, in _create_concrete_function
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/func_graph.py", line 1059, in func_graph_from_py_func
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 599, in wrapped_fn
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/autograph_util.py", line 41, in autograph_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 331, in converted_call
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 459, in _call_unconverted
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 104, in one_step_on_data
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 51, in train_step
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/layers/layer.py", line 842, in __call__
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/ops/operation.py", line 48, in __call__
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 199, in call
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/ops/function.py", line 151, in _run_through_graph
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 589, in call
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/layers/layer.py", line 842, in __call__
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/ops/operation.py", line 48, in __call__
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 199, in call
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/ops/function.py", line 151, in _run_through_graph
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 589, in call
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/layers/layer.py", line 842, in __call__
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/ops/operation.py", line 48, in __call__
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/layers/normalization/batch_normalization.py", line 224, in call
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/layers/normalization/batch_normalization.py", line 289, in _moments
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/ops/nn.py", line 1726, in moments
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/nn.py", line 724, in moments
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/nn.py", line 767, in _compute_moments
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/dispatch.py", line 1260, in op_dispatch_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/ops/nn_impl.py", line 1315, in moments_v2
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/dispatch.py", line 1260, in op_dispatch_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/ops/nn_impl.py", line 1267, in moments
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/ops/gen_math_ops.py", line 12174, in squared_difference
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/op_def_library.py", line 796, in _apply_op_helper
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/func_graph.py", line 670, in _create_op_internal
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 2682, in _create_op_internal
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 1177, in from_node_def
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 1043, in _create_c_op
File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/tf_stack.py", line 162, in extract_stack
'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE'
[[{{node functional_1_1/resnet50_1/conv1_bn_1/moments/SquaredDifference}}]] [Op:__inference_one_step_on_iterator_61440]
2024-04-24 14:42:43.435179: E external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:819] failed to record completion event; therefore, failed to create inter-stream dependency
2024-04-24 14:42:43.435283: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:2025] failed to enqueue async memcpy from host to device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; GPU dst: 0x7f10c2a8d600; host src: 0x7f1a35200000; size: 8=0x8
2024-04-24 14:42:43.435301: E external/local_xla/xla/stream_executor/stream.cc:331] Error recording event in stream: Error recording CUDA event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2024-04-24 14:42:43.435315: E external/local_xla/xla/stream_executor/cuda/cuda_event.cc:30] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-04-24 14:42:43.435327: F tensorflow/core/common_runtime/device/device_event_mgr.cc:223] Unexpected Event status: 1
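To make the order of the failures in the output above easier to see, here is a minimal sketch that scans a log for CUDA driver failures and reports which one was logged first. The function name `summarize_cuda_failures` and the regular expression are illustrative assumptions, not part of TensorFlow or AutoKeras; the sample lines are copied from the output above with timestamps trimmed.

```python
import re
from collections import Counter

# Matches fragments such as:
#   'cuModuleLoadData(&module, data)' failed with 'CUDA_ERROR_ILLEGAL_ADDRESS'
_CUDA_FAILURE = re.compile(r"'(cu\w+)\(.*?\)' failed with '(CUDA_ERROR_\w+)'")


def summarize_cuda_failures(log_text):
    """Return (first_failure, counts) for CUDA driver failures in a log.

    first_failure is a (driver_call, error_code) tuple, or None if the log
    contains no failure; counts maps each error code to its number of hits.
    """
    first = None
    counts = Counter()
    for line in log_text.splitlines():
        for call, code in _CUDA_FAILURE.findall(line):
            if first is None:
                first = (call, code)
            counts[code] += 1
    return first, counts


# Three warning lines from the training output (timestamps trimmed).
sample = "\n".join([
    "W tensorflow/compiler/mlir/tools/kernel_gen/tf_gpu_runtime_wrappers.cc:40] "
    "'cuModuleLoadData(&module, data)' failed with 'CUDA_ERROR_ILLEGAL_ADDRESS'",
    "W tensorflow/compiler/mlir/tools/kernel_gen/tf_gpu_runtime_wrappers.cc:40] "
    "'cuModuleGetFunction(&function, module, kernel_name)' failed with "
    "'CUDA_ERROR_INVALID_HANDLE'",
    "W tensorflow/core/framework/op_kernel.cc:1827] INTERNAL: "
    "'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, "
    "reinterpret_cast<CUstream>(stream), params, nullptr)' failed with "
    "'CUDA_ERROR_INVALID_HANDLE'",
])

first, counts = summarize_cuda_failures(sample)
print(first, dict(counts))
```

On these lines the first logged failure is the `cuModuleLoadData` call with CUDA_ERROR_ILLEGAL_ADDRESS, so the CUDA_ERROR_INVALID_HANDLE messages that follow may just be follow-on errors from that first failure rather than separate problems.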
Expected Behavior
I expected the search to continue on to the remaining trials, since trial 1 completed successfully.
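One way to narrow down where the second trial fails is to re-run the script with synchronous CUDA kernel launches, so that an error is reported at the launch that caused it instead of at some later call. The sketch below assumes the standard `CUDA_LAUNCH_BLOCKING` environment variable; the helper names `blocking_cuda_env` and `rerun_with_blocking_launches` are hypothetical, and the script path and its arguments have to be supplied by the caller.

```python
import os
import subprocess
import sys


def blocking_cuda_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, which is
    # slower but ties an error to the launch that triggered it.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun_with_blocking_launches(script_args):
    """Run a Python script in a child process using the debugging environment.

    script_args is the script path followed by its usual arguments.
    """
    return subprocess.run([sys.executable, *script_args], env=blocking_cuda_env())


# Small demonstration that does not launch anything.
original = {"PATH": "/usr/bin"}
debug_env = blocking_cuda_env(original)
print(debug_env["CUDA_LAUNCH_BLOCKING"])
```

The helper copies the environment instead of mutating it, so the setting only applies to the child process and the normal (faster) asynchronous behaviour is kept everywhere else.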
Setup Details
Include the details about the versions of:
Additional context