keras-team / autokeras

AutoML library for deep learning
http://autokeras.com/
Apache License 2.0

Bug: Autokeras/TF fails with CUDA_ERROR_ILLEGAL_ADDRESS/CUDA_ERROR_INVALID_HANDLE when max_trials is not 1 #1916

Open billytcl opened 2 months ago

billytcl commented 2 months ago

Bug Description

When I use ImageClassifier in AutoKeras 2.0.0 with max_trials set to anything other than 1, the whole pipeline crashes with CUDA_ERROR_ILLEGAL_ADDRESS and CUDA_ERROR_INVALID_HANDLE errors. The first trial completes without problems, but the crash occurs as soon as the second trial starts.

Bug Reproduction

Code for reproducing the bug:

It's just boilerplate image classifier code:

    batch_size = 10
    img_height = 99674  # was 29303965
    img_width = 6

    train_data = ak.image_dataset_from_directory(
        data_dir,
        # Use 20% data as testing data.
        validation_split=0.2,
        subset="training",
        # Set seed to ensure the same split when loading testing data.
        seed=0,
        image_size=(img_height, img_width),
        batch_size=batch_size,
    )

    test_data = ak.image_dataset_from_directory(
        data_dir,
        validation_split=0.2,
        subset="validation",
        seed=0,
        image_size=(img_height, img_width),
        batch_size=batch_size,
    )

    # max_trials is not passed here, so the library default applies.
    clf = ak.ImageClassifier(
        num_classes=2,
        loss="auc",
        directory=args.out_model,
        seed=0,
    )

    clf.fit(x=train_data, validation_data=test_data)
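
For a debugging run of the script above, one option is to set a couple of environment variables before TensorFlow is imported. This is only a sketch: `CUDA_LAUNCH_BLOCKING` and `TF_FORCE_GPU_ALLOW_GROWTH` are real CUDA/TensorFlow variables, but the helper `apply_debug_env` is a hypothetical name, and there is no guarantee that either setting changes the outcome of this particular crash.

```python
import os

# Debug settings. They only take effect if set before TensorFlow is imported.
DEBUG_ENV = {
    # Make CUDA kernel launches synchronous, so an error is reported
    # closer to the kernel that actually faulted.
    "CUDA_LAUNCH_BLOCKING": "1",
    # Let TensorFlow grow GPU memory on demand instead of reserving
    # most of it up front.
    "TF_FORCE_GPU_ALLOW_GROWTH": "true",
}


def apply_debug_env(environ=os.environ):
    """Set the debug variables without overriding values that already exist.

    Returns a dict of the variables that were actually set.
    """
    applied = {}
    for name, value in DEBUG_ENV.items():
        if name not in environ:
            environ[name] = value
            applied[name] = value
    return applied


apply_debug_env()
# Import TensorFlow / AutoKeras only after this point, e.g.:
# import autokeras as ak
```

Synchronous launches slow training down noticeably, so this is meant for a single diagnostic run rather than for regular use.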

Output from training:

2024-04-24 14:40:09.349858: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-24 14:40:12.967614: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Namespace(files='/scratch/groups/hanleeji/CREST_images/chunk_100k/chunk001/', out_model='/scratch/groups/hanleeji/CREST_images/models/chunk001_ak/')
2024-04-24 14:40:20.850001: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1928] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38367 MB memory:  -> device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:8b:00.0, compute capability: 8.0
2024-04-24 14:40:24.193006: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2024-04-24 14:40:31.741614: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
Found 833 files belonging to 2 classes.
Using 667 files for training.
Found 833 files belonging to 2 classes.
Using 166 files for validation.

Search: Running Trial #1

Value             |Best Value So Far |Hyperparameter
vanilla           |vanilla           |image_block_1/block_type
True              |True              |image_block_1/normalize
False             |False             |image_block_1/augment
3                 |3                 |image_block_1/conv_block_1/kernel_size
1                 |1                 |image_block_1/conv_block_1/num_blocks
2                 |2                 |image_block_1/conv_block_1/num_layers
True              |True              |image_block_1/conv_block_1/max_pooling
False             |False             |image_block_1/conv_block_1/separable
0.25              |0.25              |image_block_1/conv_block_1/dropout
32                |32                |image_block_1/conv_block_1/filters_0_0
64                |64                |image_block_1/conv_block_1/filters_0_1
flatten           |flatten           |classification_head_1/spatial_reduction_1/reduction_type
0.5               |0.5               |classification_head_1/dropout
adam              |adam              |optimizer
0.001             |0.001             |learning_rate

Epoch 1/1000
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1713994833.007195   84052 service.cc:145] XLA service 0x7f1b04003600 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1713994833.007327   84052 service.cc:153]   StreamExecutor device (0): NVIDIA A100-PCIE-40GB, Compute Capability 8.0
2024-04-24 14:40:33.063821: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-04-24 14:40:33.754554: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:465] Loaded cuDNN version 8907
I0000 00:00:1713994847.238606   84052 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
67/67 ━━━━━━━━━━━━━━━━━━━━ 29s 209ms/step - accuracy: 0.5139 - loss: 7.1502 - val_accuracy: 0.6145 - val_loss: 0.6860
Epoch 2/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 6s 91ms/step - accuracy: 0.9425 - loss: 0.2806 - val_accuracy: 0.5964 - val_loss: 0.7732
Epoch 3/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 90ms/step - accuracy: 1.0000 - loss: 0.0206 - val_accuracy: 0.6024 - val_loss: 0.9873
Epoch 4/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 6s 88ms/step - accuracy: 1.0000 - loss: 0.0035 - val_accuracy: 0.6446 - val_loss: 1.6338
Epoch 5/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 94ms/step - accuracy: 1.0000 - loss: 0.0016 - val_accuracy: 0.6386 - val_loss: 1.4921
Epoch 6/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 92ms/step - accuracy: 1.0000 - loss: 6.3239e-04 - val_accuracy: 0.6446 - val_loss: 1.6766
Epoch 7/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 91ms/step - accuracy: 1.0000 - loss: 4.7770e-04 - val_accuracy: 0.6446 - val_loss: 1.7681
Epoch 8/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 91ms/step - accuracy: 1.0000 - loss: 2.8391e-04 - val_accuracy: 0.6446 - val_loss: 1.8877
Epoch 9/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 92ms/step - accuracy: 1.0000 - loss: 2.8112e-04 - val_accuracy: 0.6386 - val_loss: 1.9450
Epoch 10/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 93ms/step - accuracy: 1.0000 - loss: 1.9816e-04 - val_accuracy: 0.6386 - val_loss: 2.0117
Epoch 11/1000
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 93ms/step - accuracy: 1.0000 - loss: 1.6838e-04 - val_accuracy: 0.6386 - val_loss: 2.0466
2024-04-24 14:42:12.680198: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence

Trial 1 Complete [00h 01m 41s]
val_loss: 0.6860103011131287

Best val_loss So Far: 0.6860103011131287
Total elapsed time: 00h 01m 41s

Search: Running Trial #2

Value             |Best Value So Far |Hyperparameter
resnet            |vanilla           |image_block_1/block_type
True              |True              |image_block_1/normalize
True              |False             |image_block_1/augment
True              |None              |image_block_1/image_augmentation_1/horizontal_flip
True              |None              |image_block_1/image_augmentation_1/vertical_flip
0                 |None              |image_block_1/image_augmentation_1/contrast_factor
0                 |None              |image_block_1/image_augmentation_1/rotation_factor
0.1               |None              |image_block_1/image_augmentation_1/translation_factor
0                 |None              |image_block_1/image_augmentation_1/zoom_factor
False             |None              |image_block_1/res_net_block_1/pretrained
resnet50          |None              |image_block_1/res_net_block_1/version
True              |None              |image_block_1/res_net_block_1/imagenet_size
global_avg        |flatten           |classification_head_1/spatial_reduction_1/reduction_type
0                 |0.5               |classification_head_1/dropout
adam              |adam              |optimizer
0.001             |0.001             |learning_rate

Epoch 1/1000
2024-04-24 14:42:37.904335: W tensorflow/core/kernels/gpu_utils.cc:68] Failed to allocate memory for convolution redzone checking; skipping this check. This is benign and only means that we won't check cudnn for out-of-bounds reads and writes. This message will only be printed once.
2024-04-24 14:42:41.385410: W tensorflow/compiler/mlir/tools/kernel_gen/tf_gpu_runtime_wrappers.cc:40] 'cuModuleLoadData(&module, data)' failed with 'CUDA_ERROR_ILLEGAL_ADDRESS'

2024-04-24 14:42:41.385517: W tensorflow/compiler/mlir/tools/kernel_gen/tf_gpu_runtime_wrappers.cc:40] 'cuModuleGetFunction(&function, module, kernel_name)' failed with 'CUDA_ERROR_INVALID_HANDLE'

2024-04-24 14:42:41.385535: W tensorflow/core/framework/op_kernel.cc:1827] INTERNAL: 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE'
2024-04-24 14:42:41.385565: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: INTERNAL: 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE'
         [[{{function_node __inference_one_step_on_data_59919}}{{node functional_1_1/resnet50_1/conv1_bn_1/moments/SquaredDifference}}]]
2024-04-24 14:42:41.385615: W tensorflow/core/framework/op_kernel.cc:1827] INTERNAL: 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_ILLEGAL_ADDRESS'
Traceback (most recent call last):
  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 274, in _try_run_and_update_trial
    self._run_and_update_trial(trial, *fit_args, **fit_kwargs)
  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 239, in _run_and_update_trial
    results = self.run_trial(trial, *fit_args, **fit_kwargs)
  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/tuner.py", line 314, in run_trial
    obj_value = self._build_and_fit_model(trial, *args, **copied_kwargs)
  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/engine/tuner.py", line 102, in _build_and_fit_model
    _, history = utils.fit_with_adaptive_batch_size(
  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/utils/utils.py", line 69, in fit_with_adaptive_batch_size
    history = run_with_adaptive_batch_size(
  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/utils/utils.py", line 82, in run_with_adaptive_batch_size
    history = func(x=x, validation_data=validation_data, **fit_kwargs)
  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/utils/utils.py", line 70, in <lambda>
    batch_size, lambda **kwargs: model.fit(**kwargs), **fit_kwargs
  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/execute.py", line 53, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:

Detected at node functional_1_1/resnet50_1/conv1_bn_1/moments/SquaredDifference defined at (most recent call last):
  File "/home/groups/hanleeji/Scripts/billylau/CREST_autokeras.py", line 67, in <module>

  File "/home/groups/hanleeji/Scripts/billylau/CREST_autokeras.py", line 59, in main

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/tasks/image.py", line 168, in fit

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/auto_model.py", line 303, in fit

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/engine/tuner.py", line 202, in search

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 234, in search

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 274, in _try_run_and_update_trial

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 239, in _run_and_update_trial

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras_tuner/engine/tuner.py", line 314, in run_trial

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/engine/tuner.py", line 102, in _build_and_fit_model

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/utils/utils.py", line 69, in fit_with_adaptive_batch_size

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/utils/utils.py", line 82, in run_with_adaptive_batch_size

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/autokeras/utils/utils.py", line 70, in <lambda>

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 314, in fit

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 833, in __call__

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 889, in _call

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 696, in _initialize

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 178, in trace_function

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 283, in _maybe_define_function

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 310, in _create_concrete_function

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/func_graph.py", line 1059, in func_graph_from_py_func

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 599, in wrapped_fn

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/autograph_util.py", line 41, in autograph_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 339, in converted_call

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 459, in _call_unconverted

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 117, in one_step_on_iterator

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1673, in run

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/distribute/distribute_lib.py", line 3263, in call_for_each_replica

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/distribute/distribute_lib.py", line 4061, in _call_for_each_replica

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 833, in __call__

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 906, in _call

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 132, in call_function

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 178, in trace_function

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 283, in _maybe_define_function

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 310, in _create_concrete_function

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/func_graph.py", line 1059, in func_graph_from_py_func

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 599, in wrapped_fn

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/autograph_util.py", line 41, in autograph_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 331, in converted_call

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 459, in _call_unconverted

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 104, in one_step_on_data

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 51, in train_step

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/layers/layer.py", line 842, in __call__

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/ops/operation.py", line 48, in __call__

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 199, in call

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/ops/function.py", line 151, in _run_through_graph

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 589, in call

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/layers/layer.py", line 842, in __call__

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/ops/operation.py", line 48, in __call__

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 199, in call

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/ops/function.py", line 151, in _run_through_graph

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 589, in call

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/layers/layer.py", line 842, in __call__

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/ops/operation.py", line 48, in __call__

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/layers/normalization/batch_normalization.py", line 224, in call

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/layers/normalization/batch_normalization.py", line 289, in _moments

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/ops/nn.py", line 1726, in moments

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/nn.py", line 724, in moments

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/nn.py", line 767, in _compute_moments

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/dispatch.py", line 1260, in op_dispatch_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/ops/nn_impl.py", line 1315, in moments_v2

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/dispatch.py", line 1260, in op_dispatch_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/ops/nn_impl.py", line 1267, in moments

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/ops/gen_math_ops.py", line 12174, in squared_difference

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/op_def_library.py", line 796, in _apply_op_helper

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/func_graph.py", line 670, in _create_op_internal

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 2682, in _create_op_internal

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 1177, in from_node_def

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 1043, in _create_c_op

  File "/home/groups/hanleeji/Virtual_envs/autokeras_env/lib/python3.10/site-packages/tensorflow/python/util/tf_stack.py", line 162, in extract_stack

'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE'
         [[{{node functional_1_1/resnet50_1/conv1_bn_1/moments/SquaredDifference}}]] [Op:__inference_one_step_on_iterator_61440]
2024-04-24 14:42:43.435179: E external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:819] failed to record completion event; therefore, failed to create inter-stream dependency
2024-04-24 14:42:43.435283: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:2025] failed to enqueue async memcpy from host to device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; GPU dst: 0x7f10c2a8d600; host src: 0x7f1a35200000; size: 8=0x8
2024-04-24 14:42:43.435301: E external/local_xla/xla/stream_executor/stream.cc:331] Error recording event in stream: Error recording CUDA event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2024-04-24 14:42:43.435315: E external/local_xla/xla/stream_executor/cuda/cuda_event.cc:30] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-04-24 14:42:43.435327: F tensorflow/core/common_runtime/device/device_event_mgr.cc:223] Unexpected Event status: 1
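
The log above mixes several CUDA driver failures, and it can help to see which one came first. The following is a minimal sketch that tallies the `'<call>(...)' failed with '<code>'` messages in a log; the function names and the regular expression are illustrative, and `SAMPLE` holds three messages excerpted from the output above (timestamps and source locations dropped).

```python
import re
from collections import Counter

# Matches messages such as:
#   'cuModuleLoadData(&module, data)' failed with 'CUDA_ERROR_ILLEGAL_ADDRESS'
FAILED_CALL = re.compile(r"'(cu\w+)\(.*?\)' failed with '(CUDA_ERROR_\w+)'")


def tally_cuda_failures(log_text):
    """Count (driver_call, error_code) pairs found in a log."""
    counts = Counter()
    for line in log_text.splitlines():
        for call, code in FAILED_CALL.findall(line):
            counts[(call, code)] += 1
    return counts


def first_failure(log_text):
    """Return the first (driver_call, error_code) pair in the log, or None."""
    for line in log_text.splitlines():
        match = FAILED_CALL.search(line)
        if match:
            return match.group(1), match.group(2)
    return None


# Messages excerpted from the log above.
SAMPLE = """\
'cuModuleLoadData(&module, data)' failed with 'CUDA_ERROR_ILLEGAL_ADDRESS'
'cuModuleGetFunction(&function, module, kernel_name)' failed with 'CUDA_ERROR_INVALID_HANDLE'
'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE'
"""

if __name__ == "__main__":
    print(first_failure(SAMPLE))
    for (call, code), n in tally_cuda_failures(SAMPLE).items():
        print(f"{n}x {call}: {code}")
```

In this log the first failing call is `cuModuleLoadData` with `CUDA_ERROR_ILLEGAL_ADDRESS`. That ordering suggests, but does not prove, that the later `CUDA_ERROR_INVALID_HANDLE` messages are follow-on errors from a module that never loaded.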

Expected Behavior

I expect the search to continue through the remaining trials, since trial 1 completed successfully.
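
For context on trial 1: it ends after 11 of 1000 epochs, with the best val_loss reached in epoch 1. That pattern is consistent with patience-based early stopping rather than with a failure. The sketch below replays the val_loss values from the log through a simple early-stopping rule. The `patience=10` and `min_delta=1e-4` values are assumptions chosen because they match the log, and `stopping_epoch` is a hypothetical helper, not an AutoKeras function.

```python
def stopping_epoch(val_losses, patience=10, min_delta=1e-4):
    """Simulate patience-based early stopping on a list of validation losses.

    Returns (epoch_stopped, best_epoch, best_loss) with 1-indexed epochs.
    epoch_stopped is None if the rule never triggers.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch, best_loss
    return None, best_epoch, best_loss


# val_loss per epoch, copied from the trial 1 output above.
VAL_LOSSES = [
    0.6860, 0.7732, 0.9873, 1.6338, 1.4921, 1.6766,
    1.7681, 1.8877, 1.9450, 2.0117, 2.0466,
]

if __name__ == "__main__":
    print(stopping_epoch(VAL_LOSSES))
```

Under these assumptions the rule stops at epoch 11 with the best loss from epoch 1, which matches the reported "Best val_loss So Far" of about 0.6860.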
