keras-team / keras-cv

Industry-strength Computer Vision workflows with Keras

Use jit_compile=False for CenterNet to avoid XLA compilation during training #2406

Closed: TillBeemelmanns closed this 6 months ago

TillBeemelmanns commented 8 months ago

I am running into problems when using keras-cv/examples/training/object_detection_3d/waymo/train_pillars.py with Keras 3. Some of the layers (probably the voxelization layer) constantly trigger XLA recompilation, causing very long step times and eventually an OOM crash. With jit_compile=False the problem does not appear.
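For reference, the workaround amounts to a single argument in model.compile. A minimal sketch (the toy model and loss below are placeholders, not the actual train_pillars.py setup):

```python
import keras

# Placeholder model: in train_pillars.py this would be the
# MultiHeadCenterPillar model with its heatmap/box losses.
model = keras.Sequential([keras.layers.Dense(1)])

model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.001),
    loss="mse",
    # The key change: disable XLA. With jit_compile=True, the
    # variable-length point-cloud inputs appear to keep retriggering
    # XLA compilation of the voxelization layer, which blows up step
    # time and eventually memory.
    jit_compile=False,
)
```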

jit_compile=True

Model: "multi_head_center_pillar"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                  ┃ Output Shape              ┃         Param # ┃ Connected to               ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ point_mask (InputLayer)       │ (None, None, 1)           │               0 │ -                          │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ point_xyz (InputLayer)        │ (None, None, 3)           │               0 │ -                          │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ point_feature (InputLayer)    │ (None, None, 1)           │               0 │ -                          │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ get_item (GetItem)            │ (None, None)              │               0 │ point_mask[0][0]           │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ dynamic_voxelization          │ (None, 512, 512, 128)     │           1,152 │ point_xyz[0][0],           │
│ (DynamicVoxelization)         │                           │                 │ point_feature[0][0],       │
│                               │                           │                 │ get_item[0][0]             │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ center_pillar_backbone        │ (None, 512, 512, 256)     │      19,286,656 │ dynamic_voxelization[0][0] │
│ (CenterPillarBackbone)        │                           │                 │                            │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ detection_head                │ [(None, 512, 512, 32),    │          12,336 │ center_pillar_backbone[0]… │
│ (MultiClassDetectionHead)     │ (None, 512, 512, 16)]     │                 │                            │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ box_class_1 (Identity)        │ (None, 512, 512, 32)      │               0 │ detection_head[0][0]       │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ box_class_2 (Identity)        │ (None, 512, 512, 16)      │               0 │ detection_head[0][1]       │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ heatmap_class_1 (Identity)    │ (None, 512, 512, 32)      │               0 │ detection_head[0][0]       │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ heatmap_class_2 (Identity)    │ (None, 512, 512, 16)      │               0 │ detection_head[0][1]       │
└───────────────────────────────┴───────────────────────────┴─────────────────┴────────────────────────────┘
 Total params: 19,300,144 (73.62 MB)
 Trainable params: 19,282,224 (73.56 MB)
 Non-trainable params: 17,920 (70.00 KB)
Epoch 1/50

Epoch 1: LearningRateScheduler setting learning rate to 0.0009048374610134307.
I0000 00:00:1711905600.921507 2492871 service.cc:145] XLA service 0x55e59d2bf900 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1711905600.921579 2492871 service.cc:153]   StreamExecutor device (0): NVIDIA A100-SXM4-40GB, Compute Capability 8.0
2024-03-31 17:20:14.187293: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var MLIR_CRASH_REPRODUCER_DIRECTORY to enable.
2024-03-31 17:20:22.313706: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Constant folding an instruction is taking > 1s:

  %divide.2635 = f32[4,199600,3]{2,1,0} divide(f32[4,199600,3]{2,1,0} %constant.2632, f32[4,199600,3]{2,1,0} %broadcast.2634), metadata={op_type="RealDiv" op_name="multi_head_center_pillar_1/dynamic_voxelization_1/point_to_voxel_1/truediv" source_file="/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py" source_line=1177}

This isn't necessarily a bug; constant-folding is inherently a trade-off between compilation time and speed at runtime. XLA has some guards that attempt to keep constant folding from taking too long, but fundamentally you'll always be able to come up with an input program that takes a long time.

If you'd like to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
2024-03-31 17:20:24.539890: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 3.22656167s
Constant folding an instruction is taking > 1s:

  %divide.2635 = f32[4,199600,3]{2,1,0} divide(f32[4,199600,3]{2,1,0} %constant.2632, f32[4,199600,3]{2,1,0} %broadcast.2634), metadata={op_type="RealDiv" op_name="multi_head_center_pillar_1/dynamic_voxelization_1/point_to_voxel_1/truediv" source_file="/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py" source_line=1177}

This isn't necessarily a bug; constant-folding is inherently a trade-off between compilation time and speed at runtime. XLA has some guards that attempt to keep constant folding from taking too long, but fundamentally you'll always be able to come up with an input program that takes a long time.

If you'd like to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
2024-03-31 17:20:26.540595: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Constant folding an instruction is taking > 2s:

  %add.2638 = f32[4,199600,3]{2,1,0} add(f32[4,199600,3]{2,1,0} %constant.116, f32[4,199600,3]{2,1,0} %broadcast.2637), metadata={op_type="AddV2" op_name="multi_head_center_pillar_1/dynamic_voxelization_1/point_to_voxel_1/add" source_file="/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py" source_line=1177}

This isn't necessarily a bug; constant-folding is inherently a trade-off between compilation time and speed at runtime. XLA has some guards that attempt to keep constant folding from taking too long, but fundamentally you'll always be able to come up with an input program that takes a long time.

If you'd like to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
2024-03-31 17:20:27.899858: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 3.359643651s
Constant folding an instruction is taking > 2s:

  %add.2638 = f32[4,199600,3]{2,1,0} add(f32[4,199600,3]{2,1,0} %constant.116, f32[4,199600,3]{2,1,0} %broadcast.2637), metadata={op_type="AddV2" op_name="multi_head_center_pillar_1/dynamic_voxelization_1/point_to_voxel_1/add" source_file="/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py" source_line=1177}

This isn't necessarily a bug; constant-folding is inherently a trade-off between compilation time and speed at runtime. XLA has some guards that attempt to keep constant folding from taking too long, but fundamentally you'll always be able to come up with an input program that takes a long time.

If you'd like to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
2024-03-31 17:21:02.275112: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[4,128,512,512]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,128,512,512]{3,2,1,0}, f32[128,128,3,3]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:02.328674: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 1.053667725s
Trying algorithm eng0{} for conv (f32[4,128,512,512]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,128,512,512]{3,2,1,0}, f32[128,128,3,3]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:05.057390: W external/local_tsl/tsl/framework/bfc_allocator.cc:368] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
2024-03-31 17:21:06.057462: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng3{k11=0} for conv (f32[4,256,128,128]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,128,128]{3,2,1,0}, f32[256,256,3,3]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:06.214816: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 1.157470019s
Trying algorithm eng3{k11=0} for conv (f32[4,256,128,128]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,128,128]{3,2,1,0}, f32[256,256,3,3]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:10.765531: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 130.03GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-03-31 17:21:12.544249: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[4,256,257,257]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,512,128,128]{3,2,1,0}, f32[512,256,3,3]{3,2,1,0}), window={size=3x3 stride=2x2}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardInput", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:13.134092: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 1.58994098s
Trying algorithm eng0{} for conv (f32[4,256,257,257]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,512,128,128]{3,2,1,0}, f32[512,256,3,3]{3,2,1,0}), window={size=3x3 stride=2x2}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardInput", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:17.903658: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[4,256,513,513]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,256,256]{3,2,1,0}, f32[256,256,3,3]{3,2,1,0}), window={size=3x3 stride=2x2}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardInput", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:19.991362: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 3.087802008s
Trying algorithm eng0{} for conv (f32[4,256,513,513]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,256,256]{3,2,1,0}, f32[256,256,3,3]{3,2,1,0}), window={size=3x3 stride=2x2}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardInput", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:25.011483: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[4,256,512,512]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,512,512]{3,2,1,0}, f32[256,256,3,3]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:28.214537: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 4.203145385s
Trying algorithm eng0{} for conv (f32[4,256,512,512]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,512,512]{3,2,1,0}, f32[256,256,3,3]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:35.614430: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[4,256,256,256]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,513,513]{3,2,1,0}, f32[256,256,3,3]{3,2,1,0}), window={size=3x3 stride=2x2}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:35.658271: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 1.044057506s
Trying algorithm eng0{} for conv (f32[4,256,256,256]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,513,513]{3,2,1,0}, f32[256,256,3,3]{3,2,1,0}), window={size=3x3 stride=2x2}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:47.787391: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[128,128,3,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,128,512,512]{3,2,1,0}, f32[4,128,512,512]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:47.887323: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 1.099978354s
Trying algorithm eng0{} for conv (f32[128,128,3,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,128,512,512]{3,2,1,0}, f32[4,128,512,512]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:55.627222: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[512,512,3,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,512,128,128]{3,2,1,0}, f32[4,512,128,128]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:55.711771: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 1.084633684s
Trying algorithm eng0{} for conv (f32[512,512,3,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,512,128,128]{3,2,1,0}, f32[4,512,128,128]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:55.874372: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 129.14GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-03-31 17:21:59.427328: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[256,256,3,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,256,256]{3,2,1,0}, f32[4,256,256,256]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:59.512810: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 1.085548719s
Trying algorithm eng0{} for conv (f32[256,256,3,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,256,256]{3,2,1,0}, f32[4,256,256,256]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:22:03.639446: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[256,256,3,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,513,513]{3,2,1,0}, f32[4,256,256,256]{3,2,1,0}), window={size=3x3 stride=2x2}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:22:03.840970: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 1.201576719s
Trying algorithm eng0{} for conv (f32[256,256,3,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,513,513]{3,2,1,0}, f32[4,256,256,256]{3,2,1,0}), window={size=3x3 stride=2x2}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:22:08.992589: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[256,256,3,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,512,512]{3,2,1,0}, f32[4,256,512,512]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:22:12.369822: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 4.377403431s
Trying algorithm eng0{} for conv (f32[256,256,3,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,512,512]{3,2,1,0}, f32[4,256,512,512]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
    4/39520 ━━━━━━━━━━━━━━━━━━━━ 484:59:51 44s/step - loss: 1106.4576
2024-03-31 17:25:18.613809: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 576.0KiB (589824 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

jit_compile=False

Model: "multi_head_center_pillar"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                  ┃ Output Shape              ┃         Param # ┃ Connected to               ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ point_mask (InputLayer)       │ (None, None, 1)           │               0 │ -                          │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ point_xyz (InputLayer)        │ (None, None, 3)           │               0 │ -                          │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ point_feature (InputLayer)    │ (None, None, 1)           │               0 │ -                          │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ get_item (GetItem)            │ (None, None)              │               0 │ point_mask[0][0]           │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ dynamic_voxelization          │ (None, 512, 512, 128)     │           1,152 │ point_xyz[0][0],           │
│ (DynamicVoxelization)         │                           │                 │ point_feature[0][0],       │
│                               │                           │                 │ get_item[0][0]             │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ center_pillar_backbone        │ (None, 512, 512, 256)     │      19,286,656 │ dynamic_voxelization[0][0] │
│ (CenterPillarBackbone)        │                           │                 │                            │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ detection_head                │ [(None, 512, 512, 32),    │          12,336 │ center_pillar_backbone[0]… │
│ (MultiClassDetectionHead)     │ (None, 512, 512, 16)]     │                 │                            │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ box_class_1 (Identity)        │ (None, 512, 512, 32)      │               0 │ detection_head[0][0]       │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ box_class_2 (Identity)        │ (None, 512, 512, 16)      │               0 │ detection_head[0][1]       │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ heatmap_class_1 (Identity)    │ (None, 512, 512, 32)      │               0 │ detection_head[0][0]       │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ heatmap_class_2 (Identity)    │ (None, 512, 512, 16)      │               0 │ detection_head[0][1]       │
└───────────────────────────────┴───────────────────────────┴─────────────────┴────────────────────────────┘
 Total params: 19,300,144 (73.62 MB)
 Trainable params: 19,282,224 (73.56 MB)
 Non-trainable params: 17,920 (70.00 KB)
Epoch 1/50

Epoch 1: LearningRateScheduler setting learning rate to 0.0009048374610134307.
  149/39520 ━━━━━━━━━━━━━━━━━━━━ 4:04:26 373ms/step - loss: 270.9218  

@divyashreepathihalli @sampathweb

google-cla[bot] commented 8 months ago

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

divyashreepathihalli commented 6 months ago

Hi @TillBeemelmanns, it is up to the users to turn jit_compile on or off. It will be on by default.
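For reference, the user-facing control is the jit_compile argument to model.compile (a minimal sketch with a placeholder toy model):

```python
import keras

model = keras.Sequential([keras.layers.Dense(1)])  # placeholder model

# Users choose per model at compile time; omitting the argument keeps
# the framework default.
model.compile(optimizer="adam", loss="mse", jit_compile=True)   # force XLA on
model.compile(optimizer="adam", loss="mse", jit_compile=False)  # opt out of XLA
```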

TillBeemelmanns commented 6 months ago

Yes, sure, but this keras-cv example does not work with the default jit_compile=True. Hence, it is necessary to change the default for this script.