tensorflow / models

Models and examples built with TensorFlow

Unable to use "mixed_float16" in Object Detection API #11215

Open tq3940 opened 3 months ago

tq3940 commented 3 months ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/tree/master/research/object_detection

2. Describe the bug

I'm trying to use "mixed_float16" to speed up training on an RTX 4090. Following the official mixed_precision guide, I added mixed_precision.set_global_policy('mixed_float16') just before tf.compat.v1.app.run() in my train_tf2.py. However, TensorFlow reported the following error:

        return _compute_losses_and_predictions_dicts(model, features, labels,
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 130, in _compute_losses_and_predictions_dicts  *
        losses_dict = model.loss(
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/meta_architectures/center_net_meta_arch.py", line 3967, in loss  *
        object_center_loss = self._compute_object_center_loss(
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/meta_architectures/center_net_meta_arch.py", line 3099, in _compute_object_center_loss  *
        loss += object_center_loss(
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/core/losses.py", line 94, in __call__  *
        return self._compute_loss(prediction_tensor, target_tensor, **params)
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/core/losses.py", line 855, in _compute_loss  *
        negative_loss = (tf.math.pow((1 - target_tensor), self._beta)*

    TypeError: Input 'y' of 'Mul' Op has type float16 that does not match type float32 of argument 'x'.

I also tried adding tf.compat.v2.keras.mixed_precision.set_global_policy('mixed_float16'), which I adapted from the tf.compat.v2.keras.mixed_precision.set_global_policy('mixed_bfloat16') call found in model_lib_v2.py,

or setting an environment variable with os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1', as suggested in this answer.

But all of my attempts failed with the error above. I want to know how to solve this issue.

3. Steps to reproduce

Add mixed_precision.set_global_policy('mixed_float16') just before tf.compat.v1.app.run() in train_tf2.py.
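For readers unfamiliar with what this call changes, here is a minimal standalone sketch, separate from the Object Detection API. It assumes only a stock TensorFlow 2 install; the Dense layer and input shape are hypothetical and chosen purely for illustration. Under the 'mixed_float16' policy, Keras layers compute in float16 while keeping their variables in float32, so layer outputs come back as float16.

```python
import tensorflow as tf

mixed_precision = tf.keras.mixed_precision

# Same call as in the reproduction step above.
mixed_precision.set_global_policy('mixed_float16')
policy = mixed_precision.global_policy()
compute_dtype = policy.compute_dtype    # dtype used for layer computations
variable_dtype = policy.variable_dtype  # dtype used to store layer weights

# Hypothetical toy layer, only to show the dtypes a layer produces.
layer = tf.keras.layers.Dense(4)
output = layer(tf.zeros([1, 3]))  # float32 input is cast by the layer
output_dtype = output.dtype
kernel_dtype = layer.kernel.dtype

# Restore the default so later code in the same process is unaffected.
mixed_precision.set_global_policy('float32')

print(compute_dtype, variable_dtype, output_dtype, kernel_dtype)
```

Any code downstream of such a layer that mixes its float16 output with float32 tensors, without an explicit cast, can hit a dtype mismatch.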

4. Expected behavior

The model can be trained in "mixed precision" mode.

5. Additional context

None

6. System information

tq3940 commented 3 months ago

I am training the pre-trained model: centernet_hg104_512x512_coco17_tpu-8

tq3940 commented 3 months ago

I repeated my first attempt, this time adding tf.compat.v2.keras.mixed_precision.set_global_policy('mixed_float16') inside train_loop() in model_lib_v2.py, and got the following message, which seems to indicate that mixed precision was enabled successfully:

INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: NVIDIA GeForce RTX 4090, compute capability 8.9
I0603 20:37:33.016378 140087643665600 device_compatibility_check.py:130] Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: NVIDIA GeForce RTX 4090, compute capability 8.9

HOWEVER! The same error appeared again!!

Traceback (most recent call last):
  File "scripts/model_main_tf2.py", line 133, in <module>
    tf.compat.v1.app.run()
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/platform/app.py", line 36, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/root/miniconda3/lib/python3.8/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/root/miniconda3/lib/python3.8/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "scripts/model_main_tf2.py", line 105, in main
    model_lib_v2.train_loop(
  File "/root/miniconda3/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 609, in train_loop
    load_fine_tune_checkpoint(
  File "/root/miniconda3/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 401, in load_fine_tune_checkpoint
    _ensure_model_is_built(model, input_dataset, unpad_groundtruth_tensors)
  File "/root/miniconda3/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 176, in _ensure_model_is_built
    strategy.run(
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1673, in run
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 3250, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 696, in _call_for_each_replica
    return mirrored_run.call_for_each_replica(
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_run.py", line 84, in call_for_each_replica
    return wrapped(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/tmp/__autograph_generated_file7sbf228o.py", line 14, in tf__wrapped_fn
    retval_ = ag__.converted_call(ag__.ld(call_for_each_replica), (ag__.ld(strategy), ag__.ld(fn).python_function, ag__.ld(args), ag__.ld(kwargs)), None, fscope)
  File "/tmp/__autograph_generated_file3vmq98ob.py", line 18, in tf___dummy_computation_fn
    retval_ = ag__.converted_call(ag__.ld(_compute_losses_and_predictions_dicts), (ag__.ld(model), ag__.ld(features), ag__.ld(labels)), dict(training_step=0), fscope)
  File "/tmp/__autograph_generated_filen7syj8px.py", line 16, in tf___compute_losses_and_predictions_dicts
    losses_dict = ag__.converted_call(ag__.ld(model).loss, (ag__.ld(prediction_dict), ag__.ld(features)[ag__.ld(fields).InputDataFields.true_image_shape]), None, fscope)
  File "/tmp/__autograph_generated_filekxlnseiw.py", line 16, in tf__loss
    object_center_loss = ag__.converted_call(ag__.ld(self)._compute_object_center_loss, (), dict(object_center_predictions=ag__.ld(prediction_dict)[ag__.ld(OBJECT_CENTER)], input_height=ag__.ld(input_height), input_width=ag__.ld(input_width), per_pixel_weights=ag__.ld(valid_anchor_weights), maximum_normalized_coordinate=ag__.ld(maximum_normalized_coordinate)), fscope)
  File "/tmp/__autograph_generated_file9k9b4p7p.py", line 77, in tf___compute_object_center_loss
    ag__.for_stmt(ag__.ld(object_center_predictions), None, loop_body, get_state_2, set_state_2, ('loss',), {'iterate_names': 'pred'})
  File "/tmp/__autograph_generated_file9k9b4p7p.py", line 75, in loop_body
    loss += ag__.converted_call(object_center_loss, (pred, flattened_heatmap_targets), dict(weights=per_pixel_weights), fscope)
  File "/tmp/__autograph_generated_filelc2if2ji.py", line 69, in tf____call__
    retval_ = ag__.converted_call(ag__.ld(self)._compute_loss, (ag__.ld(prediction_tensor), ag__.ld(target_tensor)), dict(**ag__.ld(params)), fscope)
  File "/tmp/__autograph_generated_file3o0kthr0.py", line 15, in tf___compute_loss
    negative_loss = ((ag__.converted_call(ag__.ld(tf).math.pow, ((1 - ag__.ld(target_tensor)), ag__.ld(self)._beta), None, fscope) * ag__.converted_call(ag__.ld(tf).math.pow, (ag__.ld(prediction_tensor), ag__.ld(self)._alpha), None, fscope)) * ag__.converted_call(ag__.ld(tf).math.log, ((1 - ag__.ld(prediction_tensor)),), None, fscope))
TypeError: in user code:

    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 171, in _dummy_computation_fn  *
        return _compute_losses_and_predictions_dicts(model, features, labels,
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 130, in _compute_losses_and_predictions_dicts  *
        losses_dict = model.loss(
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/meta_architectures/center_net_meta_arch.py", line 3967, in loss  *
        object_center_loss = self._compute_object_center_loss(
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/meta_architectures/center_net_meta_arch.py", line 3099, in _compute_object_center_loss  *
        loss += object_center_loss(
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/core/losses.py", line 94, in __call__  *
        return self._compute_loss(prediction_tensor, target_tensor, **params)
    File "/root/miniconda3/lib/python3.8/site-packages/object_detection/core/losses.py", line 855, in _compute_loss  *
        negative_loss = (tf.math.pow((1 - target_tensor), self._beta)*

    TypeError: Input 'y' of 'Mul' Op has type float16 that does not match type float32 of argument 'x'.

WHY??? 😭😭😭 Who can help me?? I need your help!
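Reading the error message against the failing line, the left operand of the multiplication (built from target_tensor) appears to be float32 while the right operand (built from prediction_tensor) is float16, which is what one would expect if the model outputs float16 under the policy while the ground-truth targets stay float32. The sketch below reproduces that mismatch outside the Object Detection API. The function name negative_term, the sample values, and the alpha/beta defaults are assumptions made for illustration; only the expression itself is copied from the traceback. The final cast is one possible workaround that has not been confirmed in this thread or by the maintainers.

```python
import tensorflow as tf


def negative_term(prediction_tensor, target_tensor, alpha=2.0, beta=4.0):
    # Mirrors the expression shown in the traceback (losses.py, _compute_loss).
    # alpha and beta here are illustrative defaults, not read from any config.
    return (tf.math.pow(1 - target_tensor, beta)
            * tf.math.pow(prediction_tensor, alpha)
            * tf.math.log(1 - prediction_tensor))


# Hypothetical sample values: float16 predictions, float32 targets.
pred16 = tf.constant([0.25, 0.5], dtype=tf.float16)
target32 = tf.constant([0.0, 1.0], dtype=tf.float32)

try:
    negative_term(pred16, target32)
    mismatch_raised = False
except (TypeError, tf.errors.InvalidArgumentError):
    # Graph mode raises TypeError (as in the traceback);
    # eager mode raises InvalidArgumentError for the same mismatch.
    mismatch_raised = True

# Possible workaround (an assumption): cast predictions to the target dtype
# before the loss is computed, so both operands are float32.
fixed = negative_term(tf.cast(pred16, target32.dtype), target32)

print(mismatch_raised, fixed.dtype)
```

In the real code the equivalent cast would have to happen before prediction_tensor reaches _compute_loss, for example where the predictions are handed to the loss. Whether that is the right place in center_net_meta_arch.py is left open here.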