Open tq3940 opened 3 months ago
I am training the pre-trained model: centernet_hg104_512x512_coco17_tpu-8
I repeated my first attempt, this time adding tf.compat.v2.keras.mixed_precision.set_global_policy('mixed_float16')
in train_loop() (model_lib_v2.py), and got the following message, which seemed to show that mixed precision was enabled successfully:
INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: NVIDIA GeForce RTX 4090, compute capability 8.9
I0603 20:37:33.016378 140087643665600 device_compatibility_check.py:130] Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: NVIDIA GeForce RTX 4090, compute capability 8.9
HOWEVER! The same error appeared again!!
Traceback (most recent call last):
File "scripts/model_main_tf2.py", line 133, in <module>
tf.compat.v1.app.run()
File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/platform/app.py", line 36, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/root/miniconda3/lib/python3.8/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/root/miniconda3/lib/python3.8/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "scripts/model_main_tf2.py", line 105, in main
model_lib_v2.train_loop(
File "/root/miniconda3/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 609, in train_loop
load_fine_tune_checkpoint(
File "/root/miniconda3/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 401, in load_fine_tune_checkpoint
_ensure_model_is_built(model, input_dataset, unpad_groundtruth_tensors)
File "/root/miniconda3/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 176, in _ensure_model_is_built
strategy.run(
File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1673, in run
return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 3250, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 696, in _call_for_each_replica
return mirrored_run.call_for_each_replica(
File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_run.py", line 84, in call_for_each_replica
return wrapped(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/tmp/__autograph_generated_file7sbf228o.py", line 14, in tf__wrapped_fn
retval_ = ag__.converted_call(ag__.ld(call_for_each_replica), (ag__.ld(strategy), ag__.ld(fn).python_function, ag__.ld(args), ag__.ld(kwargs)), None, fscope)
File "/tmp/__autograph_generated_file3vmq98ob.py", line 18, in tf___dummy_computation_fn
retval_ = ag__.converted_call(ag__.ld(_compute_losses_and_predictions_dicts), (ag__.ld(model), ag__.ld(features), ag__.ld(labels)), dict(training_step=0), fscope)
File "/tmp/__autograph_generated_filen7syj8px.py", line 16, in tf___compute_losses_and_predictions_dicts
losses_dict = ag__.converted_call(ag__.ld(model).loss, (ag__.ld(prediction_dict), ag__.ld(features)[ag__.ld(fields).InputDataFields.true_image_shape]), None, fscope)
File "/tmp/__autograph_generated_filekxlnseiw.py", line 16, in tf__loss
object_center_loss = ag__.converted_call(ag__.ld(self)._compute_object_center_loss, (), dict(object_center_predictions=ag__.ld(prediction_dict)[ag__.ld(OBJECT_CENTER)], input_height=ag__.ld(input_height), input_width=ag__.ld(input_width), per_pixel_weights=ag__.ld(valid_anchor_weights), maximum_normalized_coordinate=ag__.ld(maximum_normalized_coordinate)), fscope)
File "/tmp/__autograph_generated_file9k9b4p7p.py", line 77, in tf___compute_object_center_loss
ag__.for_stmt(ag__.ld(object_center_predictions), None, loop_body, get_state_2, set_state_2, ('loss',), {'iterate_names': 'pred'})
File "/tmp/__autograph_generated_file9k9b4p7p.py", line 75, in loop_body
loss += ag__.converted_call(object_center_loss, (pred, flattened_heatmap_targets), dict(weights=per_pixel_weights), fscope)
File "/tmp/__autograph_generated_filelc2if2ji.py", line 69, in tf____call__
retval_ = ag__.converted_call(ag__.ld(self)._compute_loss, (ag__.ld(prediction_tensor), ag__.ld(target_tensor)), dict(**ag__.ld(params)), fscope)
File "/tmp/__autograph_generated_file3o0kthr0.py", line 15, in tf___compute_loss
negative_loss = ((ag__.converted_call(ag__.ld(tf).math.pow, ((1 - ag__.ld(target_tensor)), ag__.ld(self)._beta), None, fscope) * ag__.converted_call(ag__.ld(tf).math.pow, (ag__.ld(prediction_tensor), ag__.ld(self)._alpha), None, fscope)) * ag__.converted_call(ag__.ld(tf).math.log, ((1 - ag__.ld(prediction_tensor)),), None, fscope))
TypeError: in user code:
File "/root/miniconda3/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 171, in _dummy_computation_fn *
return _compute_losses_and_predictions_dicts(model, features, labels,
File "/root/miniconda3/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 130, in _compute_losses_and_predictions_dicts *
losses_dict = model.loss(
File "/root/miniconda3/lib/python3.8/site-packages/object_detection/meta_architectures/center_net_meta_arch.py", line 3967, in loss *
object_center_loss = self._compute_object_center_loss(
File "/root/miniconda3/lib/python3.8/site-packages/object_detection/meta_architectures/center_net_meta_arch.py", line 3099, in _compute_object_center_loss *
loss += object_center_loss(
File "/root/miniconda3/lib/python3.8/site-packages/object_detection/core/losses.py", line 94, in __call__ *
return self._compute_loss(prediction_tensor, target_tensor, **params)
File "/root/miniconda3/lib/python3.8/site-packages/object_detection/core/losses.py", line 855, in _compute_loss *
negative_loss = (tf.math.pow((1 - target_tensor), self._beta)*
TypeError: Input 'y' of 'Mul' Op has type float16 that does not match type float32 of argument 'x'.
WHY??? 😭😭😭 Who can help me?? I need your help!
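The final TypeError says that the two operands of the multiplication in _compute_loss have different dtypes: one float32, one float16. Below is a minimal sketch of that mismatch and of one possible workaround, casting the prediction to float32 before the loss arithmetic. The tensors `prediction` and `target` and the exponents 2.0 and 4.0 are made-up stand-ins, not values taken from object_detection, and the workaround has not been tested against the actual losses.py.

```python
import tensorflow as tf

# Hypothetical stand-ins: under mixed_float16 a network output is float16,
# while a ground-truth heatmap built in the input pipeline stays float32.
prediction = tf.constant([0.2, 0.7], dtype=tf.float16)
target = tf.constant([0.0, 1.0], dtype=tf.float32)

# Multiplying a float32 tensor by a float16 tensor is rejected by TensorFlow.
# Eager mode raises InvalidArgumentError; graph tracing raises TypeError.
try:
    _ = tf.math.pow(1.0 - target, 4.0) * tf.math.pow(prediction, 2.0)
    mismatch_raised = False
except (TypeError, tf.errors.InvalidArgumentError):
    mismatch_raised = True

# Possible workaround (an assumption, not the library's official fix):
# cast the prediction to float32 so the whole loss is computed in float32.
prediction32 = tf.cast(prediction, tf.float32)
negative_loss = (tf.math.pow(1.0 - target, 4.0)
                 * tf.math.pow(prediction32, 2.0)
                 * tf.math.log(1.0 - prediction32))

print("mismatch raised:", mismatch_raised)
print("loss dtype:", negative_loss.dtype)
```

Computing the loss in float32 is also what the mixed precision guide generally recommends for numerically sensitive operations, so a cast at the start of the loss function seems a reasonable place to try.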
Prerequisites
Please answer the following questions for yourself before submitting an issue.
1. The entire URL of the file you are using
https://github.com/tensorflow/models/tree/master/research/object_detection
2. Describe the bug
I'm trying to use "mixed_float16" to speed up training on an RTX 4090. Following the official mixed_precision guide, I added the line
mixed_precision.set_global_policy('mixed_float16')
before tf.compat.v1.app.run() in my train_tf2.py. However, TensorFlow reported the TypeError shown in the traceback above.
I also tried adding
tf.compat.v2.keras.mixed_precision.set_global_policy('mixed_float16')
which I adapted from the line
tf.compat.v2.keras.mixed_precision.set_global_policy('mixed_bfloat16')
found in model_lib_v2.py, and setting an environment variable with
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'
as suggested in this answer. All of these attempts failed with the same error. I want to know how to solve this issue.
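As a rough illustration of what the global policy changes, the sketch below sets mixed_float16 and inspects a small Keras layer. The Dense layer and the input shape are hypothetical and unrelated to CenterNet. The point is only that layer outputs become float16 while variables stay float32, which may explain why the loss ends up with float16 predictions next to float32 targets.

```python
import tensorflow as tf

# Enable mixed precision globally, as in the attempts described above.
tf.keras.mixed_precision.set_global_policy('mixed_float16')
policy = tf.keras.mixed_precision.global_policy()
compute_dtype = policy.compute_dtype      # dtype used for layer computations
variable_dtype = policy.variable_dtype    # dtype used for layer weights

# Hypothetical toy layer: the float32 input is cast to float16 by the layer.
layer = tf.keras.layers.Dense(4)
outputs = layer(tf.ones([2, 3]))
output_dtype = outputs.dtype

print("compute dtype:", compute_dtype)
print("variable dtype:", variable_dtype)
print("layer output dtype:", output_dtype)

# Restore the default policy so later code in the same process is unaffected.
tf.keras.mixed_precision.set_global_policy('float32')
```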
3. Steps to reproduce
Add the line
mixed_precision.set_global_policy('mixed_float16')
before tf.compat.v1.app.run() in my train_tf2.py.
4. Expected behavior
The model can be trained in "mixed precision" mode.
5. Additional context
None
6. System information