Closed by innat 1 year ago
@sushreebarsa Let me know if you're able to reproduce the reported issue.
This is unexpected: it occurs in TF 2.11 but works in TF 2.9. Kaggle recently upgraded TF from 2.9 to 2.11, so code that ran last week on TF 2.9 is breaking this week on TF 2.11. Damn!
Hi @innat ,
I have tested the code snippet you provided with TF 2.9, and I am getting a different error, related to NCCL all_reduce. Please check and confirm.
(tf2.9) suryanarayanay@surya-ubuntu-22-04:~$ python3 17603.py
2023-02-27 08:53:21.351970: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2.9.2
2023-02-27 08:53:30.379104: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-27 08:53:32.857054: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38251 MB memory: -> device: 0, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:00:04.0, compute capability: 8.0
2023-02-27 08:53:32.862592: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 38251 MB memory: -> device: 1, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:00:05.0, compute capability: 8.0
2023-02-27 08:53:34.510546: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:776] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Did not find a shardable source, walked to a node which is not a dataset: name: "FlatMapDataset/_9"
op: "FlatMapDataset"
input: "PrefetchDataset/_8"
attr {
key: "Targuments"
value {
list {
}
}
}
attr {
key: "_cardinality"
value {
i: -2
}
}
attr {
key: "f"
value {
func {
name: "__inference_Dataset_flat_map_slice_batch_indices_277"
}
}
}
attr {
key: "metadata"
value {
s: "\n\020FlatMapDataset:4"
}
}
attr {
key: "output_shapes"
value {
list {
shape {
dim {
size: -1
}
}
}
}
}
attr {
key: "output_types"
value {
list {
type: DT_INT64
}
}
}
experimental_type {
type_id: TFT_PRODUCT
args {
type_id: TFT_DATASET
args {
type_id: TFT_PRODUCT
args {
type_id: TFT_TENSOR
args {
type_id: TFT_INT64
}
}
}
}
}
. Consider either turning off auto-sharding or switching the auto_shard_policy to DATA to shard this dataset. You can do this by creating a new `tf.data.Options()` object then setting `options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA` before applying the options object to the dataset via `dataset.with_options(options)`.
Epoch 1/5
2023-02-27 08:53:44.776952: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at nccl_ops.cc:104 : INTERNAL: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2023-02-27 08:53:44.777276: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at nccl_ops.cc:104 : INTERNAL: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
Traceback (most recent call last):
File "/home/suryanarayanay/17603.py", line 39, in <module>
model.fit(
File "/home/suryanarayanay/.local/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/suryanarayanay/.local/lib/python3.10/site-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:
Detected at node 'NcclAllReduce' defined at (most recent call last):
File "/home/suryanarayanay/17603.py", line 39, in <module>
model.fit(
File "/home/suryanarayanay/.local/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
return fn(*args, **kwargs)
File "/home/suryanarayanay/.local/lib/python3.10/site-packages/keras/engine/training.py", line 1409, in fit
tmp_logs = self.train_function(iterator)
File "/home/suryanarayanay/.local/lib/python3.10/site-packages/keras/engine/training.py", line 1051, in train_function
return step_function(self, iterator)
File "/home/suryanarayanay/.local/lib/python3.10/site-packages/keras/engine/training.py", line 1040, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/home/suryanarayanay/.local/lib/python3.10/site-packages/keras/optimizers/optimizer_v2/utils.py", line 151, in _all_reduce_sum_fn
return distribution.extended.batch_reduce_to(tf.distribute.ReduceOp.SUM,
Node: 'NcclAllReduce'
Detected at node 'NcclAllReduce' defined at (most recent call last):
File "/home/suryanarayanay/17603.py", line 39, in <module>
model.fit(
File "/home/suryanarayanay/.local/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
return fn(*args, **kwargs)
File "/home/suryanarayanay/.local/lib/python3.10/site-packages/keras/engine/training.py", line 1409, in fit
tmp_logs = self.train_function(iterator)
File "/home/suryanarayanay/.local/lib/python3.10/site-packages/keras/engine/training.py", line 1051, in train_function
return step_function(self, iterator)
File "/home/suryanarayanay/.local/lib/python3.10/site-packages/keras/engine/training.py", line 1040, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/home/suryanarayanay/.local/lib/python3.10/site-packages/keras/optimizers/optimizer_v2/utils.py", line 151, in _all_reduce_sum_fn
return distribution.extended.batch_reduce_to(tf.distribute.ReduceOp.SUM,
Node: 'NcclAllReduce'
2 root error(s) found.
(0) INTERNAL: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
[[{{node NcclAllReduce}}]]
[[All_4/_128]]
(1) INTERNAL: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
[[{{node NcclAllReduce}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_2578]
(tf2.9) suryanarayanay@surya-ubuntu-22-04:~$
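The auto-sharding warning in the log above suggests switching the policy to DATA through a `tf.data.Options()` object. A minimal sketch of that suggestion follows. It assumes the training data is wrapped in a `tf.data.Dataset`; the toy tensors are hypothetical stand-ins for the real inputs. This only addresses the sharding warning and should not be expected to affect the NCCL error.

```python
import tensorflow as tf

# Hypothetical toy data standing in for the real training inputs.
features = tf.random.uniform((64, 8))
labels = tf.zeros((64, 1))
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(16)

# Ask tf.data to shard by DATA instead of trying FILE sharding first,
# as recommended by the warning message.
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = (
    tf.data.experimental.AutoShardPolicy.DATA
)
dataset = dataset.with_options(options)
```

The resulting `dataset` can then be passed to `model.fit` in place of raw arrays.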
@SuryanarayanaY let me know if you reproduced the above gist.
@markub3327 (cc. @SuryanarayanaY), thanks for checking. Looking at your gist, I found that compiling the functions with @tf.function makes it possible to run.
@tf.function
def result(self):
@tf.function
def update_state(self
@SuryanarayanaY Could you please explain why it is needed? Also note that compiling the functions with @tf.function like this is only required if you enable mixed precision (in TF 2.11). The reported error message is also too ambiguous to inspect.
File "/usr/local/lib/python3.8/dist-packages/tensorflow_addons/metrics/f_scores.py", line 170, in result
precision = tf.math.divide_no_nan(
File "/usr/local/lib/python3.8/dist-packages/keras/mixed_precision/autocast_variable.py", line 419, in __add__
return self.read_value() + o
File "/usr/local/lib/python3.8/dist-packages/keras/mixed_precision/autocast_variable.py", line 117, in read_value
val = self._variable.read_value()
NotImplementedError: call `variable.read_value()` inside variable_sync_on_read_context is not supported
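For readers following along, here is a minimal sketch of the workaround discussed in this thread: a custom metric whose `update_state` and `result` methods are decorated with `@tf.function`. The `BinaryF1` class, its threshold parameter, and its variable names are hypothetical and written only for illustration; this is not the tensorflow-addons implementation, and whether it avoids the `NotImplementedError` in a given mixed-precision, multi-GPU setup is based on the reports in this thread, not verified here.

```python
import tensorflow as tf


class BinaryF1(tf.keras.metrics.Metric):
    """Hypothetical binary F1 metric (illustration only, not the tfa code)."""

    def __init__(self, threshold=0.5, name="binary_f1", **kwargs):
        super().__init__(name=name, **kwargs)
        self.threshold = threshold
        # Scalar accumulators for the confusion-matrix counts.
        self.true_positives = self.add_weight(name="tp", initializer="zeros")
        self.false_positives = self.add_weight(name="fp", initializer="zeros")
        self.false_negatives = self.add_weight(name="fn", initializer="zeros")

    @tf.function  # the decoration reported in this thread as the workaround
    def update_state(self, y_true, y_pred, sample_weight=None):
        # sample_weight is accepted for API compatibility but ignored here.
        y_true = tf.cast(y_true, tf.float32)
        y_pred = tf.cast(tf.cast(y_pred, tf.float32) > self.threshold, tf.float32)
        self.true_positives.assign_add(tf.reduce_sum(y_true * y_pred))
        self.false_positives.assign_add(tf.reduce_sum((1.0 - y_true) * y_pred))
        self.false_negatives.assign_add(tf.reduce_sum(y_true * (1.0 - y_pred)))

    @tf.function  # same decoration on result()
    def result(self):
        precision = tf.math.divide_no_nan(
            self.true_positives, self.true_positives + self.false_positives
        )
        recall = tf.math.divide_no_nan(
            self.true_positives, self.true_positives + self.false_negatives
        )
        return tf.math.divide_no_nan(2.0 * precision * recall, precision + recall)


metric = BinaryF1()
metric.update_state([1, 1, 0, 0], [0.9, 0.2, 0.8, 0.1])
f1_value = float(metric.result())
```

In a training script, an instance of such a metric would be passed through `metrics=[...]` in `model.compile`, inside the distribution strategy scope.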
@innat
I used tf.function (i.e. compilation) because other metrics in Keras-CV use it. Thinking about it, it probably speeds up the metric computations. I can't see any other clear reason.
Please look here: https://github.com/keras-team/keras-cv/blob/master/keras_cv/metrics/coco/recall.py#L241 https://github.com/keras-team/keras-cv/blob/master/keras_cv/metrics/coco/recall.py#L124
@markub3327
Thanks for the hint. I'm not certain that tf.function should be used in metrics to boost speed. These metrics should always run in graph mode anyway, specifically inside the model.fit method. Also, judging by the source code, I don't think this is common practice in Keras metrics. AFAIK, keras-cv deliberately used it to speed up some metrics (recall, map) that are computationally expensive: https://github.com/keras-team/keras-cv/pull/1356.
Apart from this, as I reported, this NotImplementedError occurred with mixed precision in TF 2.11 and is resolved with tf.function (it's doable but somewhat of an anti-pattern). With TF 2.9 (as tested), compiling with tf.function is not required.
Hi @innat ,
The reported error is reproducible with TF 2.11; with TF 2.9 there is no such error. Please refer to the attached gist. The same problem also exists in tf-nightly (2.13.0-dev20230307).
This needs to be investigated. Thanks!
Adding Reed here who works on mixed precision. The F1 implementation looks normal to me in https://github.com/tensorflow/addons/blob/master/tensorflow_addons/metrics/f_scores.py
System information.
Describe the problem.
In TensorFlow 2.11, with mixed precision and a multi-GPU setup, the F1 metrics from tensorflow-addons don't work. With TensorFlow <= 2.9, they work as expected.