keras-team / tf-keras

The TensorFlow-specific implementation of the Keras API, which was the default Keras from 2019 to 2023.

NotImplementedError: call `variable.read_value()` inside variable_sync_on_read_context is not supported #251

Closed innat closed 1 year ago

innat commented 1 year ago

System information.

Describe the problem.

In TensorFlow 2.11, with mixed precision and a multi-GPU setup, the F1 score metric from tensorflow-addons doesn't work.

from sklearn.datasets import make_classification
import tensorflow as tf
from tensorflow import keras
import tensorflow_addons as tfa

# Enable mixed precision and multi-GPU training.
keras.mixed_precision.set_global_policy("mixed_float16")
strategy = tf.distribute.MirroredStrategy()

# Toy binary classification data.
X, y = make_classification(
    n_classes=2,
    n_features=8,
    n_informative=8,
    n_redundant=0,
    random_state=42
)

with strategy.scope():
    model = keras.Sequential()
    model.add(keras.layers.Dense(64, input_dim=8, activation='relu'))
    model.add(keras.layers.Dense(1, activation='sigmoid'))
    model.compile(
        loss='binary_crossentropy',
        optimizer='adam',
        metrics=[
            tfa.metrics.F1Score(num_classes=1)  # fails in TF 2.11
        ]
    )

model.fit(
    X, y,
    epochs=5,
    batch_size=32 * strategy.num_replicas_in_sync,
    validation_split=0.1,
    verbose=1
)

NotImplementedError: in user code:

File "/opt/conda/lib/python3.7/site-packages/tensorflow_addons/metrics/f_scores.py", line 170, in result  *
    precision = tf.math.divide_no_nan(
File "/opt/conda/lib/python3.7/site-packages/keras/mixed_precision/autocast_variable.py", line 419, in __add__
    return self.read_value() + o
File "/opt/conda/lib/python3.7/site-packages/keras/mixed_precision/autocast_variable.py", line 117, in read_value
    val = self._variable.read_value()

NotImplementedError: call `variable.read_value()` inside variable_sync_on_read_context is not supported

But TensorFlow <= 2.9 works as expected.

innat commented 1 year ago

@sushreebarsa Let me know if you're able to reproduce the reported issue.

This is unexpected: it occurs in TF 2.11 but works in TF 2.9. Kaggle recently upgraded from TF 2.9 to 2.11, so code that ran fine last week on TF 2.9 is now breaking this week on TF 2.11. Damn!

SuryanarayanaY commented 1 year ago

Hi @innat ,

I have tested the code snippet you provided with TF 2.9, and I am getting a different error, related to NCCL all_reduce. Please check and confirm.

(tf2.9) suryanarayanay@surya-ubuntu-22-04:~$ python3 17603.py
2023-02-27 08:53:21.351970: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2.9.2
2023-02-27 08:53:30.379104: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-27 08:53:32.857054: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38251 MB memory:  -> device: 0, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:00:04.0, compute capability: 8.0
2023-02-27 08:53:32.862592: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 38251 MB memory:  -> device: 1, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:00:05.0, compute capability: 8.0
2023-02-27 08:53:34.510546: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:776] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Did not find a shardable source, walked to a node which is not a dataset: name: "FlatMapDataset/_9"
op: "FlatMapDataset"
input: "PrefetchDataset/_8"
attr {
  key: "Targuments"
  value {
    list {
    }
  }
}
attr {
  key: "_cardinality"
  value {
    i: -2
  }
}
attr {
  key: "f"
  value {
    func {
      name: "__inference_Dataset_flat_map_slice_batch_indices_277"
    }
  }
}
attr {
  key: "metadata"
  value {
    s: "\n\020FlatMapDataset:4"
  }
}
attr {
  key: "output_shapes"
  value {
    list {
      shape {
        dim {
          size: -1
        }
      }
    }
  }
}
attr {
  key: "output_types"
  value {
    list {
      type: DT_INT64
    }
  }
}
experimental_type {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_DATASET
    args {
      type_id: TFT_PRODUCT
      args {
        type_id: TFT_TENSOR
        args {
          type_id: TFT_INT64
        }
      }
    }
  }
}
. Consider either turning off auto-sharding or switching the auto_shard_policy to DATA to shard this dataset. You can do this by creating a new `tf.data.Options()` object then setting `options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA` before applying the options object to the dataset via `dataset.with_options(options)`.
Epoch 1/5
2023-02-27 08:53:44.776952: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at nccl_ops.cc:104 : INTERNAL: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2023-02-27 08:53:44.777276: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at nccl_ops.cc:104 : INTERNAL: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
Traceback (most recent call last):
  File "/home/suryanarayanay/17603.py", line 39, in <module>
    model.fit(
  File "/home/suryanarayanay/.local/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/suryanarayanay/.local/lib/python3.10/site-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:

Detected at node 'NcclAllReduce' defined at (most recent call last):
    File "/home/suryanarayanay/17603.py", line 39, in <module>
      model.fit(
    File "/home/suryanarayanay/.local/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/home/suryanarayanay/.local/lib/python3.10/site-packages/keras/engine/training.py", line 1409, in fit
      tmp_logs = self.train_function(iterator)
    File "/home/suryanarayanay/.local/lib/python3.10/site-packages/keras/engine/training.py", line 1051, in train_function
      return step_function(self, iterator)
    File "/home/suryanarayanay/.local/lib/python3.10/site-packages/keras/engine/training.py", line 1040, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/suryanarayanay/.local/lib/python3.10/site-packages/keras/optimizers/optimizer_v2/utils.py", line 151, in _all_reduce_sum_fn
      return distribution.extended.batch_reduce_to(tf.distribute.ReduceOp.SUM,
Node: 'NcclAllReduce'
Detected at node 'NcclAllReduce' defined at (most recent call last):
    File "/home/suryanarayanay/17603.py", line 39, in <module>
      model.fit(
    File "/home/suryanarayanay/.local/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/home/suryanarayanay/.local/lib/python3.10/site-packages/keras/engine/training.py", line 1409, in fit
      tmp_logs = self.train_function(iterator)
    File "/home/suryanarayanay/.local/lib/python3.10/site-packages/keras/engine/training.py", line 1051, in train_function
      return step_function(self, iterator)
    File "/home/suryanarayanay/.local/lib/python3.10/site-packages/keras/engine/training.py", line 1040, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/suryanarayanay/.local/lib/python3.10/site-packages/keras/optimizers/optimizer_v2/utils.py", line 151, in _all_reduce_sum_fn
      return distribution.extended.batch_reduce_to(tf.distribute.ReduceOp.SUM,
Node: 'NcclAllReduce'
2 root error(s) found.
  (0) INTERNAL:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[{{node NcclAllReduce}}]]
         [[All_4/_128]]
  (1) INTERNAL:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[{{node NcclAllReduce}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_2578]
(tf2.9) suryanarayanay@surya-ubuntu-22-04:~$ 
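(Side note: the AUTO sharding warning in the log above can be addressed the way the message itself suggests. A minimal sketch, assuming `dataset` is the input tf.data.Dataset being fed to the model:)

import tensorflow as tf

# Switch auto-sharding to the DATA policy, per the warning message.
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = (
    tf.data.experimental.AutoShardPolicy.DATA
)
dataset = dataset.with_options(options)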
innat commented 1 year ago

@SuryanarayanaY Thanks for checking. Please find the gist here.

innat commented 1 year ago

@SuryanarayanaY let me know if you were able to reproduce the issue with the above gist.

innat commented 1 year ago

@markub3327 (cc. @SuryanarayanaY), thanks for checking. Looking at your gist, I've found that compiling the metric's functions with @tf.function makes it possible to run:

@tf.function
def result(self):
    ...  # existing body in f_scores.py unchanged

@tf.function
def update_state(self, y_true, y_pred, sample_weight=None):
    ...  # existing body in f_scores.py unchanged

@SuryanarayanaY Could you please explain why this is needed? Also note that compiling functions with @tf.function like this is only required if mixed precision is enabled (in TF 2.11). The reported error message is also too ambiguous to inspect.

File "/usr/local/lib/python3.8/dist-packages/tensorflow_addons/metrics/f_scores.py", line 170, in result  *
  precision = tf.math.divide_no_nan(

File "/usr/local/lib/python3.8/dist-packages/keras/mixed_precision/autocast_variable.py", line 419, in add return self.read_value() + o File "/usr/local/lib/python3.8/dist-packages/keras/mixed_precision/autocast_variable.py", line 117, in read_value val = self._variable.read_value() NotImplementedError: call variable.read_value() inside variable_sync_on_read_context is not supported

markub3327 commented 1 year ago

@innat

I used tf.function (i.e. compilation) because other metrics in Keras-CV use it. Thinking about it, it probably speeds up the metric computations. Any other reason is unclear to me.

Please look here: https://github.com/keras-team/keras-cv/blob/master/keras_cv/metrics/coco/recall.py#L241 https://github.com/keras-team/keras-cv/blob/master/keras_cv/metrics/coco/recall.py#L124

innat commented 1 year ago

@markub3327

Thanks for the hint. I'm not certain that tf.function is used in metrics to boost speed. These metrics should always run in graph mode anyway, specifically inside the model.fit method. Also, I don't think this is practiced in Keras metrics if we look at the source code. AFAIK, keras-cv deliberately used it to speed up some metrics (recall, mAP) that are computationally expensive, https://github.com/keras-team/keras-cv/pull/1356.

Apart from this, as I reported, this NotImplementedError occurred with mixed precision in TF 2.11 and is resolved with tf.function (doable, but somewhat of an anti-pattern). With TF 2.9 (as tested), compiling with tf.function is not required.

SuryanarayanaY commented 1 year ago

Hi @innat ,

The reported error is reproducible with TF 2.11, while with TF 2.9 there is no such error. Please refer to the attached gist. The same problem also exists in tf-nightly (2.13.0-dev20230307).

Needs to be investigated. Thanks!

qlzh727 commented 1 year ago

Adding Reed here, who works on mixed precision. The F1 implementation in https://github.com/tensorflow/addons/blob/master/tensorflow_addons/metrics/f_scores.py looks normal to me.