Open crbellis opened 2 months ago
And a toy example here:
import ray
import tensorflow as tf
from ray.train.tensorflow import TensorflowTrainer
from ray.train import ScalingConfig
def build_model():
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(128, activation="relu"))
    model.add(tf.keras.layers.Dense(10))
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model

def train_func(config):
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        model = build_model()
    dataset = ray.train.get_dataset_shard("train")
    tf_dataset = dataset.to_tf(
        feature_columns="x", label_columns="y", batch_size=32
    )
    print("TF DATASET: ")
    print(tf_dataset)
    model.fit(tf_dataset, epochs=5)

train_dataset = ray.data.from_items([{"x": x / 10, "y": x % 10} for x in range(1000)])
scaling_config = ScalingConfig(num_workers=2, use_gpu=False)
trainer = TensorflowTrainer(
    train_loop_per_worker=train_func,
    datasets={"train": train_dataset},
    scaling_config=scaling_config,
)
results = trainer.fit()
print(results.metrics)
Error:
ValueError: Attempt to convert a value (PerReplica:{
0: <tf.Tensor: shape=(16,), dtype=float64, numpy=
array([0. , 0.1, 0.2, 0.3, 0.4, 1. , 1.1, 1.2, 1.3, 1.4, 2. , 2.1, 2.2,
2.3, 2.4, 3. ])>
}) with an unsupported type (<class 'tensorflow.python.distribute.values.PerReplica'>) to a Tensor.
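A side note on the shape in that error: the tensor has 16 rows although the toy example passes batch_size=32 to to_tf. That is consistent with the batch being treated as a global batch and split evenly across the 2 workers, though this is an assumption about what happens here, not something the traceback proves. The helper below is purely illustrative (it is not a TensorFlow or Ray API) and only shows the arithmetic:

```python
def per_replica_batch_size(global_batch_size: int, num_replicas: int) -> int:
    """Split a global batch evenly across replicas.

    Hypothetical helper for illustration only; not part of TensorFlow or Ray.
    """
    if num_replicas <= 0:
        raise ValueError("num_replicas must be positive")
    if global_batch_size % num_replicas != 0:
        raise ValueError("global batch size must divide evenly across replicas")
    return global_batch_size // num_replicas


# Values taken from the toy example: batch_size=32 in to_tf(),
# num_workers=2 in ScalingConfig.
print(per_replica_batch_size(32, 2))  # prints 16
```

If that reading is right, the (16,) shape is expected and the actual problem is only that the PerReplica wrapper reaches a code path that wants a plain Tensor.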
For more context, this issue occurred when running tensorflow==2.16.1. Bumping the version down to tensorflow==2.15.1 fixed this. It seems there is some compatibility issue with this TF version.
I'm seeing this issue in custom code using TensorflowTrainer and MultiWorkerMirroredStrategy.
versions:
# pip freeze | grep "tensor\|ray"
memray==1.14.0
ray==2.37.0
tensorboard==2.17.1
tensorboard-data-server==0.7.2
tensorboardX==2.6.2.2
tensorflow==2.17.0
tensorflow-addons==0.23.0
tensorflow-io-gcs-filesystem==0.37.1
@beck-weber-ing btw, I'm not seeing this on tensorflow==2.15.1. So (edit: it's been so long I forgot I already shared this, apologies!) something must've changed on tf side that is potentially breaking the ray trainer
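For anyone who wants to check whether their environment falls in the range reported here, a minimal sketch follows. The cutoff of 2.16 is an assumption inferred only from the versions mentioned in this thread (2.16.1 and 2.17.0 fail, 2.15.1 works); it is not a confirmed boundary. TensorFlow 2.16 is also the release that made Keras 3 the default Keras, which may be related, but nothing in this thread confirms that. The names FIRST_REPORTED_BAD, major_minor, and reported_affected are made up for this example.

```python
from importlib import metadata

# Assumption: cutoff inferred from reports in this thread only
# (2.16.1 and 2.17.0 fail, 2.15.1 works). Not a confirmed boundary.
FIRST_REPORTED_BAD = (2, 16)


def major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.16.1'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def reported_affected(tf_version: str) -> bool:
    """True if the version is at or above the first version reported failing."""
    return major_minor(tf_version) >= FIRST_REPORTED_BAD


if __name__ == "__main__":
    try:
        installed = metadata.version("tensorflow")
    except metadata.PackageNotFoundError:
        print("tensorflow is not installed in this environment")
    else:
        status = "reported affected" if reported_affected(installed) else "not reported affected"
        print(f"tensorflow=={installed}: {status}")
```

Pinning tensorflow==2.15.1 is the workaround reported above; this check only tells you which side of the reported range you are on.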
for me the error looks like this and happens upon calling model.fit (train.py:210):
ray.exceptions.RayTaskError(ValueError): ray::_Inner.train() (pid=25075, ip=10.164.0.45, actor_id=b5e4e61bdbcffb83e632876f20000000, repr=TensorflowTrainer)
  File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "/usr/local/lib/python3.10/dist-packages/ray/train/_internal/utils.py", line 57, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(ValueError): ray::_RayTrainWorker__execute.get_next() (pid=25120, ip=10.164.0.45, actor_id=9ccd80edf5393c37186db10920000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7d161324e2f0>)
  File "/usr/local/lib/python3.10/dist-packages/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/usr/local/lib/python3.10/dist-packages/ray/train/_internal/utils.py", line 176, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/workspace/train.py", line 210, in f
  File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/constant_op.py", line 108, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
ValueError: Attempt to convert a value (PerReplica:{
0: <tf.Tensor: shape=(1, 125, 21), dtype=float64, numpy=
array([[[........]]])>
}) with an unsupported type (<class 'tensorflow.python.distribute.values.PerReplica'>) to a Tensor.
What happened + What you expected to happen
I was trying to run this example from the documentation; however, it results in an error. I've tested this on two different clusters: one CPU-only with GPU set to false, the other a cluster of GPUs.
Tensorflow example here.
The error is the ValueError shown above: a PerReplica value cannot be converted to a Tensor.
I expected the sample from the docs to run successfully.
Versions / Dependencies
ray==2.30.0 python==3.11
Reproduction script
No changes made to this code from the doc.
Issue Severity
Medium: It is a significant difficulty but I can work around it.