ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Tensorflow] Trainer example does not run #47464

Open crbellis opened 2 months ago

crbellis commented 2 months ago

What happened + What you expected to happen

I tried to run this example from the documentation, but it fails with an error. I've tested this on two different clusters: one CPU-only with GPU set to false, the other a cluster of GPUs.

Tensorflow example here.

The error is:

ValueError: Attempt to convert a value (PerReplica:{
  0: <tf.Tensor: shape=(1,), dtype=int64, numpy=array([1])>
}) with an unsupported type (<class 'tensorflow.python.distribute.values.PerReplica'>) to a Tensor.

I expected the sample from the docs to run successfully.

Versions / Dependencies

ray==2.30.0 python==3.11

Reproduction script

No changes made to this code from the doc.

Issue Severity

Medium: It is a significant difficulty but I can work around it.

crbellis commented 1 month ago

Similar error from here...

crbellis commented 1 month ago

And a toy example here:

import ray
import tensorflow as tf
from ray.train.tensorflow import TensorflowTrainer
from ray.train import ScalingConfig

def build_model():
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(128, activation="relu"))
    model.add(tf.keras.layers.Dense(10))
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model

def train_func(config):
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        model = build_model()

        dataset = ray.train.get_dataset_shard("train")
        tf_dataset = dataset.to_tf(
            feature_columns="x", label_columns="y", batch_size=32
        )
        print("TF DATASET: ")
        print(tf_dataset)

    model.fit(tf_dataset, epochs=5)

train_dataset = ray.data.from_items([{"x": x / 10, "y": x % 10} for x in range(1000)])
scaling_config = ScalingConfig(num_workers=2, use_gpu=False)

trainer = TensorflowTrainer(
    train_loop_per_worker=train_func,
    datasets={"train": train_dataset},
    scaling_config=scaling_config,
)

results = trainer.fit()
print(results.metrics)

Error:

ValueError: Attempt to convert a value (PerReplica:{
  0: <tf.Tensor: shape=(16,), dtype=float64, numpy=
array([0. , 0.1, 0.2, 0.3, 0.4, 1. , 1.1, 1.2, 1.3, 1.4, 2. , 2.1, 2.2,
       2.3, 2.4, 3. ])>
}) with an unsupported type (<class 'tensorflow.python.distribute.values.PerReplica'>) to a Tensor.

crbellis commented 1 month ago

For more context, this issue occurred when running tensorflow==2.16.1. Downgrading to tensorflow==2.15.1 fixed it, so there seems to be a compatibility issue with this TF version.
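The version observation above can be turned into an early check, so a job fails fast with a readable message instead of the PerReplica conversion error. This is a minimal sketch under assumptions: `is_affected` and `FIRST_FAILING` are hypothetical names, and the 2.16 cutoff is inferred only from the two versions reported here (2.15.1 works, 2.16.1 fails), not confirmed by the Ray or TensorFlow maintainers.

```python
# Assumption: the break starts at TensorFlow 2.16, inferred from this thread only.
FIRST_FAILING = (2, 16)


def parse_version(version: str) -> tuple[int, int]:
    """Return (major, minor) from a version string such as '2.16.1'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def is_affected(tf_version: str) -> bool:
    """True if tf_version is at or above the assumed first failing release."""
    return parse_version(tf_version) >= FIRST_FAILING


for v in ("2.15.1", "2.16.1"):
    print(v, "affected" if is_affected(v) else "ok")
```

Inside a training function, the same check could be applied to `tf.__version__` before the strategy is built.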

beck-weber-ing commented 1 month ago

Seeing this issue in custom code using TensorflowTrainer and MultiWorkerMirroredStrategy.

versions:

# pip freeze | grep "tensor\|ray"
memray==1.14.0
ray==2.37.0
tensorboard==2.17.1
tensorboard-data-server==0.7.2
tensorboardX==2.6.2.2
tensorflow==2.17.0
tensorflow-addons==0.23.0
tensorflow-io-gcs-filesystem==0.37.1

crbellis commented 1 month ago

@beck-weber-ing btw, I'm not seeing this on tensorflow==2.15.1 (edit: it's been so long I forgot I already shared this, apologies!). Something must have changed on the TF side that is potentially breaking the Ray trainer.
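On a multi-node cluster, the downgrade only helps if every worker runs the pinned version. One way to do that is Ray's `runtime_env` `pip` field. This is a rough sketch, not a confirmed fix: `pinned_runtime_env` is a hypothetical helper, and 2.15.1 is simply the version reported as working in this thread.

```python
def pinned_runtime_env(tf_version: str = "2.15.1") -> dict:
    """Build a Ray runtime_env dict that pins TensorFlow on each worker.

    Hypothetical helper; the default is the version reported as working here.
    """
    return {"pip": [f"tensorflow=={tf_version}"]}


print(pinned_runtime_env())
```

The result would be passed as `ray.init(runtime_env=pinned_runtime_env())` before constructing the `TensorflowTrainer`.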

beck-weber-ing commented 1 month ago

For me, the error looks like this and happens when calling model.fit (train.py:210):

ray.exceptions.RayTaskError(ValueError): ray::_Inner.train() (pid=25075, ip=10.164.0.45, actor_id=b5e4e61bdbcffb83e632876f20000000, repr=TensorflowTrainer)
  File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "/usr/local/lib/python3.10/dist-packages/ray/train/_internal/utils.py", line 57, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(ValueError): ray::_RayTrainWorker__execute.get_next() (pid=25120, ip=10.164.0.45, actor_id=9ccd80edf5393c37186db10920000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7d161324e2f0>)
  File "/usr/local/lib/python3.10/dist-packages/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/usr/local/lib/python3.10/dist-packages/ray/train/_internal/utils.py", line 176, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/workspace/train.py", line 210, in f
  File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/constant_op.py", line 108, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
ValueError: Attempt to convert a value (PerReplica:{
  0: <tf.Tensor: shape=(1, 125, 21), dtype=float64, numpy=
array([[[........]]])>
}) with an unsupported type (<class 'tensorflow.python.distribute.values.PerReplica'>) to a Tensor.