[Bug] Serialization error during model training

Tixxx commented 2 years ago

Search before asking

[X] I searched the issues and found no similar issues.

Ray Component

Ray Core

What happened + What you expected to happen

Symptoms: Running model training using remote ray executors fails with strange errors like "KeyError" or "AttributeError" when the object indeed has the key or attribute. Example stack trace:

2022-02-07 16:24:40,098 ERROR serialization.py:270 -- 'tensorflow.keras.backend' (pid=62791) Traceback (most recent call last): (pid=62791) File "/Users/tix/Uber/ml-code/env/py369/lib/python3.6/site-packages/ray/serialization.py", line 268, in deserialize_objects (pid=62791) obj = self._deserialize_object(data, metadata, object_ref) (pid=62791) File "/Users/tix/Uber/ml-code/env/py369/lib/python3.6/site-packages/ray/serialization.py", line 191, in _deserialize_object (pid=62791) return self._deserialize_msgpack_data(data, metadata_fields) (pid=62791) File "/Users/tix/Uber/ml-code/env/py369/lib/python3.6/site-packages/ray/serialization.py", line 169, in _deserialize_msgpack_data (pid=62791) python_objects = self._deserialize_pickle5_data(pickle5_data) (pid=62791) File "/Users/tix/Uber/ml-code/env/py369/lib/python3.6/site-packages/ray/serialization.py", line 159, in _deserialize_pickle5_data (pid=62791) obj = pickle.loads(in_band) (pid=62791) File "/Users/tix/Uber/ml-code/env/py369/lib/python3.6/site-packages/tensorflow_core/python/util/module_wrapper.py", line 240, in setstate (pid=62791) self.init(sys.modules[d]._tfmw_wrapped_module, (pid=62791) KeyError: 'tensorflow.keras.backend'

Expected behavior: Running the repro script should pass without any error.

Versions / Dependencies

python: 3.6.9 OS: Ubuntu 16

Package                           Version
--------------------------------- -----------
absl-py                           1.0.0
aiohttp                           3.8.1
aiosignal                         1.2.0
appdirs                           1.4.4
appnope                           0.1.2
astor                             0.8.1
astroid                           2.9.3
async-timeout                     4.0.2
asynctest                         0.13.0
attrs                             21.2.0
avro-python3                      1.9.1
backcall                          0.2.0
backports.entry-points-selectable 1.1.0
bcrypt                            3.2.0
black                             21.12b0
boto                              2.49.0
boto3                             1.18.21
botocore                          1.21.21
cachetools                        4.2.2
certifi                           2021.5.30
cffi                              1.14.6
charset-normalizer                2.0.4
click                             8.0.3
cloudpickle                       1.6.0
colorama                          0.4.4
comet-ml                          3.14.0
configobj                         5.0.6
contextlib2                       21.6.0
coverage                          5.5
crcmod                            1.7
cryptography                      3.4.7
cycler                            0.10.0
Cython                            0.29.24
dataclasses                       0.8
decorator                         5.0.9
Deprecated                        1.2.13
dill                              0.3.4
diskcache                         5.2.1
distlib                           0.3.3
docopt                            0.6.2
drogon-cli                        0.2.9
dulwich                           0.20.24
everett                           2.0.0
execnet                           1.9.0
filelock                          3.2.0
fire                              0.2.1
flake8                            3.9.2
flipr-client                      16.1.2
frozenlist                        1.2.0
fsspec                            2021.7.0
future                            0.16.0
futures                           3.1.1
gast                              0.2.2
gitdb                             4.0.7
GitPython                         3.1.3
google-api-core                   1.31.1
google-api-python-client          2.15.0
google-auth                       1.34.0
google-auth-httplib2              0.1.0
google-auth-oauthlib              0.4.6
google-cloud-core                 1.7.2
google-cloud-storage              1.42.0
google-crc32c                     1.1.2
google-pasta                      0.2.0
google-resumable-media            1.3.3
googleapis-common-protos          1.53.0
grako                             3.6.7
grpcio                            1.29.0
grpcio-tools                      1.29.0
h5py                              2.10.0
heatpipe                          1.0.9
horovod                           0.23.0
httplib2                          0.19.1
humanize                          3.11.0
idna                              3.2
idna-ssl                          1.1.0
importlib-metadata                4.6.4
importlib-resources               5.2.2
iniconfig                         1.1.1
ipaddress                         1.0.23
ipython                           7.9.0
ipython-genutils                  0.2.0
isort                             5.10.1
jedi                              0.18.0
Jinja2                            3.0.1
jmespath                          0.10.0
joblib                            1.0.1
jsonschema                        3.2.0
kafka-rest-py                     0.3.4
kazoo                             2.8.0
Keras                             2.3.0
Keras-Applications                1.0.8
Keras-Preprocessing               1.1.2
kiwisolver                        1.3.1
lazy-object-proxy                 1.7.1
m3                                4.3.1
Markdown                          3.3.4
MarkupSafe                        2.0.1
matplotlib                        3.3.4
mccabe                            0.6.1
mock                              4.0.3
msgpack                           1.0.3
multidict                         5.2.0
mypy-extensions                   0.4.3
networkx                          2.2
neuropod                          0.3.0rc4
numpy                             1.19.5
opentracing                       2.4.0
opentracing-instrumentation       3.3.1
opt-einsum                        3.3.0
packaging                         21.0
pandas                            1.1.5
paramiko                          2.7.2
parso                             0.8.2
pathspec                          0.9.0
peloton-client                    0.9.14
pep517                            0.12.0
petastorm                         0.11.2
pexpect                           4.8.0
pickle5                           0.0.12
pickleshare                       0.7.5
Pillow                            8.3.1
pip                               21.3.1
pip-tools                         6.4.0
platformdirs                      2.4.0
pluggy                            1.0.0
ply                               3.11
prompt-toolkit                    2.0.10
protobuf                          3.17.3
psutil                            5.8.0
PTable                            0.9.2
ptyprocess                        0.7.0
py                                1.10.0
py4j                              0.10.7
pyarrow                           4.0.1
pyasn1                            0.4.8
pyasn1-modules                    0.2.8
pycodestyle                       2.7.0
pycparser                         2.20
pyDeprecate                       0.3.0
pyfakefs                          4.5.4
pyflakes                          2.3.1
Pygments                          2.10.0
PyJWT                             1.7.1
pylint                            2.12.2
PyNaCl                            1.4.0
pyparsing                         2.4.7
pyrsistent                        0.18.0
PySocks                           1.7.1
pyspark                           2.4.3
pytest                            6.2.5
pytest-forked                     1.4.0
pytest-xdist                      2.5.0
python-dateutil                   2.8.2
python-json-logger                2.0.2
pytorch-lightning                 1.3.8
pytz                              2021.1
PyYAML                            5.4
pyzmq                             22.2.1
ray                               2.0.0.dev0
redis                             4.1.2
remote-pdb                        2.1.0
requests                          2.26.0
requests-mock                     1.9.3
requests-oauthlib                 1.3.0
requests-toolbelt                 0.9.1
requirements-parser               0.2.0
retrying                          1.3.3
rsa                               4.7.2
rules-engine                      0.7.3
scikit-learn                      0.24.2
scipy                             1.5.4
semantic-version                  2.8.5
semver                            2.13.0
setuptools                        49.6.0
six                               1.16.0
sklearn                           0.0
smmap                             4.0.0
sqlparse                          0.4.1
subprocess32                      3.5.4
tabulate                          0.8.9
tchannel                          2.1.0
tensorboard                       2.8.0
tensorboard-data-server           0.6.1
tensorboard-plugin-wit            1.8.1
tensorflow                        1.15.3
tensorflow-estimator              1.15.1
threadloop                        1.0.2
threadpoolctl                     2.2.0
thriftrw                          1.8.1
toml                              0.10.2
tomli                             1.2.2
torch                             1.5.0
torchmetrics                      0.7.0
torchvision                       0.6.0
tornado                           4.5.3
tox                               3.24.4
tqdm                              4.62.3
traitlets                         4.3.3
treelib                           1.6.1
typed-ast                         1.5.2
typing-extensions                 3.10.0.0
uavro                             0.2.3
ujson                             4.0.2
uritemplate                       3.0.1
urllib3                           1.26.6
virtualenv                        20.8.1
watchdog                          2.1.3
wcwidth                           0.2.5
websocket-client                  1.2.1
wheel                             0.37.0
wrapt                             1.12.1
wurlitzer                         2.1.1
xgboost                           1.5.2
xgboost-ray                       0.1.5
yarl                              1.7.2
zipp                              3.5.0

Reproduction script

repro

Anything else

No response

Are you willing to submit a PR?

[ ] Yes I am willing to submit a PR!

Tixxx commented 2 years ago

@richardliaw

richardliaw commented 2 years ago


import sys
sys.path.append('.')  # adds local folder
import math
import horovod.tensorflow.keras as hvd
import horovod.spark.keras as hvd_keras
import functools
from functools import partial
import tensorflow as tf

import horovod.tensorflow.keras as hvd
import os
import glob
import tensorflow.keras.backend as K
import ray

import argparse

class LocalRayContext():
    '''
    Keep this non-client ray context here for two reasons:
    1. in certain circumstance, this way to access ray is faster than ray client
    2. without client layer, ray access is simpler and less port conflict risk, easier debug
    '''

    def __init__(self):
        # Start the Ray cluster or attach to an existing Ray cluster
        # will generate such error if shutdown before new ray init.
        # (pid=raylet) [2021-02-02 04:00:06,190 E 9518 9518] process.cc:498:
        # Failed to kill process 9577 with error system:3: No such process
        # ray.shutdown()
        # ray.init()
        # therefore instead we use ignore_reinit_error to avoid shutdown
        try:
            ray.init(address='auto', ignore_reinit_error=True)
        except Exception as e:
            print(f'LocalRayContext __init__ ray.init address=auto failed with error: {e}')
            try:
                ray.init(ignore_reinit_error=True)
            except Exception as e2:
                print(f'LocalRayContext __init__ ray.init failed again with error: {e2}')

    def run(self, fn, args=[], kwargs={}, env=None):

        @ray.remote(max_calls=1)
        def run_ray_remote(func):
            return func(None)

        ans = [run_ray_remote.remote(lambda w: fn(*args, **kwargs))]
        return ray.get(ans)

    def shutdown(self):
        ray.shutdown()

class LocalRayDLContext(LocalRayContext):

    def __init__(self, num_workers=1, optimizer='horovod'):
        super().__init__()
        if optimizer != 'horovod':
            raise Exception(
                'At the moment, synchronous SGD based on horovod '
                'is the only optimizer supported'
            )
        from horovod.ray import RayExecutor
        self.num_workers, self.cpus_per_worker = num_workers, 1
        self.ray_executor = RayExecutor(
            RayExecutor.create_settings(timeout_s=30),
            num_workers=self.num_workers,
            cpus_per_worker=self.cpus_per_worker,
            use_gpu=False,
            gpus_per_worker=0,
        )
        self.ray_executor.start()
        print(
            f'LocalRayDLContext initialized with 1 host and {self.num_workers} slot(s)')

    def num_processes(self):
        return self.num_workers * self.cpus_per_worker

    def run(self, fn, args=[], kwargs={}, env=None):
        """Executes the provided function on all workers.
        Args:
            fn: Target function that can be executed with arbitrary
                args and keyword arguments.
            args: List of arguments to be passed into the target function.
            kwargs: Dictionary of keyword arguments to be
                passed into the target function.
        Returns:
            Deserialized return values from the target function.
        """
        print(f'env is {env}, which will not be used in local ray context')
        return self.ray_executor.run(fn, args, kwargs)

    def shutdown(self):
        self.ray_executor.shutdown()
        super().shutdown()
        print('local ray context shutdown')

# ================ SUBCLASS NETWORK FOR TRAINING =====================
import horovod.spark.keras.util as keras_util

worker_count=4
def remote_trainer():
    def keras_fn():
        def fn():
            import tensorflow.keras as tf_keras
            return tf_keras
        return fn
    def horovod_fn():
        def fn():
            import horovod.tensorflow.keras as hvd
            return hvd
        return fn
    pin_gpu = hvd_keras.remote._pin_gpu_fn()

    get_keras = keras_fn()
    get_horovod = horovod_fn()
    def train_function(input):

        k = get_keras()
        float_type = tf.keras.backend.floatx()
        #float_type = K.floatx() -> this works

        print(f"horovod.spark.keras.util: using keras in tensorflow=={tf.__version__}")
        print(float_type)
        hvd = get_horovod()
        hvd.init()
        pin_gpu(hvd, tf, k)
        y_true = tf.random.uniform([2, 3])
        y_pred = tf.random.uniform([2, 3])
        ce = K.binary_crossentropy(y_true, y_pred, from_logits=False)
        print(ce)
        return 0
    return train_function

################Use ray################
def get_local_ray_ctx():
    print("use not ray client based horovod ray executor")
    return LocalRayDLContext(worker_count)

backend = get_local_ray_ctx()

trainer = remote_trainer()

handle = backend.run(trainer, args=([2]))

print("######finish training")

richardliaw commented 2 years ago

I can't reproduce the error, it works well for me.

@Tixxx can you try making this change:


    get_keras = keras_fn()
    get_horovod = horovod_fn()
    def train_function(input):

        k = get_keras()
+      import tensorflow as tf
        float_type = tf.keras.backend.floatx()
        #float_type = K.floatx() -> this works

Tixxx commented 2 years ago

It also fails for me with the same error. The only way to make it work for me is to import tensorflow.keras beforehand. keras module in tensorflow is lazily loaded, does that impat serialization in any way?

richardliaw commented 2 years ago

ah yeah that should work. i suspect theres some hidden global state that keras manipulates.

Tixxx commented 2 years ago

We got another occurrence where it's now complaining about verion attribute missing from tensorflow, it repros for me on both linux and mac with this train_function:

    def train_function(input):
        print(f"using keras in tensorflow=={tf.__version__}")
        y_true = tf.random.uniform([2, 3])
        y_pred = tf.random.uniform([2, 3])
        ce = tf.keras.losses.binary_crossentropy(y_true, y_pred, from_logits=False)
        print(ce)
        return 0

stack trace: File "/Users/tix/Uber/ml-code/env/py369/lib/python3.6/site-packages/horovod/ray/runner.py", line 507, in run return ray.get(self.run_remote(fn, args, kwargs)) File "/Users/tix/Uber/ml-code/env/py369/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper return func(*args, *kwargs) File "/Users/tix/Uber/ml-code/env/py369/lib/python3.6/site-packages/ray/worker.py", line 1624, in get raise value.as_instanceof_cause() ray.exceptions.RayTaskError(AttributeError): ray::BaseHorovodWorker.execute() (pid=12749, ip=127.0.0.1, repr=<horovod.ray.worker.BaseHorovodWorker object at 0x7fb8e84457f0>) File "/Users/tix/Uber/ml-code/env/py369/lib/python3.6/site-packages/horovod/ray/runner.py", line 529, in worker.execute.remote(lambda w: fn(args, **kwargs)) File "ray_repro.py", line 107, in train_function print(f"using keras in tensorflow=={tf.version}") File "/Users/tix/Uber/ml-code/env/py369/lib/python3.6/site-packages/tensorflow_core/python/util/module_wrapper.py", line 193, in getattr attr = getattr(self._tfmw_wrapped_module, name) AttributeError: module 'tensorflow' has no attribute 'version'

I think we really need to get a repro to have a proper root cause since this happens randomly every time we add new layers or losses. Could you try this on the uber laptop to see if it repros? Package versions are still the same as the ones in the dependency section.

richardliaw commented 2 years ago

TJ, I think this is a known issue with Ray.

The way to fix it is to make sure tensorflow is always imported in the train_function.

For example, try “import tensorflow as tf” INSIDE train_function before the tf.version check.

Please let me know if that works.

On Wed, Feb 9, 2022 at 10:25 AM TJ Xu @.***> wrote:

We got another occurrence where it's now complaining about verion attribute missing from tensorflow, it repros for me on both linux and mac with this train_function: def train_function(input): print(f"using keras in tensorflow=={tf.version}") ytrue = tf.random.uniform([2, 3]) ypred = tf.random.uniform([2, 3]) ce = tf_.keras.losses.binary_crossentropy(y_true, y_pred, from_logits=False) print(ce) return 0

stack trace: File "/Users/tix/Uber/ml-code/env/py369/lib/python3.6/site-packages/horovod/ray/runner.py", line 507, in run return ray.get(self.run_remote(fn, args, kwargs)) File "/Users/tix/Uber/ml-code/env/py369/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper return func(*args, *kwargs) File "/Users/tix/Uber/ml-code/env/py369/lib/python3.6/site-packages/ray/worker.py", line 1624, in get raise value.as_instanceof_cause() ray.exceptions.RayTaskError(AttributeError): ray::BaseHorovodWorker.execute() (pid=12749, ip=127.0.0.1, repr=<horovod.ray.worker.BaseHorovodWorker object at 0x7fb8e84457f0>) File "/Users/tix/Uber/ml-code/env/py369/lib/python3.6/site-packages/horovod/ray/runner.py", line 529, in worker.execute.remote(lambda w: fn(args, *kwargs)) File "ray_repro.py", line 107, in train_function print(f"using keras in tensorflow=={tf.version}") File "/Users/tix/Uber/ml-code/env/py369/lib/python3.6/site-packages/tensorflow_core/python/util/module_wrapper.py", line 193, in getattr attr = getattr(self._tfmw_wrapped_module, name) AttributeError: module 'tensorflow' has no attribute 'version*'

I think we really need to get a repro to have a proper root cause since this happens randomly every time we add new layers or losses. Could you try this on the uber laptop to see if it repros? Package versions are still the same as the ones in the dependency section.

— Reply to this email directly, view it on GitHub https://github.com/ray-project/ray/issues/22195#issuecomment-1034064595, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABCRZZISTHFCF5UCGOV3MHDU2KWQHANCNFSM5NY6ZOOQ . You are receiving this because you were mentioned.Message ID: @.***>

Tixxx commented 2 years ago

Yea this works for me. It's just a bit difficult to enforce other users to do the same when they develop models on our platform. Do you know if this known issue with tf is tracked anywhere?

ray-project / ray