kubeflow / katib

Automated Machine Learning on Kubernetes
https://www.kubeflow.org/docs/components/katib
Apache License 2.0

TypeError: unary_unary() got an unexpected keyword argument '_registered_method' #2427

Open Electronic-Waste opened 2 months ago

Electronic-Waste commented 2 months ago

What happened?

When I run the following script:

import kubeflow.katib as katib

def train_mnist_model(parameters):
    import tensorflow as tf
    import kubeflow.katib as katib
    import numpy as np
    import logging

    logging.basicConfig(
        format="%(asctime)s %(levelname)-8s %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%SZ",
        level=logging.INFO,
    )
    logging.info("--------------------------------------------------------------------------------------")
    logging.info(f"Input Parameters: {parameters}")
    logging.info("--------------------------------------------------------------------------------------\n\n")

    # Get HyperParameters from the input params dict.
    lr = float(parameters["lr"])
    num_epoch = int(parameters["num_epoch"])

    # Set dist parameters and strategy.
    is_dist = parameters["is_dist"]
    num_workers = parameters["num_workers"]
    batch_size_per_worker = 64
    batch_size_global = batch_size_per_worker * num_workers
    strategy = tf.distribute.MultiWorkerMirroredStrategy(
        communication_options=tf.distribute.experimental.CommunicationOptions(
            implementation=tf.distribute.experimental.CollectiveCommunication.RING
        )
    )

    # Callback class for logging training.
    # Katib parses metrics in this format: <metric-name>=<metric-value>.
    class CustomCallback(tf.keras.callbacks.Callback):
        def on_epoch_end(self, epoch, logs=None):
            katib.report_metrics({
                "accuracy": logs["accuracy"],
                "loss": logs["loss"],
            })

    # Prepare MNIST Dataset.
    def mnist_dataset(batch_size):
        (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
        x_train = x_train / np.float32(255)
        y_train = y_train.astype(np.int64)
        train_dataset = (
            tf.data.Dataset.from_tensor_slices((x_train, y_train))
            .shuffle(60000)
            .repeat()
            .batch(batch_size)
        )
        return train_dataset

    # Build and compile CNN Model.
    def build_and_compile_cnn_model():
        model = tf.keras.Sequential(
            [
                tf.keras.layers.InputLayer(input_shape=(28, 28)),
                tf.keras.layers.Reshape(target_shape=(28, 28, 1)),
                tf.keras.layers.Conv2D(32, 3, activation="relu"),
                tf.keras.layers.Flatten(),
                tf.keras.layers.Dense(128, activation="relu"),
                tf.keras.layers.Dense(10),
            ]
        )
        model.compile(
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            optimizer=tf.keras.optimizers.SGD(learning_rate=lr),
            metrics=["accuracy"],
        )
        return model

    # Download Dataset.
    dataset = mnist_dataset(batch_size_global)

    # For dist strategy we should build model under scope().
    if is_dist:
        logging.info("Running Distributed Training")
        logging.info("--------------------------------------------------------------------------------------\n\n")
        with strategy.scope():
            model = build_and_compile_cnn_model()
    else:
        logging.info("Running Single Worker Training")
        logging.info("--------------------------------------------------------------------------------------\n\n")
        model = build_and_compile_cnn_model()

    # Start Training.
    model.fit(
        dataset,
        epochs=num_epoch,
        steps_per_epoch=70,
        callbacks=[CustomCallback()],
        verbose=0,
    )

# Set parameters with their distribution for HyperParameter Tuning with Katib.
parameters = {
    "lr": katib.search.double(min=0.1, max=0.2),
    "num_epoch": katib.search.int(min=10, max=15),
    "is_dist": False,
    "num_workers": 1
}

# Start the Katib Experiment.
katib_client = katib.KatibClient(namespace="kubeflow")
katib_client.tune(
    name="tune-mnist",
    objective=train_mnist_model, # Objective function.
    base_image="electronicwaste/tensorflow:git", # tensorflow/tensorflow:2.13.0 + git
    parameters=parameters, # HyperParameters to tune.
    algorithm_name="cmaes", # Algorithm to use.
    objective_metric_name="accuracy", # Katib is going to optimize "accuracy".
    additional_metric_names=["loss"], # Katib is going to collect these metrics in addition to the objective metric.
    max_trial_count=12, # Trial Threshold.
    parallel_trial_count=2,
    packages_to_install=["git+https://github.com/kubeflow/katib.git@master#subdirectory=sdk/python/v1beta1"],
    metrics_collector_config={"kind": "Push"},
)

The following error occurred:

Traceback (most recent call last):
  File "/tmp/tmp.fGitfCta5x/ephemeral_objective.py", line 97, in <module>
    train_mnist_model({'lr': '0.16377224201308005', 'num_epoch': '13', 'is_dist': False, 'num_workers': 1})
  File "/tmp/tmp.fGitfCta5x/ephemeral_objective.py", line 89, in train_mnist_model
    model.fit(
  File "/usr/local/lib/python3.8/dist-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/tmp/tmp.fGitfCta5x/ephemeral_objective.py", line 36, in on_epoch_end
    katib.report_metrics({
  File "/usr/local/lib/python3.8/dist-packages/kubeflow/katib/api/report_metrics.py", line 61, in report_metrics
    client = katib_api_pb2_grpc.DBManagerStub(channel)
  File "/usr/local/lib/python3.8/dist-packages/kubeflow/katib/katib_api_pb2_grpc.py", line 19, in __init__
    self.ReportObservationLog = channel.unary_unary(
TypeError: unary_unary() got an unexpected keyword argument '_registered_method'

What did you expect to happen?

Run without error.

Environment

Kubernetes version:

$ kubectl version
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.1

Katib controller version:

$ kubectl get pods -n kubeflow -l katib.kubeflow.org/component=controller -o jsonpath="{.items[*].spec.containers[*].image}"
docker.io/kubeflowkatib/katib-controller:lates

Katib Python SDK version:

$ pip show kubeflow-katib
Name: kubeflow-katib
Version: 0.17.0
Summary: Katib Python SDK for APIVersion v1beta1
Home-page: https://github.com/kubeflow/katib/tree/master/sdk/python/v1beta1
Author: Kubeflow Authors
Author-email: premnath.vel@gmail.com
License: Apache License Version 2.0
Location: /home/ws/miniconda3/envs/katib/lib/python3.10/site-packages
Requires: certifi, grpcio, kubernetes, protobuf, setuptools, six, urllib3
Required-by: 

Python Packages Version in the Training Container:

$ pip list
Package                      Version
---------------------------- --------------------
absl-py                      1.4.0
astunparse                   1.6.3
cachetools                   5.3.1
certifi                      2019.11.28
chardet                      3.0.4
dbus-python                  1.2.16
flatbuffers                  23.5.26
gast                         0.4.0
google-auth                  2.21.0
google-auth-oauthlib         1.0.0
google-pasta                 0.2.0
grpcio                       1.56.0
h5py                         3.9.0
idna                         2.8
importlib-metadata           6.7.0
keras                        2.13.1
kubeflow-katib               0.17.0
kubernetes                   30.1.0
libclang                     16.0.0
Markdown                     3.4.3
MarkupSafe                   2.1.3
numpy                        1.24.3
oauthlib                     3.2.2
opt-einsum                   3.3.0
packaging                    23.1
pip                          23.1.2
protobuf                     4.23.3
pyasn1                       0.5.0
pyasn1-modules               0.3.0
PyGObject                    3.36.0
python-apt                   2.0.1+ubuntu0.20.4.1
python-dateutil              2.9.0.post0
PyYAML                       6.0.2
requests                     2.22.0
requests-oauthlib            1.3.1
requests-unixsocket          0.2.0
rsa                          4.9
setuptools                   68.0.0
six                          1.14.0
tensorboard                  2.13.0
tensorboard-data-server      0.7.1
tensorflow-cpu               2.13.0
tensorflow-estimator         2.13.0
tensorflow-io-gcs-filesystem 0.32.0
termcolor                    2.3.0
typing_extensions            4.5.0
urllib3                      1.25.8
websocket-client             1.8.0
Werkzeug                     2.3.6
wheel                        0.40.0
wrapt                        1.15.0
zipp                         3.15.0

Impacted by this bug?

Give it a 👍 We prioritize the issues with most 👍

Electronic-Waste commented 2 months ago

FYR, I found a similar issue describing this error: https://github.com/open-telemetry/opentelemetry-python-contrib/issues/2483

It may be related to the grpcio version.
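
A minimal sketch of one way to check this hypothesis, assuming the cause is that the installed grpcio runtime does not accept the `_registered_method` keyword that the generated stub passes to `unary_unary()`. The helper `accepts_keyword` and the two stand-in channel classes are hypothetical and only illustrate the check; they are not part of grpcio or Katib.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    by_keyword = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    named = params.get(name)
    if named is not None and named.kind in by_keyword:
        return True
    # Otherwise the keyword is only accepted through a **kwargs catch-all.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-ins for an older and a newer channel signature.
class OldStyleChannel:
    def unary_unary(self, method, request_serializer=None,
                    response_deserializer=None):
        raise NotImplementedError


class NewStyleChannel:
    def unary_unary(self, method, request_serializer=None,
                    response_deserializer=None, _registered_method=False):
        raise NotImplementedError


old_ok = accepts_keyword(OldStyleChannel().unary_unary, "_registered_method")
new_ok = accepts_keyword(NewStyleChannel().unary_unary, "_registered_method")
print(old_ok, new_ok)
```

Inside the training container, the same helper could be pointed at `grpc.Channel.unary_unary` after `import grpc`. If it returns False there, the runtime is probably older than the one the generated `katib_api_pb2_grpc.py` expects.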

Electronic-Waste commented 2 months ago

However, it runs without error when I use tensorflow/tensorflow:2.17.0 as the base image in my Dockerfile to build a new training image:

FROM tensorflow/tensorflow:2.17.0

RUN apt-get -y update && \
    apt-get -y install git

But it fails when I use tensorflow/tensorflow:2.13.0, which is the base image we provide for users.

I think we should investigate this to ensure that the Push metrics collector works correctly. WDYT👀 @kubeflow/wg-automl-leads

Electronic-Waste commented 2 months ago

In tensorflow/tensorflow:2.17.0, the Python package versions are:

# pip list
Package                      Version
---------------------------- -------------
absl-py                      2.1.0
astunparse                   1.6.3
blinker                      1.4
cachetools                   5.5.0
certifi                      2024.7.4
charset-normalizer           3.3.2
cryptography                 3.4.8
dbus-python                  1.2.18
distro                       1.7.0
flatbuffers                  24.3.25
gast                         0.6.0
google-auth                  2.34.0
google-pasta                 0.2.0
grpcio                       1.64.1
h5py                         3.11.0
httplib2                     0.20.2
idna                         3.7
importlib-metadata           4.6.4
jeepney                      0.7.1
keras                        3.4.1
keyring                      23.5.0
kubeflow-katib               0.17.0
kubernetes                   30.1.0
launchpadlib                 1.10.16
lazr.restfulclient           0.14.4
lazr.uri                     1.0.6
libclang                     18.1.1
Markdown                     3.6
markdown-it-py               3.0.0
MarkupSafe                   2.1.5
mdurl                        0.1.2
ml-dtypes                    0.4.0
more-itertools               8.10.0
namex                        0.0.8
numpy                        1.26.4
oauthlib                     3.2.2
opt-einsum                   3.3.0
optree                       0.12.1
packaging                    24.1
pip                          24.1.2
protobuf                     4.25.3
pyasn1                       0.6.1
pyasn1_modules               0.4.1
Pygments                     2.18.0
PyGObject                    3.42.1
PyJWT                        2.3.0
pyparsing                    2.4.7
python-apt                   2.4.0+ubuntu3
python-dateutil              2.9.0.post0
PyYAML                       6.0.2
requests                     2.32.3
requests-oauthlib            2.0.0
rich                         13.7.1
rsa                          4.9
SecretStorage                3.3.1
setuptools                   70.3.0
six                          1.16.0
tensorboard                  2.17.0
tensorboard-data-server      0.7.2
tensorflow-cpu               2.17.0
tensorflow-io-gcs-filesystem 0.37.1
termcolor                    2.4.0
typing_extensions            4.12.2
urllib3                      2.2.2
wadllib                      1.3.6
websocket-client             1.8.0
Werkzeug                     3.0.3
wheel                        0.43.0
wrapt                        1.16.0
zipp                         1.0.0
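
To narrow down which packages differ between the two images, the two `pip list` outputs can be compared programmatically. The following is a rough sketch: the helper names `parse_pip_list` and `diff_versions` are made up for illustration, and the excerpts contain only a few rows copied from the listings in this thread.

```python
def parse_pip_list(text):
    """Parse `pip list` output into a {package-name: version} dict."""
    versions = {}
    for line in text.strip().splitlines():
        parts = line.split()
        # Skip the header row, the dashed separator, and malformed lines.
        if len(parts) != 2 or parts[0] == "Package" or set(parts[0]) == {"-"}:
            continue
        # Normalize names so that pyasn1_modules matches pyasn1-modules.
        versions[parts[0].lower().replace("_", "-")] = parts[1]
    return versions


def diff_versions(old, new):
    """Return {name: (old_version, new_version)} for shared packages that differ."""
    shared = sorted(old.keys() & new.keys())
    return {n: (old[n], new[n]) for n in shared if old[n] != new[n]}


# Excerpts of the two listings from this thread.
tf_2_13 = """
Package                      Version
---------------------------- --------------------
grpcio                       1.56.0
kubeflow-katib               0.17.0
protobuf                     4.23.3
"""

tf_2_17 = """
Package                      Version
---------------------------- -------------
grpcio                       1.64.1
kubeflow-katib               0.17.0
protobuf                     4.25.3
"""

changed = diff_versions(parse_pip_list(tf_2_13), parse_pip_list(tf_2_17))
for name, (old, new) in changed.items():
    print(f"{name}: {old} -> {new}")
```

For these excerpts, kubeflow-katib is identical in both images, while grpcio and protobuf differ, which fits the suspicion that the grpcio runtime version is involved.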

Electronic-Waste commented 2 months ago

/good-first-issue /remove-label lifecycle/needs-triage

google-oss-prow[bot] commented 2 months ago

@Electronic-Waste: This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.
