kubeflow / katib

Automated Machine Learning on Kubernetes
https://www.kubeflow.org/docs/components/katib
Apache License 2.0
1.48k stars 439 forks source link

[SDK] grpc-related bugs in Python SDK #2395

Closed Electronic-Waste closed 2 weeks ago

Electronic-Waste commented 1 month ago

What happened?

image

image

When I called report_metrics in the SDK built by myself, the first error above occurred.

After I modified grpc.beta.implementations.insecure_channel to grpc.insecure_channel, the second error above occurred.

What did you expect to happen?

After PR #2344 was merged to the main branch, beta_create_DBManager_stub was removed in api_pb2.py. But we did not make changes in the following part of codes so that the first error occurred https://github.com/kubeflow/katib/blob/a6c37e4f3a64ca7f626a4d80f1c3f88362bf03a4/sdk/python/v1beta1/kubeflow/katib/api/katib_client.py#L1288-L1294

https://github.com/kubeflow/katib/blob/a6c37e4f3a64ca7f626a4d80f1c3f88362bf03a4/sdk/python/v1beta1/kubeflow/katib/api/report_metrics.py#L66-L69

Also grpcio package after 1.0.0 has removed beta version. Sometimes we may get errors when we use beta version.

https://github.com/kubeflow/katib/blob/a6c37e4f3a64ca7f626a4d80f1c3f88362bf03a4/sdk/python/v1beta1/kubeflow/katib/api/report_metrics.py#L57-L59

https://github.com/kubeflow/katib/blob/a6c37e4f3a64ca7f626a4d80f1c3f88362bf03a4/sdk/python/v1beta1/kubeflow/katib/api/katib_client.py#L1284-L1286

I think we need to may some changes to Python SDK. WDYT👀 @andreyvelich @tenzen-y @johnugeorge

Environment

Kubernetes version:

Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.1

Katib controller version:

Built by myself

Katib Python SDK version:

(Also built by myself)
Name: kubeflow-katib
Version: 0.17.0
Summary: Katib Python SDK for APIVersion v1beta1
Home-page: https://github.com/kubeflow/katib/tree/master/sdk/python/v1beta1
Author: Kubeflow Authors
Author-email: premnath.vel@gmail.com
License: Apache License Version 2.0
Location: /home/xxx/.local/lib/python3.10/site-packages/kubeflow_katib-0.17.0-py3.10.egg
Requires: certifi, grpcio, kubernetes, protobuf, setuptools, six, urllib3
Required-by: 

Impacted by this bug?

Give it a 👍 We prioritize the issues with most 👍

Electronic-Waste commented 1 month ago

related PR: #2050

I think we also need to copy api_pb2_grpc.py to Python SDK.

Electronic-Waste commented 1 month ago

I encountered this bug while making the push-based metrics collection demo. This issue is also related to the project.

ref issue: #2340

tenzen-y commented 1 month ago

Could you implement the reproducible UTs or E2E test for the 'report_metrics? IIUC, report_metrics is added without any tests.

Electronic-Waste commented 1 month ago

Yes, I think we should also add some UTs or E2E tests for functions in katib_client.py like #2325 too.

I guess it will be better to describe the test case problem in another issue and open PRs to it.

WDYT👀 @tenzen-y @andreyvelich @johnugeorge @gaocegege

tenzen-y commented 1 month ago

Yes, I think we should also add some UTs or E2E tests for functions in katib_client.py like #2325 too.

I guess it will be better to describe the test case problem in another issue and open PRs to it.

WDYT👀 @tenzen-y @andreyvelich @johnugeorge @gaocegege

I have strong objections to evolving the report_metrics API without any tests since the report_metrics API is new one. The report_metrics can not be handled the same as other Python API.

Before we move forward, we should implement the UTs for the report_metrcs so that we prove that the new feature doesn't have regressions. We should not add new features without tests.

Whether the feature doesn't have potential regressions are big deal since Katib is not experimental project, Katib is production project.

andreyvelich commented 1 month ago

I agree with @tenzen-y, please let's work on the unit test for report_metrics API and for tune API first to avoid this problems in the first phase.

Electronic-Waste commented 1 month ago

@tenzen-y @andreyvelich UTs for report_metrics has been added by now.

I'll work on UTs for tune API after this issue is addressed.

andreyvelich commented 1 month ago

/remove-label lifecycle/needs-triage

google-oss-prow[bot] commented 1 month ago

@andreyvelich: The label(s) /remove-label lifecycle/needs-triage cannot be applied. These labels are supported: tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to [this](https://github.com/kubeflow/katib/issues/2395#issuecomment-2256643683): >/remove-label lifecycle/needs-triage Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
andreyvelich commented 1 month ago

/area sdk