kubeflow / common

Common APIs and libraries shared by other Kubeflow operator repositories.
Apache License 2.0

Proposal for exposing generic prometheus metrics in common operator #22

Open ywskycn opened 5 years ago

ywskycn commented 5 years ago

Proposal

Add generic metrics (jobs, pods, etc.) to the common operator, so that they can be enabled and used directly by operators built on the common operator.

Motivation

To track job-level metrics, we currently need to add Prometheus metric code inside each job operator. For example, to know how many TFJobs were created in the last hour, we need to add a Counter inside tf-operator. This need is very common and applies to different operators. As we're moving common code to the common operator, we could also add the metric-related code there, so that it can be used by all operators built on the common one.

Details

For metric definition and registration, we will add a new metrics folder, and all metrics will be defined there. Some preliminary metrics include the number of jobs/pods/services created, durations of various operations, etc.
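As an illustration only (this is not part of the proposal), a labeled "jobs created" counter of the kind described above might look like the sketch below. It uses only the Go standard library so that it stays self-contained; a real implementation would most likely use the Prometheus Go client instead. The type `CounterVec`, the metric name `jobs_created_total`, and the label `job_kind` are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// CounterVec is a hypothetical, minimal counter with a single label.
// It stands in for what a Prometheus client library would provide.
type CounterVec struct {
	mu     sync.Mutex
	name   string
	help   string
	label  string
	values map[string]float64
}

// NewCounterVec creates a counter with the given metric name, help text,
// and label name.
func NewCounterVec(name, help, label string) *CounterVec {
	return &CounterVec{name: name, help: help, label: label, values: map[string]float64{}}
}

// Inc adds one to the series identified by labelValue.
func (c *CounterVec) Inc(labelValue string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.values[labelValue]++
}

// Expose renders the counter in the Prometheus text exposition format,
// with series sorted by label value so the output is deterministic.
// Label values are quoted with %q, which is adequate for plain ASCII
// values such as job kinds.
func (c *CounterVec) Expose() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	keys := make([]string, 0, len(c.values))
	for k := range c.values {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n# TYPE %s counter\n", c.name, c.help, c.name)
	for _, k := range keys {
		fmt.Fprintf(&b, "%s{%s=%q} %g\n", c.name, c.label, k, c.values[k])
	}
	return b.String()
}

func main() {
	// The common operator would increment this wherever it creates a job,
	// so each job operator does not need its own counter.
	jobsCreated := NewCounterVec("jobs_created_total", "Number of jobs created.", "job_kind")
	jobsCreated.Inc("TFJob")
	jobsCreated.Inc("TFJob")
	jobsCreated.Inc("PyTorchJob")
	fmt.Print(jobsCreated.Expose())
}
```

Keeping the job kind as a label rather than in the metric name is one way to let a single definition in the common operator serve every operator built on it.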

For metrics updating:

As the common project is still under active development, some of the details discussed above may change later. Comments are very welcome, @jlewi @richardsliu @gaocegege @jian-he.

issue-label-bot[bot] commented 5 years ago

Issue-Label Bot is automatically applying the label feature_request to this issue, with a confidence of 0.93. Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback!


gaocegege commented 5 years ago

/cc @terrytangyuan

The feature LGTM.

jian-he commented 5 years ago

lgtm, +1

terrytangyuan commented 5 years ago

Sounds great to me. This would be a good way to standardize metrics collection. We could also expose some utility methods that operators can use to collect operator-specific custom metrics, which leads to shared best practices and standards across operators.
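One possible shape for such a utility, sketched here purely as an assumption, is a helper that builds metric names under a shared convention and rejects names Prometheus would not accept. The function `MetricName` and the `kubeflow_` prefix are hypothetical and do not come from this thread or from any existing library.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// validName matches the characters Prometheus allows in a metric name.
var validName = regexp.MustCompile(`^[a-zA-Z_:][a-zA-Z0-9_:]*$`)

// MetricName is a hypothetical helper that builds
// "kubeflow_<operator>_<name>" and validates the result, so that
// operator-specific metrics follow one naming convention.
func MetricName(operator, name string) (string, error) {
	full := fmt.Sprintf("kubeflow_%s_%s", operator, name)
	if !validName.MatchString(full) {
		return "", errors.New("invalid metric name: " + full)
	}
	return full, nil
}

func main() {
	// A valid operator-specific metric name.
	name, err := MetricName("tf_operator", "jobs_created_total")
	fmt.Println(name, err)

	// Hyphens are not allowed in metric names, so this returns an error.
	_, err = MetricName("tf-operator", "jobs_created_total")
	fmt.Println(err)
}
```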

richardsliu commented 5 years ago

Sounds great to me.

/cc @jlewi

johnugeorge commented 5 years ago

Great, LGTM. One problem that I see is the limited information available in the Job Controller. If we design the common interfaces well, this is possible.

gaocegege commented 5 years ago

> One problem that I see is the limited information in Job Controller. If we design the common interfaces well, this is possible.

Sure. kubebuilder supports the feature, thus I think we can also implement it in common-operator if we design it well.

merlintang commented 5 years ago

LGTM, this looks so good.

yeya24 commented 5 years ago

Any progress for this issue?

gaocegege commented 5 years ago

@yeya24 AFAIK, there is no one working on it now.

terrytangyuan commented 4 years ago

Hi all, I added a detailed outline of the Prometheus metrics we plan to cover in the common operator in https://github.com/kubeflow/common/pull/77. Please take a look; any feedback would be appreciated.