Open ywskycn opened 5 years ago
/cc @terrytangyuan
The feature LGTM.
lgtm, +1
Sounds great to me. This would be a good way to standardize metrics collection. We could also expose some utility methods that operators can use to collect operator-specific custom metrics, which leads to shared best practices and standards across operators.
Sounds great to me.
/cc @jlewi
Great, LGTM. One problem that I see is the limited information in Job Controller. If we design the common interfaces well, this is possible.
Sure. kubebuilder supports the feature, thus I think we can also implement it in common-operator if we design it well.
LGTM, this looks so good.
Any progress for this issue?
@yeya24 AFAIK, there is no one working on it now.
Hi all, I added a detailed outline of the Prometheus metrics we plan to cover in the common operator in https://github.com/kubeflow/common/pull/77. Please take a look; any feedback would be appreciated.
Proposal
Add generic metrics (jobs/pods/...) to the common operator, which can be directly enabled and used by operators built on top of the common operator.
Motivation
To track job-level metrics, we currently need to add Prometheus metric code inside each job operator. For example, to know how many TFJobs were created in the last hour, we need to add a Counter inside tf-operator. This need is very common and applies to different operators. As we're moving common code to the common operator, we could also add the metric-related code there, so that it can be used by all operators built on top of the common one.
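As a rough illustration of the shared Counter idea described above, here is a minimal sketch of a job counter labeled by job kind. It deliberately uses only the Go standard library as a stand-in for a real Prometheus client counter, so it is not the common operator's API. The names `JobCounter`, `NewJobCounter`, `jobs_created_total`, and the `job_kind` label are all hypothetical and chosen only for this example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// JobCounter is a hypothetical shared counter keyed by job kind (e.g. "TFJob").
// It is a dependency-free stand-in for a labeled Prometheus counter.
type JobCounter struct {
	mu     sync.Mutex
	name   string
	counts map[string]uint64
}

// NewJobCounter creates an empty counter with the given metric name.
func NewJobCounter(name string) *JobCounter {
	return &JobCounter{name: name, counts: make(map[string]uint64)}
}

// Inc records one created job of the given kind.
func (c *JobCounter) Inc(kind string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[kind]++
}

// Value returns the current count for a kind (0 if never incremented).
func (c *JobCounter) Value(kind string) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[kind]
}

// Render formats the counter as one line per kind, sorted by kind,
// in the style of the Prometheus text exposition format.
func (c *JobCounter) Render() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	kinds := make([]string, 0, len(c.counts))
	for k := range c.counts {
		kinds = append(kinds, k)
	}
	sort.Strings(kinds)
	var b strings.Builder
	for _, k := range kinds {
		fmt.Fprintf(&b, "%s{job_kind=%q} %d\n", c.name, k, c.counts[k])
	}
	return b.String()
}

func main() {
	// Each operator would call Inc from its own job-creation path.
	created := NewJobCounter("jobs_created_total")
	created.Inc("TFJob")
	created.Inc("TFJob")
	created.Inc("PyTorchJob")
	fmt.Print(created.Render())
}
```

Keeping the counter in one shared package and distinguishing operators by a label, rather than defining a separate counter per operator, is one way every operator could report the same metric name with no per-operator metric code.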
Details
For metric definition and registry, we will add a new `metrics` folder, and all metrics will be defined there. Some preliminary metrics include the number of jobs/pods/services created, durations for various operations, etc.
For metrics updating:
As the common project is still under active development, some of the details discussed above may change later. Comments are very much appreciated, @jlewi @richardsliu @gaocegege @jian-he.