operate-first / support

This repo should serve as a central source for users to raise issues/questions/requests for Operate First.
GNU General Public License v3.0

Curator/Monitoring - track resource utilization across projects #763

Closed tumido closed 2 years ago

tumido commented 3 years ago

Collaborate with Curator/Koku: create dashboards for per-cluster/per-project resource consumption

cc @durandom @gagansk

HumairAK commented 3 years ago

@tumido bump

gagansk commented 3 years ago

Can we have a call to discuss the Operate First requirements and what the Curator can do as of now? @HumairAK @tumido @durandom

tumido commented 3 years ago

Sure, @durandom's on PTO, feel free to schedule a meeting with me and @HumairAK :+1:

gagansk commented 3 years ago

A document has been created to identify the feature gaps. A PRD will be written for this purpose. We can proceed after @durandom and @hpdempsey are back from PTO.

sesheta commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

sesheta commented 2 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

gagansk commented 2 years ago

A new milestone has been created in Curator (#22) for a collaboration between Curator and Operate First.

The Curator team presented a demo of Trino and Superset dashboards in Operate First to @durandom and @hemajv.

HumairAK commented 2 years ago

/remove-lifecycle rotten

hpdempsey commented 2 years ago

The specific input we are looking for at this stage is what the current Operate First admins think would be useful types of reports (presumably in the form of the graphs, tables, etc. that we demonstrated with Trino/Superset) to help admins and operators carry out reporting and planning activities for the Operate First clusters. Curator reports are based on stored data, so they can cover short or long past time periods (e.g. past day, week or month) and can aggregate various OpenShift API measurements into different types of reports. We would like input on what (if any) reports we should produce regularly for the current admin group, and how the admins could easily access those reports as part of Operate First admin workflows.

durandom commented 2 years ago

@HumairAK this should be tied into https://github.com/orgs/operate-first/projects/33

@gagansk the most important information is resource requests (CPU/MEM) per namespace, to see the utilization of the environment. The next step would be grouping the namespaces by "team", e.g. ODH requests 33%.
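To make the requested report concrete, here is a minimal sketch of the aggregation described above: sum CPU/memory requests per namespace, roll the namespaces up by team, and express each team's CPU requests as a share of all requests. Everything in it is an assumption for illustration only: the sample numbers, the namespace names, the namespace-to-team mapping, and the function names are hypothetical and do not come from Curator or from the Operate First clusters. The share is also computed against total requests, not against cluster capacity.

```python
from collections import defaultdict

# Hypothetical sample data: (namespace, CPU request in cores, memory request in GiB).
POD_REQUESTS = [
    ("opendatahub", 2.0, 8.0),
    ("opendatahub", 1.0, 4.0),
    ("thoth-prod", 4.0, 16.0),
    ("curator", 2.0, 4.0),
]

# Hypothetical mapping from namespace to owning team.
NAMESPACE_TEAM = {
    "opendatahub": "ODH",
    "thoth-prod": "Thoth",
    "curator": "Curator",
}


def requests_by_namespace(pods):
    """Sum CPU and memory requests for each namespace."""
    totals = defaultdict(lambda: {"cpu": 0.0, "mem": 0.0})
    for namespace, cpu, mem in pods:
        totals[namespace]["cpu"] += cpu
        totals[namespace]["mem"] += mem
    return dict(totals)


def requests_by_team(ns_totals, ns_team):
    """Roll namespace totals up to teams; unmapped namespaces go to 'unassigned'."""
    totals = defaultdict(lambda: {"cpu": 0.0, "mem": 0.0})
    for namespace, usage in ns_totals.items():
        team = ns_team.get(namespace, "unassigned")
        totals[team]["cpu"] += usage["cpu"]
        totals[team]["mem"] += usage["mem"]
    return dict(totals)


def cpu_share(team_totals):
    """Return each team's CPU requests as a percentage of all CPU requests."""
    total = sum(usage["cpu"] for usage in team_totals.values())
    if total == 0:
        return {team: 0.0 for team in team_totals}
    return {team: 100.0 * usage["cpu"] / total for team, usage in team_totals.items()}


if __name__ == "__main__":
    per_namespace = requests_by_namespace(POD_REQUESTS)
    per_team = requests_by_team(per_namespace, NAMESPACE_TEAM)
    for team, share in sorted(cpu_share(per_team).items()):
        print(f"{team}: {share:.0f}% of requested CPU")
```

In a real report the input rows would come from whatever stored measurements are available rather than a hard-coded list; the grouping logic would stay the same.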

HumairAK commented 2 years ago

Yeah, we went over some of our existing tools to monitor these items with @leihchen. Some examples of current Grafana dashboards we use for CPU/mem monitoring:

hpdempsey commented 2 years ago

We are already doing the type of reports suggested by @durandom, but we will polish them up a bit and make them easier to find for the OF ops folks. We are not trying to duplicate the Grafana dashboards.

sesheta commented 2 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

sesheta commented 2 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

sesheta commented 2 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

/close

sesheta commented 2 years ago

@sesheta: Closing this issue.

In response to [this](https://github.com/operate-first/support/issues/763):

>Rotten issues close after 30d of inactivity.
>Reopen the issue with `/reopen`.
>Mark the issue as fresh with `/remove-lifecycle rotten`.
>
>/close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.