opendatahub-io / odh-dashboard

Dashboard for ODH
Apache License 2.0

Update SLIs/SLOs for Dashboard component #1747

Closed maryfrances01 closed 8 months ago

maryfrances01 commented 1 year ago

We need to update the SLIs/SLOs for all components. This issue is for updating the RHODS Dashboard component.

This issue will require two steps:

  1. Updating the SLIs/SLOs offline for your component.
  2. Meeting with other component leads/QE and discussing/refining the SLIs/SLOs (I'll schedule the meeting once all components are updated).

I will share the document that needs to be updated once this issue is assigned.

If this could be completed in the next couple of weeks, that would be great.

lucferbux commented 1 year ago

@maryfrances01 Hi! I've just reviewed https://github.com/opendatahub-io/odh-dashboard/issues/1702 and it seems the old SLOs are quite outdated; we have a lot of new components now. Our main problem is that most of our issues are now raised in the browser rather than in the pods. That's something we might want to discuss. cc @andrewballantyne @christianvogt @alexcreasy

maryfrances01 commented 1 year ago

Hi @lucferbux thanks for looking at 1702! That issue and this one are separate though.

1702 is about reviewing the SOPs that SREs use when we are alerted on issues. We are hoping to move them to KCS articles so that CEE could use them as well.

This issue is for updating the SLIs/SLOs. You can access the link for updating the SLIs/SLOs here: https://docs.google.com/document/d/193TCezqdJogX1PvuDGAQDjhyEMv3DeZEpVeDrhenj4k/edit#heading=h.6y6mpgy6grs2

Let me know if you have any questions about the SLIs/SLOs.

lucferbux commented 1 year ago

Ok, yes, I'm sorry, I got confused.

@andrewballantyne I can take the lead here, but I would love to have some discussion about it. I think several people should review this and share ideas, given our past experiences with 500 errors and pings to SREs. We can plan this for next sprint and sync.

maryfrances01 commented 1 year ago

@lucferbux and @andrewballantyne

Yes, this is just a first pass at updating the SLIs/SLOs. Once everyone has updated the SLIs/SLOs for their components, and everyone has had a chance to comment on the document, we will all meet and discuss how they can be improved further, if needed.

lucferbux commented 1 year ago

Yes, an update: some of the User Journeys are out of scope right now for the dashboard. We mostly use client-side requests, so we don't have observability in our pod. We can take a further look at that; I'm sure there's a way to monitor pass-through requests, but the current setup with Prometheus will not handle that. cc @andrewballantyne @alexcreasy @christianvogt any thoughts here?
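
To make the gap concrete, here is a minimal sketch of what a per-journey SLI could look like if request outcomes were counted where they actually happen. Everything in it is an assumption for illustration: `JourneySliTracker`, `Outcome`, the journey names, and the choice to treat only 5xx responses as failures are hypothetical and do not exist in odh-dashboard or in the current Prometheus setup.

```typescript
// Hypothetical sketch: none of these names exist in odh-dashboard.
// It counts request outcomes per User Journey and derives a success-ratio SLI.

type Outcome = { journey: string; status: number };

class JourneySliTracker {
  private counts = new Map<string, { total: number; failed: number }>();

  // Record one completed request. Assumption: only 5xx responses count
  // against the SLI; 4xx responses are treated as user errors.
  record({ journey, status }: Outcome): void {
    const entry = this.counts.get(journey) ?? { total: 0, failed: 0 };
    entry.total += 1;
    if (status >= 500) {
      entry.failed += 1;
    }
    this.counts.set(journey, entry);
  }

  // Fraction of successful requests for a journey, or undefined if no
  // requests have been recorded for it yet.
  successRatio(journey: string): number | undefined {
    const entry = this.counts.get(journey);
    if (!entry || entry.total === 0) {
      return undefined;
    }
    return (entry.total - entry.failed) / entry.total;
  }

  // Compare the measured ratio against an SLO target such as 0.99.
  meetsSlo(journey: string, target: number): boolean | undefined {
    const ratio = this.successRatio(journey);
    return ratio === undefined ? undefined : ratio >= target;
  }
}
```

A tracker like this would still need a way to ship its counts somewhere Prometheus can scrape, which is exactly the part the current setup does not cover.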

andrewballantyne commented 1 year ago

We have a record-based logging system (outside of Pod logs) that Landon put together for adminActivity. But we don't have general information such as requests for all users, or any observability beyond whatever OpenShift Console naturally provides on Pods.

What kind of observability do we need?

lucferbux commented 1 year ago

We can have a conversation about this. As the Dashboard is mainly a reflection of OpenShift and the rest of the components, I think more or less everything is covered, but we are adding plenty of edge cases that it would be great to cover (e.g. roles in projects and so on). I'm not sure how much of that is relevant enough, but take the User Journey "As a data scientist, I want to serve a model by creating a model server and then serving the saved model." as an example: logging failed rolebindings when enabling route security could be interesting there.

lucferbux commented 10 months ago

cc @dgutride

dgutride commented 8 months ago

We have migrated all work to Jira - please contact me offline to talk about how to move this forward if this is still a requirement or if any work remains on it.