maryfrances01 closed this issue 8 months ago
@maryfrances01 Hi! I've just reviewed https://github.com/opendatahub-io/odh-dashboard/issues/1702 and it seems the old SLOs are quite outdated; we have a lot of new components right now. Our main problem here is that most of our issues are now raised in the browser and no longer in the pods. That's something we might want to discuss. cc @andrewballantyne @christianvogt @alexcreasy
Hi @lucferbux thanks for looking at 1702! That issue and this one are separate though.
1702 is about reviewing the SOPs that SREs use when we are alerted on issues. We are hoping to move them to KCS articles so that CEE can use them as well.
This issue is for updating the SLIs/SLOs. You can access the document for updating the SLIs/SLOs here: https://docs.google.com/document/d/193TCezqdJogX1PvuDGAQDjhyEMv3DeZEpVeDrhenj4k/edit#heading=h.6y6mpgy6grs2
Let me know if you have any questions about the SLIs/SLOs.
Ok, yes, I'm sorry, I got confused.
@andrewballantyne I can take the lead here, but I would love to have some discussions about it. I think several people should review this and share ideas, given our past experiences with 500 errors and pings to SREs. We can plan this for next sprint and sync.
@lucferbux and @andrewballantyne
Yes, this is just a first pass at updating the SLIs/SLOs. Once everyone has updated the SLIs/SLOs for their components and has had a chance to comment on the document, we will all meet and discuss how they can be improved further, if needed.
An update: some of the User Journeys are out of scope for the dashboard right now. We mostly use client-side requests, so we don't have observability in our pod. We can take a further look at that; I'm sure there's a way to monitor pass-through, but the current setup with Prometheus will not handle that. cc @andrewballantyne @alexcreasy @christianvogt any thoughts here?
We have a record-based logging system (outside of Pod logs) that Landon put together for adminActivity. But we don't have general information, like requests for all users, or other observability beyond whatever OpenShift Console naturally provides on Pods.
What kind of observability do we need?
We can have a conversation about this. As the Dashboard is mainly a reflection of OpenShift and the rest of the components, I think more or less everything is covered, but we are adding plenty of edge cases that would be great to have covered (e.g. roles in projects and so on). I'm not sure how much of that is relevant enough, but for example, for the "As a data scientist, I want to serve a model by creating a model server and then serving the saved model." User Journey, logging failed rolebindings when enabling route security could be interesting.
cc @dgutride
We have migrated all work to Jira. Please contact me offline to talk about how to move this forward if this is still a requirement or there is any remaining work on it.
We need to update the SLIs/SLOs for all components. This issue is for updating the RHODS Dashboard component.
This issue will require two steps:
I will share the document that needs to be updated once this issue is assigned.
If this could be completed in the next couple of weeks, that would be great.