cadedaniel closed this issue 8 months ago
FYI @krfricke @rkooo567 @justinvyu in case you have anything to add (ideas to improve for next time). Was there a postmortem for this?
I think the main issue was that only Kai knows how to fix this (so when he's busy, or when the problem isn't communicated to him, we can't fix it). Also, the ownership of this infra for on-call engineers wasn't clear. I remember @zhe-thoughts said we would do a postmortem for this one.
This should be on the backlog of @can-anyscale. Let's finish the current effort on GCE release testing and then see which other P0/P1 issues should be picked up.
macOS tests are back online. Not sure how it was fixed, though.
We had an outage in our macOS build/test fleet. I am not sure what action items we can take to fix or improve this for the future; we should do a postmortem, or at least ask around to see if there's anything we should do so this isn't as bad next time.