ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[CI] MacOS build/test outage in Feb '23 #33530

cadedaniel closed 8 months ago

cadedaniel commented 1 year ago

We had an outage in our macOS build/test fleet. I am not sure what action items we can take to fix or improve this for the future; we should hold a postmortem, or at least ask around, to see if there's anything we can do so this isn't as bad next time.


cadedaniel commented 1 year ago

FYI @krfricke @rkooo567 @justinvyu in case you have anything to add (ideas to improve for next time). Was there a postmortem for this?

rkooo567 commented 1 year ago

I think the main issue was that only Kai knows how to fix this (so when he's busy, or when the problem isn't communicated to him, we can't fix it). Also, the ownership of this infra for on-call engineers wasn't clear. I remember @zhe-thoughts said we would do a postmortem for this one.

zhe-thoughts commented 1 year ago

This should be on the backlog of @can-anyscale. Let's finish the current effort of GCE release testing and see which other P0/P1 issues should be picked up.

aslonnie commented 8 months ago

macOS tests are back online. Not sure how it was fixed, though.