What I've done:

Made empty Cluster objects auto-removable: instead of being kept in memory indefinitely, an empty cluster is removed after a specific timeout, or can be overridden by a new cluster before that timeout fires. This helps with consecutive `ray down` and `ray up`.
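A minimal sketch of this auto-removal behaviour, assuming an in-memory registry keyed by cluster name; the names (`ClusterRegistry`, `EMPTY_CLUSTER_TTL`, `mark_empty`) are illustrative, not the real API:

```python
import asyncio

EMPTY_CLUSTER_TTL = 0.05  # seconds; illustrative only, the real timeout would be longer


class ClusterRegistry:
    """Hypothetical registry: empty clusters are removed after a timeout."""

    def __init__(self) -> None:
        self._clusters: dict[str, object] = {}
        self._removal_tasks: dict[str, asyncio.Task] = {}

    def add(self, name: str, cluster: object) -> None:
        # A new cluster overrides the old entry and cancels any pending removal.
        self._cancel_removal(name)
        self._clusters[name] = cluster

    def mark_empty(self, name: str) -> None:
        # Schedule removal instead of keeping the empty cluster forever.
        self._cancel_removal(name)
        self._removal_tasks[name] = asyncio.create_task(self._remove_later(name))

    async def _remove_later(self, name: str) -> None:
        await asyncio.sleep(EMPTY_CLUSTER_TTL)
        self._clusters.pop(name, None)
        self._removal_tasks.pop(name, None)

    def _cancel_removal(self, name: str) -> None:
        task = self._removal_tasks.pop(name, None)
        if task is not None:
            task.cancel()

    def __contains__(self, name: str) -> bool:
        return name in self._clusters


async def demo() -> tuple[bool, bool]:
    registry = ClusterRegistry()
    registry.add("ray-cluster", object())
    registry.mark_empty("ray-cluster")
    still_there = "ray-cluster" in registry    # removal scheduled, not yet fired
    await asyncio.sleep(EMPTY_CLUSTER_TTL * 4)
    removed = "ray-cluster" not in registry    # removed after the timeout
    return still_there, removed


print(asyncio.run(demo()))  # (True, True)
```

A `ray up` that arrives before the timeout goes through `add()` and cancels the pending removal, which is what makes quick `ray down` / `ray up` cycles work.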
Reworked `on_stop` events in `Cluster`, `ClusterNode`, and `ClusterNodeSidecar`. They now spawn separate asyncio tasks by default.
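Sketched below is what "spawn separate asyncio tasks" means in practice: each registered handler runs as its own task, so one slow handler does not serialize the stop sequence. The `Stoppable` class and handler names are hypothetical, not the real ray-on-golem API:

```python
import asyncio


class Stoppable:
    """Illustrative base: on_stop handlers run as separate asyncio tasks."""

    def __init__(self) -> None:
        self._on_stop_handlers = []

    def on_stop(self, handler) -> None:
        self._on_stop_handlers.append(handler)

    async def stop(self) -> None:
        # Spawn each handler as its own task instead of awaiting them inline;
        # they run concurrently, but stop() still waits for all of them.
        tasks = [asyncio.create_task(h()) for h in self._on_stop_handlers]
        await asyncio.gather(*tasks)


async def demo() -> list[str]:
    events: list[str] = []

    async def slow_handler():
        await asyncio.sleep(0.05)
        events.append("slow done")

    async def fast_handler():
        events.append("fast done")

    node = Stoppable()
    node.on_stop(slow_handler)
    node.on_stop(fast_handler)
    await node.stop()
    return events


# The fast handler finishes first despite being registered second,
# because the handlers run concurrently.
print(asyncio.run(demo()))  # ['fast done', 'slow done']
```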
Moved the monitor-check-failed logic from the individual sidecar to its parent, for better separation of concerns.
Fixed the default head node `priority_subnet_tag` not being handled correctly.
Changed site packages to be installed into a virtualenv that is copied over to the Golem volume, instead of being installed at the system level. This can take around 2 minutes on some providers, but it is necessary to keep pip working while we wait for more functionality in Golem volumes.
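The build-then-copy step can be sketched with the standard-library `venv` and `shutil` modules; the directory layout and the simulated "volume" path are assumptions for illustration, not the real provisioning code:

```python
import shutil
import tempfile
import venv
from pathlib import Path

# Sketch: build a virtualenv locally, then copy it onto a (here simulated)
# Golem volume so the environment, including pip, travels with the volume.
with tempfile.TemporaryDirectory() as tmp:
    venv_dir = Path(tmp) / "venv"
    volume_dir = Path(tmp) / "golem_volume" / "venv"  # stand-in for the real volume

    # with_pip=False keeps this sketch fast; the real setup needs pip included,
    # which is the slow part (around 2 minutes on some providers).
    venv.EnvBuilder(with_pip=False).create(venv_dir)

    # Copy the whole environment over to the volume in one go.
    shutil.copytree(venv_dir, volume_dir)

    ok = (volume_dir / "pyvenv.cfg").exists()

print(ok)  # True
```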
Bumped `total_budget` from 1 to 5.
Removed some leftovers from the workflows.
Notable remarks:
Consecutive `ray up` runs with the same cluster name but different configs result in Ray-specific errors; we decided not to implement any additional warnings/guards about that for now.