run-house / runhouse

Dispatch and distribute your ML training to "serverless" clusters in Python, like PyTorch for ML infra. Iterable, debuggable, multi-cloud/on-prem, identical across research and production.
https://run.house
Apache License 2.0

remove updates to cluster configs post den launch #1443

Closed: jlewitt1 closed this pull request 1 week ago

jlewitt1 commented 1 week ago

If the Den launcher saves the cluster object properly immediately after launching, we no longer need to update the SSH creds or the launched property fields afterwards.

I have a launcher PR up that saves the cluster to Den immediately after it is launched, so the client no longer needs to reload or re-save those attributes.
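A minimal sketch of the proposed flow, assuming hypothetical names (`ClusterConfig`, `FakeDen`, `launch_via_den`) that stand in for Den and the launcher; this is not runhouse's actual API. The launcher persists the full cluster record, including SSH creds and the launched flag, as soon as provisioning finishes, so the client never re-saves those fields. Note that the launcher returns nothing to the caller.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ClusterConfig:
    # Hypothetical fields standing in for the SSH creds and launched properties.
    name: str
    ips: List[str] = field(default_factory=list)
    ssh_user: Optional[str] = None
    launched: bool = False


class FakeDen:
    """Hypothetical in-memory stand-in for Den's cluster store."""

    def __init__(self) -> None:
        self.records: Dict[str, ClusterConfig] = {}

    def save(self, config: ClusterConfig) -> None:
        self.records[config.name] = config

    def load(self, name: str) -> ClusterConfig:
        return self.records[name]


def launch_via_den(den: FakeDen, name: str) -> None:
    """Hypothetical launcher: provisions, then saves the full config to Den right away."""
    # Placeholder values; a real launch would get these from the cloud provider.
    config = ClusterConfig(name=name, ips=["10.0.0.1"], ssh_user="ubuntu", launched=True)
    den.save(config)  # persisted server-side immediately; nothing is returned to the client
```

Usage: after `launch_via_den(den, "my-cluster")`, `den.load("my-cluster").launched` is `True`, with no client-side re-save step.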

sentry-io[bot] commented 1 week ago

🔍 Existing Issues For Review

Your pull request is modifying functions with the following pre-existing issues:

📄 File: runhouse/resources/hardware/launcher_utils.py

Function | Unhandled Issue | Event Count
up | Exception: Received [500] from Den POST 'https://api.run.house/cluster/up': {'code': 404, 'detail': 'No clus... ... | 2
up | Exception: Received [403] from Den POST 'https://api.run.house/cluster/up': {'detail': 'Not authenticated'} ... | 2


jlewitt1 commented 1 week ago

This stack of pull requests is managed by Graphite.

mkandler commented 1 week ago

I don't think we can remove this code if you want to be able to work with the cluster after launching. The launcher needs to return the configuration and the client needs to set it on the cluster object, so that the cluster can be used afterwards, e.g. with cluster.restart_server() or .to(cluster).
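A minimal sketch of the concern, assuming hypothetical names (`LaunchResult`, `LocalCluster`, `fake_launch`, `apply_launch_result`) that are not runhouse's real classes or methods. Even if Den stores the record, the caller's in-memory object only becomes usable once the launcher's returned configuration is copied onto it; until then an operation that needs SSH creds, standing in for `restart_server()`, fails.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LaunchResult:
    # Hypothetical payload the launcher would return after a successful launch.
    ips: List[str]
    ssh_user: str


@dataclass
class LocalCluster:
    """Hypothetical client-side cluster object (not runhouse's Cluster class)."""

    name: str
    ips: List[str] = field(default_factory=list)
    ssh_user: Optional[str] = None

    @property
    def is_up(self) -> bool:
        # Usable only once both the address and the SSH user are known locally.
        return bool(self.ips) and self.ssh_user is not None

    def apply_launch_result(self, result: LaunchResult) -> None:
        # Copy the launcher's returned configuration onto the in-memory object.
        self.ips = list(result.ips)
        self.ssh_user = result.ssh_user

    def restart_server(self) -> str:
        # Stand-in for an operation that needs SSH creds on the local object.
        if not self.is_up:
            raise RuntimeError(f"{self.name}: no IPs or SSH creds set locally")
        return f"restarting server on {self.ips[0]} as {self.ssh_user}"


def fake_launch(name: str) -> LaunchResult:
    # Hypothetical launcher call; placeholder values instead of real provisioning.
    return LaunchResult(ips=["10.0.0.1"], ssh_user="ubuntu")
```

Usage: calling `restart_server()` on a fresh `LocalCluster("my-cluster")` raises `RuntimeError`; after `cluster.apply_launch_result(fake_launch(cluster.name))` it returns `"restarting server on 10.0.0.1 as ubuntu"`.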