guidebooks / store

The home for importable Guidebooks

fix: improve custodian memory requests for larger jobs #760

Closed: starpit closed this 1 year ago

starpit commented 1 year ago

We had been requesting a fixed amount of memory for the cpu/mem/gpu trackers. But their memory needs grow with the number of workers, since each tracker connects to every worker.
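To illustrate the idea of scaling the request with the worker count, here is a minimal sketch. It is not the code from this PR: the function name `custodian_memory_request`, the 256Mi base, and the 32Mi per-worker increment are all hypothetical values chosen for the example.

```python
def custodian_memory_request(num_workers: int,
                             base_mi: int = 256,
                             per_worker_mi: int = 32) -> str:
    """Return a Kubernetes-style memory quantity that grows with num_workers.

    base_mi and per_worker_mi are illustrative assumptions, not the
    values used by the actual custodian.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    # Fixed overhead plus one increment per worker connection.
    return f"{base_mi + per_worker_mi * num_workers}Mi"


if __name__ == "__main__":
    for n in (1, 4, 32):
        print(n, custodian_memory_request(n))
```

With a fixed request, a run with many workers would eventually exceed the allocation; a linear model like this one keeps the request proportional to the number of connections the trackers hold open.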

This also updates the `kubectl logs` call in the logs container to pass `--max-log-requests`, set to the number of workers for the run.
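As a rough sketch of what such an invocation could look like (not the command from this PR), the helper below builds the argument list. The flag is spelled `--max-log-requests` in kubectl, where it caps the number of concurrent log streams when following by selector and defaults to 5. The helper name `kubectl_logs_args` and the label selector in the usage line are hypothetical.

```python
from typing import List, Optional


def kubectl_logs_args(selector: str,
                      num_workers: int,
                      namespace: Optional[str] = None) -> List[str]:
    """Build a `kubectl logs` argv that follows all pods matching selector.

    The selector and namespace values are placeholders supplied by the
    caller; this helper is an illustration, not the store's actual script.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    args = ["kubectl", "logs", "-l", selector, "-f",
            # Allow one concurrent log stream per worker.
            f"--max-log-requests={num_workers}"]
    if namespace is not None:
        args += ["-n", namespace]
    return args


if __name__ == "__main__":
    # "app=my-worker" is a made-up label for demonstration.
    print(" ".join(kubectl_logs_args("app=my-worker", 8)))
```

Without raising the limit, following logs by selector fails once the number of matching pods exceeds the default, which is why tying it to the worker count matters for larger jobs.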

This also fixes some issues with how the custodian handles torchx jobs: it now avoids tracking worker status and runtime env setup (since, for now, those are ray-specific).