We had been using a fixed amount of memory for the cpu/mem/gpu trackers, but they actually need memory per worker, since we open a connection to each worker.
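A minimal sketch of the idea, with hypothetical names and constants (the actual base and per-worker sizes are not stated in this PR):

```python
def tracker_memory_bytes(num_workers: int,
                         base_bytes: int = 64 * 1024**2,
                         per_worker_bytes: int = 16 * 1024**2) -> int:
    """Memory to reserve for the cpu/mem/gpu trackers.

    One connection is held per worker, so the reservation must scale
    with the worker count instead of being a fixed constant.
    """
    return base_bytes + per_worker_bytes * num_workers
```

For example, a 4-worker run would reserve the base amount plus four per-worker increments, instead of the same fixed amount regardless of run size.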
This also updates the `kubectl logs` invocation in the logs container to pass a `--max-logs-requests` value set to the run's worker count.
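Roughly, the logs container would derive the flag from the worker count; this is an illustrative sketch (the flag name comes from this PR, while `NUM_WORKERS` and the pod selector are stand-ins):

```shell
# Hypothetical: NUM_WORKERS would come from the run's configuration.
NUM_WORKERS=8
LOG_ARGS="--max-logs-requests=${NUM_WORKERS}"

# The real command line is assembled elsewhere; shown here for shape only.
echo "kubectl logs ${LOG_ARGS} -f -l run=my-run"
```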
This also fixes some issues with custodian and torchx jobs so that they skip worker-status tracking and runtime env setup, since, for now, those are Ray-specific.
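One way to picture the gating, with illustrative names (the PR does not show the actual dispatch code):

```python
def enabled_trackers(job_type: str) -> set[str]:
    """Return the set of trackers to run for a given job type."""
    # cpu/mem/gpu tracking applies to every job type.
    trackers = {"cpu", "mem", "gpu"}
    # Worker-status tracking and runtime env setup are, for now,
    # Ray-specific, so custodian/torchx jobs skip them.
    if job_type == "ray":
        trackers |= {"worker_status", "runtime_env_setup"}
    return trackers
```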