SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
Route tables returned by the config.py _get_route_tables() function were incorrect after the first call because of the @functools.lru_cache defined on the function: after switching regions, the cached result from the earlier region was returned again. I added region to the function call so the cache is kept and the function returns the correct results for each region.
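For reviewers, here is a minimal sketch of the caching problem and the fix. It is not the actual config.py code: the function names, the `vpc_id` parameter, the `STATE` dict, and the fake route table data are all hypothetical stand-ins. It only illustrates that `functools.lru_cache` keys on the call arguments, so a value the function reads from elsewhere (here, the region) has to be passed in to become part of the cache key.

```python
import functools

# Hypothetical stand-in for per-region cloud state (not real SkyPilot data).
FAKE_ROUTE_TABLES = {
    "us-east-1": ["rtb-east"],
    "us-west-2": ["rtb-west"],
}

# Hypothetical "current region" that changes during failover.
STATE = {"region": "us-east-1"}


@functools.lru_cache(maxsize=None)
def get_route_tables_buggy(vpc_id):
    # The region is read from outside state, so it is NOT part of the
    # cache key. A second call with the same vpc_id returns the first
    # region's result even after the region has changed.
    return tuple(FAKE_ROUTE_TABLES[STATE["region"]])


@functools.lru_cache(maxsize=None)
def get_route_tables_fixed(region, vpc_id):
    # The region is an argument, so each region gets its own cache entry.
    return tuple(FAKE_ROUTE_TABLES[region])


if __name__ == "__main__":
    get_route_tables_buggy("vpc-1")          # cached for us-east-1
    STATE["region"] = "us-west-2"
    print(get_route_tables_buggy("vpc-1"))   # stale us-east-1 result
    print(get_route_tables_fixed("us-west-2", "vpc-1"))
```

Passing the region keeps the benefit of the cache (repeat lookups in the same region are still served from memory) without clearing it on every region switch.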
Also added a logger.error() call to the _skypilot_log_error_and_exit_for_failover() function so that the actual error is logged and shows up in stderr. Without this you get a generic error that does not tell you the actual cause, and you have to go into the provision log to find it.
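The logging change follows the usual "log, then raise" pattern. The sketch below is an illustration under assumptions, not the real function body: the logger name, the `ProvisionFailoverError` exception type, and the message text are hypothetical, and the real function may exit differently. It shows the specific cause being written to stderr before the generic failure propagates.

```python
import logging
import sys

# Hypothetical logger for this sketch; the real module has its own logger.
logger = logging.getLogger("provision_sketch")
logger.addHandler(logging.StreamHandler(sys.stderr))
logger.setLevel(logging.ERROR)


class ProvisionFailoverError(RuntimeError):
    """Hypothetical exception type used only in this sketch."""


def log_error_and_exit_for_failover(error: str) -> None:
    # Log the specific cause first so it appears in stderr...
    logger.error("Failed to provision: %s", error)
    # ...then raise the generic failure that triggers failover.
    raise ProvisionFailoverError("Provisioning failed.")


if __name__ == "__main__":
    try:
        log_error_and_exit_for_failover("no capacity for the requested instance")
    except ProvisionFailoverError as exc:
        print(f"caught: {exc}")
```

With this ordering, the caller still sees the same generic failure, but the underlying reason no longer requires opening the provision log.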
Tested (run the relevant ones):
[X] Code formatting: bash format.sh
[X] Any manual or new tests for this PR (please specify below)
Log error before throwing exception
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh
Ran successful A10G:1 deploys and unsuccessful H100:8 deploys across all my regions in AWS, and the appropriate messages are shown in both cases.
CC: @concretevitamin