skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.81k stars 513 forks

Fix AWS Route Table caching which causes invalid failures in other regions after an initial valid failure. #4303

Closed sfrolich closed 1 week ago

sfrolich commented 1 week ago

Log error before throwing exception

Route tables returned by the _get_route_tables() function in config.py were incorrect after the first call because of the @functools.lru_cache decorator on the function. The cache did not take the region into account, so switching regions returned the stale result from the earlier region. I added region to the function call, which keeps the cache and makes it return the correct results for each region.
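A minimal sketch of the caching behavior, for reviewers who want to reproduce it without AWS. The function names, the `state` dict, and the fake route table IDs are all hypothetical stand-ins, not the actual code in config.py. It only illustrates that `functools.lru_cache` keys on the call arguments, so a value that depends on the region must receive the region as an argument.

```python
import functools

# Hypothetical stand-in for per-region AWS data; not SkyPilot's real state.
FAKE_ROUTE_TABLES = {
    "us-east-1": ("rtb-east",),
    "us-west-2": ("rtb-west",),
}
# Hypothetical "currently selected region", read implicitly by the buggy version.
state = {"region": "us-east-1"}


@functools.lru_cache()
def get_route_tables_buggy(filter_key):
    # The region comes from outside the arguments, so it is not part of
    # the cache key: the first result is reused for every later region.
    return FAKE_ROUTE_TABLES[state["region"]]


@functools.lru_cache()
def get_route_tables_fixed(filter_key, region):
    # The region is an argument, so each region gets its own cache entry.
    return FAKE_ROUTE_TABLES[region]


state["region"] = "us-east-1"
first = get_route_tables_buggy("main")
state["region"] = "us-west-2"
stale = get_route_tables_buggy("main")  # same arguments, so the cached east result is returned
fresh = get_route_tables_fixed("main", state["region"])
```

With the region in the argument list, repeated calls for the same region still hit the cache, while a new region triggers a fresh lookup.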

Also added a logger.error() call to the _skypilot_log_error_and_exit_for_failover() function so that the error is actually logged and shows up in stderr. Without it, you get a generic error that does not tell you the actual cause, and you have to go into the provision log to find it.
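The pattern is roughly the following sketch. The logger name, the `FailoverError` exception, and `log_error_and_raise()` are hypothetical names chosen for illustration; the real function in the codebase may have a different signature and exit path.

```python
import logging
import sys

# Hypothetical logger setup; a StreamHandler on sys.stderr makes errors visible there.
logger = logging.getLogger("provision_sketch")
logger.addHandler(logging.StreamHandler(sys.stderr))
logger.setLevel(logging.INFO)


class FailoverError(RuntimeError):
    """Hypothetical exception type signalling that failover should happen."""


def log_error_and_raise(reason: str) -> None:
    # Log the concrete cause first so it appears in stderr,
    # then raise so the caller can fail over to another region.
    logger.error(reason)
    raise FailoverError(reason)
```

Logging before raising means the cause is visible even when a caller catches the exception and reports only a generic failure.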

Tested (run the relevant ones):

Ran successful A10G:1 deploys and unsuccessful H100:8 deploys across all my regions in AWS; the appropriate messages are shown in each case.

CC: @concretevitamin

sfrolich commented 1 week ago

I think that doc change failed mypy. Let me fix it.