skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

Flaky test: `test_optimizer_dryruns.py` occasionally fails #4308

Closed andylizf closed 1 week ago

andylizf commented 1 week ago

The test `test_infer_cloud_from_region_or_zone` in `test_optimizer_dryruns.py` occasionally fails in GitHub Actions. For example: https://github.com/skypilot-org/skypilot/actions/runs/11737074167/job/32697299221

The error message suggests a cloud inference issue:

ValueError: Cannot infer cloud from (region 'us-east-2', zone None). Multiple enabled clouds have region/zone of the same names: [AWS, Lambda]. To fix: explicitly specify `cloud`.

However, this looks like a resource availability issue that makes the test flaky, not an actual cloud inference bug. The test passes on some runs and fails on others, which suggests a race condition or a resource constraint in the test environment.

We should either:

  1. Add retry logic for this test
  2. Mock the resource availability check
  3. Mark the test as flaky using pytest-rerunfailures
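
As a rough illustration of option 2, the sketch below pins the set of enabled clouds with `unittest.mock` so the result no longer depends on the environment. It assumes the flakiness comes from which clouds happen to be enabled on a given run, as the `[AWS, Lambda]` part of the error hints. `REGIONS`, `CloudRegistry`, and `infer_cloud` are hypothetical stand-ins written for this example, not SkyPilot's actual API.

```python
from unittest import mock

# Hypothetical region catalog for illustration only; not SkyPilot's real data.
REGIONS = {
    "AWS": {"us-east-1", "us-east-2"},
    "Lambda": {"us-east-1", "us-east-2"},
    "GCP": {"us-central1"},
}


class CloudRegistry:
    """Hypothetical stand-in for whatever reports the enabled clouds."""

    def enabled_clouds(self):
        # In a real environment this would depend on available credentials,
        # which is what makes an unpinned test environment-sensitive.
        return ["AWS", "Lambda", "GCP"]


def infer_cloud(registry, region):
    """Return the single enabled cloud that has `region`, else raise."""
    matches = [c for c in registry.enabled_clouds() if region in REGIONS[c]]
    if len(matches) != 1:
        raise ValueError(
            f"Cannot infer cloud from region {region!r}: candidates {matches}"
        )
    return matches[0]


def check_infer_cloud_with_pinned_clouds():
    """Test body: pin the enabled clouds so the inference is unambiguous."""
    registry = CloudRegistry()
    with mock.patch.object(
        registry, "enabled_clouds", return_value=["AWS", "GCP"]
    ):
        return infer_cloud(registry, "us-east-2")
```

With the clouds pinned, `check_infer_cloud_with_pinned_clouds()` resolves `us-east-2` to AWS on every run, while the unpatched registry raises the ambiguity error. Option 3 would instead leave the test body alone and add `@pytest.mark.flaky(reruns=...)` from pytest-rerunfailures, which hides the nondeterminism instead of removing it.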

Michaelvll commented 1 week ago

fixed in #4302