skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.86k stars 518 forks

Implement Automated Weekly Smoke Tests #4112

Open andylizf opened 1 month ago

andylizf commented 1 month ago

Implement Automated Weekly Smoke Tests

Problem

Smoke tests for SkyPilot (implemented in test_smoke.py) are currently run manually. Running them automatically on a weekly schedule would remove this manual step.

Proposed Solution

Implement an automated weekly smoke test run using a suitable CI/CD platform or automation tool, leveraging the existing test_smoke.py script.

Implementation Details

  1. Automation Setup:

    • Create a new workflow or job for weekly smoke tests
    • Schedule the workflow to run weekly
    • Use the existing test_smoke.py script in the automation
  2. Environment Setup:

    • Add necessary cloud credentials securely to the CI/CD platform
    • Ensure the test runner has the required dependencies installed
  3. Test Execution:

    • Run pytest tests/test_smoke.py --terminate-on-failure
    • Handle different test groups (AWS, GCP, Azure, etc.) as defined in the script
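The test-execution step above could be driven by a small wrapper script that the scheduled job invokes once per test group. This is a minimal sketch, not the project's actual setup: only the `pytest tests/test_smoke.py --terminate-on-failure` command comes from this issue. The group names in `CLOUD_GROUPS`, the per-cloud selection flag of the form `--<cloud>`, and the helper names `build_command`, `run_groups`, and `main` are assumptions made for illustration and should be checked against test_smoke.py.

```python
import subprocess
import sys
from typing import Callable, Dict, List, Sequence

# Hypothetical group names; the real groups are whatever test_smoke.py defines.
CLOUD_GROUPS: List[str] = ["aws", "gcp", "azure"]


def build_command(cloud: str) -> List[str]:
    """Build the pytest invocation for one cloud group.

    Assumes a per-cloud selection flag of the form --<cloud>; that flag
    is an assumption, not something confirmed in this issue.
    """
    return [
        sys.executable, "-m", "pytest", "tests/test_smoke.py",
        "--terminate-on-failure", f"--{cloud}",
    ]


def run_groups(
    clouds: Sequence[str],
    runner: Callable[[List[str]], int] = subprocess.call,
) -> Dict[str, int]:
    """Run each group in turn and collect exit codes (0 means success)."""
    return {cloud: runner(build_command(cloud)) for cloud in clouds}


def main() -> int:
    """Run every group, print a summary, and return a job exit code."""
    results = run_groups(CLOUD_GROUPS)
    for cloud, code in results.items():
        status = "passed" if code == 0 else "failed (exit %d)" % code
        print(f"{cloud}: {status}")
    # A non-zero return marks the weekly job as failed if any group failed.
    return 1 if any(results.values()) else 0
```

A scheduled job would call `sys.exit(main())`. Keeping the runner injectable means the group-handling logic can be checked without launching any cloud resources.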

Challenges to Address

  1. Credit Control:

    • Implement a mechanism to limit cloud credits used during automated tests
    • Consider setting up budget alerts or usage limits
  2. Test Stability and Retries:

    • Implement retry logic for transient failures
    • Define criteria for test failures vs. retries
    • Set a maximum number of retry attempts
  3. Multi-Cloud Testing:

    • Ensure tests cover all supported cloud providers as defined in test_smoke.py
    • Handle potential differences in setup or execution across clouds
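The credit-control and retry points above could be combined in one guard around each test run. The following is a rough sketch under stated assumptions: the `Attempt` record, the `BudgetExceeded` exception, the `run_with_retries` helper, and the default limits (3 attempts, 50 USD, 30 s backoff) are all hypothetical placeholders. How a failure is classified as transient and how spend is estimated are left to the caller.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Attempt:
    """Outcome of one test run (hypothetical record for this sketch)."""
    passed: bool
    transient: bool = False  # True if the failure looks retryable
    cost_usd: float = 0.0    # Caller-supplied spend estimate for the run


class BudgetExceeded(RuntimeError):
    """Raised when estimated spend reaches the configured cap."""


def run_with_retries(
    run_once: Callable[[], Attempt],
    max_attempts: int = 3,      # placeholder limit
    budget_usd: float = 50.0,   # placeholder cap
    backoff_s: float = 30.0,    # placeholder base delay
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry transient failures with exponential backoff, within a budget.

    Returns True on success and False on a non-transient failure or when
    attempts run out. Raises BudgetExceeded before starting an attempt
    if the estimated spend has already reached the cap.
    """
    spent = 0.0
    for attempt in range(1, max_attempts + 1):
        if spent >= budget_usd:
            raise BudgetExceeded(
                f"spent ~${spent:.2f} of ${budget_usd:.2f} budget")
        result = run_once()
        spent += result.cost_usd
        if result.passed:
            return True
        if not result.transient:
            return False  # A real test failure: do not retry.
        if attempt < max_attempts:
            sleep(backoff_s * 2 ** (attempt - 1))
    return False
```

Separating "transient" from "real" failures in the return value of `run_once` keeps the retry policy independent of how each cloud reports errors, which is one way to handle the per-cloud differences mentioned above.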

Next Steps

Feedback on implementation details and challenge mitigation strategies is welcome, particularly from those familiar with test_smoke.py and our current testing processes.

asaiacai commented 1 month ago

I have some Buildkite stuff I've already set up for GKE that would probably adapt well to running SkyPilot smoke tests. Happy to work with the team on sharing it, since I'm depending pretty heavily on SkyPilot + k8s right now.

romilbhardwaj commented 1 month ago

Another cost optimization: we can run a k8s cluster as part of our CI and move many of the cloud-agnostic tests (e.g., those that test core SkyPilot functionality) onto a Kubernetes cluster provisioned in GitHub Actions for the duration of the test.

See: https://github.com/marketplace/actions/kind-kubernetes-in-docker-action
Step-by-step blog: https://dev.to/kitarp29/running-kubernetes-on-github-actions-f2c

Need to evaluate cost-benefit (test migration effort vs reduced cloud cost) before we implement the above.
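For reference while weighing that trade-off, the lifecycle of a throwaway kind cluster is short enough to sketch. In a GitHub Actions workflow the linked action would normally handle this; the snippet below shows the same create/use/delete pattern as a script, assuming the `kind` CLI is installed on the runner. The `kind_cluster` helper and the `skypilot-ci` cluster name are hypothetical.

```python
import contextlib
import subprocess
from typing import Callable, Iterator, List


def _run(cmd: List[str]) -> None:
    """Run a command and raise if it exits non-zero."""
    subprocess.run(cmd, check=True)


@contextlib.contextmanager
def kind_cluster(
    name: str = "skypilot-ci",  # hypothetical cluster name
    run: Callable[[List[str]], None] = _run,
) -> Iterator[str]:
    """Create a kind cluster for the duration of the block, then delete it."""
    # --wait blocks until the control plane is ready or the timeout expires.
    run(["kind", "create", "cluster", "--name", name, "--wait", "120s"])
    try:
        yield name
    finally:
        # Always clean up, even if the tests inside the block raised.
        run(["kind", "delete", "cluster", "--name", name])
```

The cloud-agnostic tests would then run inside a `with kind_cluster():` block. Which tests qualify, and how they would be pointed at the local cluster, still depends on the evaluation above.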