skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.82k stars 513 forks

Support event-based smoke tests instead of sleep-based ones to reduce flakiness and speed up tests #4284

Closed zpoint closed 2 days ago

zpoint commented 2 weeks ago

Add some bash functions as global variables:

# cluster
_WAIT_UNTIL_CLUSTER_STATUS_IS
_WAIT_UNTIL_CLUSTER_IS_NOT_FOUND
_WAIT_UNTIL_JOB_STATUS_CONTAINS_MATCHING_JOB_ID 
_WAIT_UNTIL_JOB_STATUS_CONTAINS_WITHOUT_MATCHING_JOB
_WAIT_UNTIL_JOB_STATUS_CONTAINS_MATCHING_JOB_NAME

# managed jobs
_WAIT_UNTIL_MANAGED_JOB_STATUS_CONTAINS_MATCHING_JOB_NAME
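
To make the idea concrete, here is a minimal sketch of how a helper of this kind could be written: a bash polling loop kept in a Python string and filled in per test. This is an illustration only, not the code from this PR. The names `_WAIT_UNTIL_OUTPUT_CONTAINS` and `build_wait_command`, the parameters, and the default timeout and interval are all hypothetical.

```python
import shlex

# Hypothetical template (not the PR's actual helper): poll a status command
# until its output contains the expected text, and exit non-zero once the
# timeout (in seconds) has elapsed.
_WAIT_UNTIL_OUTPUT_CONTAINS = (
    'start=$(date +%s); '
    'while true; do '
    'if {status_cmd} | grep -qF -- {expected}; then break; fi; '
    'if [ $(( $(date +%s) - start )) -ge {timeout} ]; then '
    'echo "Timed out waiting for "{expected}; exit 1; fi; '
    'sleep {interval}; '
    'done')


def build_wait_command(status_cmd, expected, timeout=300, interval=10):
    """Return a bash snippet that waits until `status_cmd` prints `expected`.

    `status_cmd` is inserted verbatim, so it must already be a valid shell
    command; `expected` is shell-quoted before insertion.
    """
    return _WAIT_UNTIL_OUTPUT_CONTAINS.format(
        status_cmd=status_cmd,
        expected=shlex.quote(expected),
        timeout=int(timeout),
        interval=int(interval))


# Illustrative usage; the cluster name and status string are placeholders.
print(build_wait_command('sky status my-cluster --refresh', 'UP', timeout=600))
```

A test would then run the generated snippet in place of a fixed `sleep`, so it continues as soon as the expected status appears and fails with a clear message when the timeout is reached.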

Before

python -m pytest -n 6 tests/test_smoke.py::test_launch_fast_with_autostop tests/test_smoke.py::test_clone_disk_aws tests/test_smoke.py::test_stale_job tests/test_smoke.py::test_aws_stale_job_manual_restart tests/test_smoke.py::test_multi_echo tests/test_smoke.py::test_autostop tests/test_smoke.py::test_managed_jobs_failed_setup --aws

[success] 30.60% tests/test_smoke.py::test_autostop: 1435.7293s
[success] 16.65% tests/test_smoke.py::test_clone_disk_aws: 781.2931s
[success] 12.00% tests/test_smoke.py::test_aws_stale_job_manual_restart: 562.9241s
[success] 11.79% tests/test_smoke.py::test_multi_echo: 553.2864s
[success] 11.28% tests/test_smoke.py::test_managed_jobs_failed_setup: 529.3282s
[success] 11.05% tests/test_smoke.py::test_launch_fast_with_autostop: 518.6133s
[success] 6.62% tests/test_smoke.py::test_stale_job: 310.6923s
7 passed, 1122 warnings in 1437.70s (0:23:57)

After

python -m pytest -n 6 tests/test_smoke.py::test_launch_fast_with_autostop tests/test_smoke.py::test_clone_disk_aws tests/test_smoke.py::test_stale_job tests/test_smoke.py::test_aws_stale_job_manual_restart tests/test_smoke.py::test_multi_echo tests/test_smoke.py::test_autostop tests/test_smoke.py::test_managed_jobs_failed_setup --aws

[success] 26.90% tests/test_smoke.py::test_autostop: 900.6892s
[success] 19.89% tests/test_smoke.py::test_clone_disk_aws: 666.1263s
[success] 18.24% tests/test_smoke.py::test_multi_echo: 610.6881s
[success] 11.87% tests/test_smoke.py::test_launch_fast_with_autostop: 397.4097s
[success] 9.06% tests/test_smoke.py::test_stale_job: 303.4335s
[success] 7.82% tests/test_smoke.py::test_aws_stale_job_manual_restart: 261.7197s
[success] 6.23% tests/test_smoke.py::test_managed_jobs_failed_setup: 208.6965s
7 passed, 1332 warnings in 902.35s (0:15:02)

About 37% less wall-clock time overall (1437.70s down to 902.35s), and less flaky
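
The two runs can be compared directly from the timings listed above. The following sketch uses only those reported numbers; the helper name `reduction_pct` is made up for this example.

```python
# Per-test durations in seconds, copied from the "Before" and "After" runs.
before = {
    'test_autostop': 1435.7293,
    'test_clone_disk_aws': 781.2931,
    'test_aws_stale_job_manual_restart': 562.9241,
    'test_multi_echo': 553.2864,
    'test_managed_jobs_failed_setup': 529.3282,
    'test_launch_fast_with_autostop': 518.6133,
    'test_stale_job': 310.6923,
}
after = {
    'test_autostop': 900.6892,
    'test_clone_disk_aws': 666.1263,
    'test_multi_echo': 610.6881,
    'test_launch_fast_with_autostop': 397.4097,
    'test_stale_job': 303.4335,
    'test_aws_stale_job_manual_restart': 261.7197,
    'test_managed_jobs_failed_setup': 208.6965,
}


def reduction_pct(old, new):
    """Percentage of time saved going from `old` to `new` (negative = slower)."""
    return (old - new) / old * 100.0


# Per-test change, slowest "before" test first.
for name in sorted(before, key=before.get, reverse=True):
    saved = reduction_pct(before[name], after[name])
    print(f'{name}: {before[name]:.0f}s -> {after[name]:.0f}s ({saved:+.1f}%)')

# Overall wall-clock time as reported by pytest for each run.
total_reduction = reduction_pct(1437.70, 902.35)
print(f'total wall-clock time saved: {total_reduction:.1f}%')
```

Because the tests run in parallel (`-n 6`), the total wall-clock time closely tracks the slowest test, `test_autostop`, in both runs. In these two runs `test_multi_echo` was the only test that took longer afterwards (about 553s before, 611s after).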

Also modify some smoke test cases to use these functions.

The aim is to replace all time-based waiting (sleep xxx) with these event-based waiting functions.
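
The difference between the two styles can be sketched in plain Python. This is a generic illustration of event-based waiting, not the PR's helpers; `wait_until`, its parameters, and the status strings are hypothetical.

```python
import time


def wait_until(get_status, expected, timeout=300.0, interval=5.0):
    """Poll `get_status()` until it returns a value in `expected`.

    Returns the matching status as soon as it is seen, and raises
    TimeoutError if `timeout` seconds pass first. A fixed `time.sleep(N)`
    would instead always wait the full N seconds and still give no
    guarantee that the state had been reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in expected:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f'status {status!r} not in {sorted(expected)} after {timeout}s')
        time.sleep(interval)


# Illustrative usage with a fake status source that becomes 'UP' on the
# third poll; a real test would query the cluster or job status instead.
statuses = iter(['INIT', 'INIT', 'UP'])
print(wait_until(lambda: next(statuses), {'UP'}, timeout=5.0, interval=0.01))
```

The timeout plays the role the sleep duration used to play, but only as an upper bound: a fast run finishes early, and a slow run fails explicitly instead of racing ahead.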

There are over 100 lines of time-based waits that need to be replaced. This PR won't replace them all at once (otherwise the PR would be too large). If all goes well, it will be the first PR to introduce this approach, and subsequent PRs will replace the rest.

This should make the tests run faster and be less flaky.

Tested (run the relevant ones):

zpoint commented 3 days ago

Hi @romilbhardwaj, this PR adds more test cases and global functions on top of the current PR.

To avoid disrupting what's already reviewed, I opened another PR based on this one. Could you please help review it? Thanks!