ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.92k stars 5.77k forks

[release] tune_cloud_test_durable_upload.aws is failing #34369

Closed xwjiang2010 closed 1 year ago

xwjiang2010 commented 1 year ago

https://buildkite.com/ray-project/release-tests-branch/builds/1548#01877273-5d06-4bf2-92fd-6360e5918ca3

xwjiang2010 commented 1 year ago

Grabbed a bunch of commits between the successful and failing runs:

git log --oneline d9aacb6de8c72b15a622ebcd4e6c36bc4a67dd88^..92d6f1f3069201c7f2acb542eac25e7fa9a291bf

92d6f1f306 [ci/release] Migrate GBDT tests (xgboost/lightgbm) to GCE (#34264)
4013930146 [Part 2/n] Rename Dataset => Datastream (DataContext, DataIterator, GroupedDatastream) (#34186)
c8a4b98670 [Data] Support using concurrent actors for `ActorPool` (#34253)
fd6b99ac75 [Core] Introduce spill_on_unavailable option for soft NodeAffinitySchedulingStrategy (#34224)
156be229fe [RLlib] Change occurences of `"_observation_space_in_preferred_format"` to `"_obs_space_in_preferred_format"` (#33907)
1cc5916310 [try 2] [doc] [data] Fix autosummary issues  (#34228)
d947fb025c [serve] Log to file on LongPollClient update (#34204)
fba9d15db5 [data] Add take_batch API for collecting data in the same format as iter_batches and map_batches (#34217)
384446845f [docs][KubeRay] Provide some GKE instructions in KubeRay example (#33339)
13ef40b0e6 [Data] Update path expansion warning (#34221)
d1e7629823 [core] Task backend - Add worker died info to failed tasks when job exits.  (#34166)
3c22ad6704 [RLlib] Add examples and docs for Catalog. (#33898)
a3544a298c [RLlib] Remove infos dict before Json_writer writes sample batches (#33896)
3530ed098c [Core] Fix ray start command output (#34081)
4168b9bdae [RLlib] Fixed a bug with kl divergence calculation of torch.Dirichlet distribution within RLlib (#34209)
f3bd6c0009 [core] prestart worker on node startup (#33623)
6fca66776f [core][ci] Fix test_fault_tolerance_actor_tasks_failed for test_task_events_2.py (#34237)
f255ddae8c [Serve] [Docs] Clarify that the Serve config only supports remote URIs (#34212)
0a8471e428 [RLlib] Change broken link in parameter_noise.py (#34231)
64b69dc2df [Dataset] Fix breaking Data CI tests (#34195)
6db3f1fb89 [Actor] [Code Quality] Add Unit Tests for Actors Sorting  (#34058)
fe969396ab [docs][KubeRay] Update KubeRay doc for release v0.5.0 (#34178)
10be570375 [Core] lazy import autoscaler + don't import opentelemetry unless setup hook (#33964)
4ad2cd162b [core] Fix the placement group stress test regression. (#34192)
16283f822a [Serve][Release][Part1] Enable tests to GCE (#34163)
d9aacb6de8 Revert "[doc] [data] Fix autosummary issues (#34220)" (#34227)

Doesn't seem to have anything related to tune.
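A quick way to double-check that by keyword rather than by eye, as a rough sketch: the snippet below scans `git log --oneline` output for a whole-word match in the commit subject. `commits_mentioning` is a hypothetical helper written for illustration, and `LOG` only holds a few lines copied from the listing above, so paste in the full output to check the whole range.

```python
import re

# A few lines copied from the `git log --oneline` output above (not the full range).
LOG = """\
92d6f1f306 [ci/release] Migrate GBDT tests (xgboost/lightgbm) to GCE (#34264)
fd6b99ac75 [Core] Introduce spill_on_unavailable option for soft NodeAffinitySchedulingStrategy (#34224)
d947fb025c [serve] Log to file on LongPollClient update (#34204)
4ad2cd162b [core] Fix the placement group stress test regression. (#34192)
"""


def commits_mentioning(log_text, keyword):
    """Return (sha, subject) pairs whose subject contains `keyword` as a whole word.

    Hypothetical helper; matching is case-insensitive.
    """
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    hits = []
    for line in log_text.splitlines():
        if not line.strip():
            continue
        # `--oneline` output is "<short sha> <subject>".
        sha, _, subject = line.partition(" ")
        if pattern.search(subject):
            hits.append((sha, subject))
    return hits


print(commits_mentioning(LOG, "tune"))
print(commits_mentioning(LOG, "core"))
```

A subject-line match is only a hint: a commit tagged `[Core]` can still touch code paths that tune depends on.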

@can-anyscale I see you have been doing some bisections. Do you know if anything changed on the infra side before I dig into the application code?

can-anyscale commented 1 year ago

@xwjiang2010 - my bisect (https://buildkite.com/ray-project/release-tests-bisect/builds/37) points to fd6b99ac75, but if you rerun the test a couple of times on that commit, it might pass. So my best guess is that the test is flaky (or failed due to a non-application error code).

Maybe try running the test on top of master again to see if it passes now.
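To make the "rerun a couple of times" idea concrete, here is a minimal sketch of labeling a commit from repeated runs instead of a single result. This is not how the release-test bisect tooling works; `classify_commit` and the `run_test` callable are hypothetical names used only for illustration.

```python
from typing import Callable


def classify_commit(run_test: Callable[[], bool], attempts: int = 3) -> str:
    """Run a test several times on one commit and label the commit.

    `run_test` is a hypothetical callable that returns True when the test passes.
    Returns "good" if every run passes, "bad" if every run fails,
    and "flaky" if the results are mixed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    results = [run_test() for _ in range(attempts)]
    if all(results):
        return "good"
    if not any(results):
        return "bad"
    return "flaky"


# Stubbed outcomes standing in for three real runs on the same commit.
outcomes = iter([False, True, True])
print(classify_commit(lambda: next(outcomes), attempts=3))  # prints flaky
```

A commit labeled "flaky" is probably better skipped during a bisect than marked bad, since a single failing run on it says little about whether it introduced the regression.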

xwjiang2010 commented 1 year ago

Actually @can-anyscale, let's pause investigation on this one for now.

There may already be a fix for this, as @justinvyu pointed out: https://github.com/ray-project/ray/pull/34263#pullrequestreview-1380215668

Let's see how it goes after landing this PR.

can-anyscale commented 1 year ago

@xwjiang2010 sounds great!

xwjiang2010 commented 1 year ago

Issue seems resolved after the PR was merged.

xwjiang2010 commented 1 year ago

https://buildkite.com/ray-project/release-tests-branch/builds/1541#018766ac-6e43-44bd-a0df-5e42ecd43fbe

xwjiang2010 commented 1 year ago

linked the wrong one.