Closed by xwjiang2010 1 year ago
I grabbed the commits between the last successful run and the first failing run:
git log --oneline d9aacb6de8c72b15a622ebcd4e6c36bc4a67dd88^..92d6f1f3069201c7f2acb542eac25e7fa9a291bf
92d6f1f306 [ci/release] Migrate GBDT tests (xgboost/lightgbm) to GCE (#34264)
4013930146 [Part 2/n] Rename Dataset => Datastream (DataContext, DataIterator, GroupedDatastream) (#34186)
c8a4b98670 [Data] Support using concurrent actors for `ActorPool` (#34253)
fd6b99ac75 [Core] Introduce spill_on_unavailable option for soft NodeAffinitySchedulingStrategy (#34224)
156be229fe [RLlib] Change occurences of `"_observation_space_in_preferred_format"` to `"_obs_space_in_preferred_format"` (#33907)
1cc5916310 [try 2] [doc] [data] Fix autosummary issues (#34228)
d947fb025c [serve] Log to file on LongPollClient update (#34204)
fba9d15db5 [data] Add take_batch API for collecting data in the same format as iter_batches and map_batches (#34217)
384446845f [docs][KubeRay] Provide some GKE instructions in KubeRay example (#33339)
13ef40b0e6 [Data] Update path expansion warning (#34221)
d1e7629823 [core] Task backend - Add worker died info to failed tasks when job exits. (#34166)
3c22ad6704 [RLlib] Add examples and docs for Catalog. (#33898)
a3544a298c [RLlib] Remove infos dict before Json_writer writes sample batches (#33896)
3530ed098c [Core] Fix ray start command output (#34081)
4168b9bdae [RLlib] Fixed a bug with kl divergence calculation of torch.Dirichlet distribution within RLlib (#34209)
f3bd6c0009 [core] prestart worker on node startup (#33623)
6fca66776f [core][ci] Fix test_fault_tolerance_actor_tasks_failed for test_task_events_2.py (#34237)
f255ddae8c [Serve] [Docs] Clarify that the Serve config only supports remote URIs (#34212)
0a8471e428 [RLlib] Change broken link in parameter_noise.py (#34231)
64b69dc2df [Dataset] Fix breaking Data CI tests (#34195)
6db3f1fb89 [Actor] [Code Quality] Add Unit Tests for Actors Sorting (#34058)
fe969396ab [docs][KubeRay] Update KubeRay doc for release v0.5.0 (#34178)
10be570375 [Core] lazy import autoscaler + don't import opentelemetry unless setup hook (#33964)
4ad2cd162b [core] Fix the placement group stress test regression. (#34192)
16283f822a [Serve][Release][Part1] Enable tests to GCE (#34163)
d9aacb6de8 Revert "[doc] [data] Fix autosummary issues (#34220)" (#34227)
None of these commits seems to be related to Tune.
@can-anyscale I see you have been doing some bisections. Do you know whether anything changed on the infra side before I dig into the application code?
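For a longer commit range, the same scan can be scripted. The following is a minimal sketch, not part of the original investigation: `commits_mentioning` is a hypothetical helper that takes lines in the `git log --oneline` format shown above and keeps those whose subject contains a keyword. It only does a case-insensitive substring match on the subject, so it would miss a commit that touches Tune code without saying so in its title.

```python
from typing import Iterable, List, Tuple


def commits_mentioning(log_lines: Iterable[str], keyword: str) -> List[Tuple[str, str]]:
    """Return (sha, subject) pairs whose subject contains `keyword`.

    Hypothetical helper: expects lines shaped like `git log --oneline`
    output, i.e. "<abbreviated sha> <subject>". Matching is a plain
    case-insensitive substring test on the subject only.
    """
    needle = keyword.lower()
    matches = []
    for line in log_lines:
        # Split the abbreviated sha from the rest of the line.
        sha, _, subject = line.strip().partition(" ")
        if sha and needle in subject.lower():
            matches.append((sha, subject))
    return matches
```

Feeding it the lines listed above with the keyword `"tune"` would be one way to double-check the manual read of the list.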
@xwjiang2010 - my bisect (https://buildkite.com/ray-project/release-tests-bisect/builds/37) points to fd6b99ac75, but when I rerun the test a couple of times on that commit, it sometimes passes. So my best guess is that the test is flaky (or failed due to a non-application error code).
Maybe try running the test on top of master again to see if it passes now.
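To illustrate why a flaky test can send a bisect to the wrong commit, here is a minimal sketch of a retry-aware bisection. It is not the release-test bisect tooling; `passes_reliably`, `first_bad_commit`, and the `run_test` callback are hypothetical names. It assumes the commits are ordered oldest first, that a real regression fails on every attempt, and that a flaky failure on a good commit passes on at least one of a few retries.

```python
from typing import Callable, Optional, Sequence


def passes_reliably(run_test: Callable[[str], bool], commit: str, attempts: int = 3) -> bool:
    """Treat a commit as good if any of `attempts` runs passes.

    Assumption: a genuine regression fails every time, so a single
    pass is enough to rule the commit out as the culprit.
    """
    return any(run_test(commit) for _ in range(attempts))


def first_bad_commit(
    commits: Sequence[str],
    run_test: Callable[[str], bool],
    attempts: int = 3,
) -> Optional[str]:
    """Binary-search `commits` (oldest first) for the first bad commit.

    Assumes every commit before the first bad one is good and every
    commit from it onward is bad. Returns None if all commits look good.
    """
    lo, hi = 0, len(commits)
    while lo < hi:
        mid = (lo + hi) // 2
        if passes_reliably(run_test, commits[mid], attempts):
            lo = mid + 1  # mid is good, the culprit is later
        else:
            hi = mid  # mid is bad, the culprit is mid or earlier
    return commits[lo] if lo < len(commits) else None
```

If every commit in the range passes on a retry, the function returns `None`, which would be consistent with the guess that the failure is flakiness rather than a regression in one of these commits.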
Actually @can-anyscale, let's pause investigation on this one for now.
There may already be a fix for this, as @justinvyu pointed out: https://github.com/ray-project/ray/pull/34263#pullrequestreview-1380215668
Let's see how it goes after landing this PR.
@xwjiang2010 sounds great!
The issue seems to be resolved after the PR was merged.
I linked the wrong one.
https://buildkite.com/ray-project/release-tests-branch/builds/1548#01877273-5d06-4bf2-92fd-6360e5918ca3