can-anyscale closed this issue 10 months ago
Test has been failing for far too long. Jailing.
Closing to see if it is fixed.
Re-opening issue as test is still failing. Latest run: https://buildkite.com/ray-project/release-tests-branch/builds/2204#018ae702-70b6-4d83-b444-cb40c2b7c832
Just FYI @scottjlee, this test has been failing for more than a month. I think @bveeramani and I previously tried to bisect but failed (the test requires a number of machines that makes it hard to run).
yeah, I have been slowly bisecting this for the past few weeks as well, but it's slow going (due to the reasons you mentioned), and I also got inundated with a bunch of other oncall requests. Still looking into it, thanks for bumping
@can-anyscale is it okay to close this to try running the test again? Looks like it passed 10/1-10/6. I'd like to get the metrics on memory usage per node on a fresh run too.
wait, this test runs weekly right? are there manual runs for those dates where the 100TB test succeeds? @stephanie-wang
@stephanie-wang of course; you can also run the test on-demand by creating a new build in https://buildkite.com/ray-project/release-tests-branch/builds?branch=master, without having to wait for the nightly run
Hmm not sure how the tests were triggered, but I see them here.
The root cause is probably excessive task OOM failures, but not sure if that's a regression in Data, regression in some other cluster env, or just an inherent issue with the test setup. Seems we've had worker OOMs in much older runs too, so I don't think it's a release blocker for 2.8.
Since the root cause is excessive task OOM, there are a few options to handle for 2.9:
I'm marking the test as unstable so it won't create a release blocker for the 2.8 release: https://github.com/ray-project/ray/pull/40437
@stephanie-wang are you still planning on investigating further and implementing some/all of the options above for ray29?
This issue may have been fixed already since we had some failures in related tests. Let me trigger the test now, and will try to investigate if it's still an issue.
@stephanie-wang did the test pass?
let's close to run the test in this weekly run; the issue will re-open itself if the test doesn't pass
Re-opening issue as test is still failing. Latest run: https://buildkite.com/ray-project/release/builds/1661#018be237-2f08-41eb-b5c6-043d94533e36
Ah thanks. It looks like there is a real memory regression in Data, but it seems the test was also failing in 2.8. I think we can ignore it for now and address it as a p0 for 2.10.
Re-opening issue as test is still failing. Latest run: https://buildkite.com/ray-project/release/builds/5289#018cde80-6077-4406-a258-14dd970f341e
Ah sorry, introduced a bug in the test script...
Test passed on latest run: https://buildkite.com/ray-project/release/builds/5909#018d0243-4275-4ac6-8203-7bfc058255f5
Release test dataset_shuffle_push_based_random_shuffle_100tb.aws failed. See https://buildkite.com/ray-project/release-tests-branch/builds/2148#018a7975-874a-4162-b13f-d6d28a01da77 for more details.
Managed by OSS Test Policy