Release test dataset_shuffle_push_based_random_shuffle_100tb.aws failed

ray-project / ray



Release test dataset_shuffle_push_based_random_shuffle_100tb.aws failed #39527

Closed can-anyscale closed 10 months ago

can-anyscale commented 1 year ago

Release test dataset_shuffle_push_based_random_shuffle_100tb.aws failed. See https://buildkite.com/ray-project/release-tests-branch/builds/2148#018a7975-874a-4162-b13f-d6d28a01da77 for more details.

Managed by OSS Test Policy

can-anyscale commented 1 year ago

Test has been failing for far too long. Jailing.

can-anyscale commented 1 year ago

Closing to see if it is fixed.

can-anyscale commented 1 year ago

Re-opening issue as test is still failing. Latest run: https://buildkite.com/ray-project/release-tests-branch/builds/2204#018ae702-70b6-4d83-b444-cb40c2b7c832

can-anyscale commented 1 year ago

Just FYI @scottjlee, this test has been failing for more than a month. I think @bveeramani and I previously tried to bisect, but we failed (the test requires a number of machines that makes it hard to run).

scottjlee commented 1 year ago

yeah i have been slowly bisecting this for the past few weeks as well, but it's slow going (due to the reasons you mentioned), and also got inundated with a bunch of other oncall requests. Still looking into it, thanks for bumping

stephanie-wang commented 1 year ago

@can-anyscale is it okay to close this to try running the test again? Looks like it passed 10/1-10/6. I'd like to get the metrics on memory usage per node on a fresh run too.

scottjlee commented 1 year ago

wait, this test runs weekly right? are there manual runs for those dates where the 100TB test succeeds? @stephanie-wang

can-anyscale commented 1 year ago

@stephanie-wang of course; you can also run the test on-demand by creating a new build in https://buildkite.com/ray-project/release-tests-branch/builds?branch=master, without having to wait for the nightly run

stephanie-wang commented 1 year ago

> wait, this test runs weekly right? are there manual runs for those dates where the 100TB test succeeds? @stephanie-wang

Hmm not sure how the tests were triggered, but I see them here.

stephanie-wang commented 1 year ago

The root cause is probably excessive task OOM failures, but not sure if that's a regression in Data, regression in some other cluster env, or just an inherent issue with the test setup. Seems we've had worker OOMs in much older runs too, so I don't think it's a release blocker for 2.8.

Since the root cause is excessive task OOM, there are a few options to handle for 2.9:

- There might be a memory leak regression in the data task. This is the most important thing to look into.
- Have a smarter way to prevent task OOM failures (this has been discussed in the past but never prioritized), i.e. wait to retry OOM-killed tasks / adjust Data scheduling policy to account for memory usage.
- Update the test to ensure we don't OOM (by adjusting instance type, number of partitions, etc.). Recommend we do this anyway so that the test is more stable in the future.
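
For the last option, a rough back-of-the-envelope sketch of how the partition count relates to per-task memory might help. This is not taken from the test script: the function name, the headroom fraction, the expansion factor, and the example node size and concurrency are all hypothetical values chosen for illustration.

```python
import math

TB = 10**12
GB = 10**9


def min_partitions(dataset_bytes, node_memory_bytes, tasks_per_node,
                   headroom=0.5, expansion=2.0):
    """Estimate the smallest partition count that keeps each task in budget.

    All parameters are illustrative assumptions, not values from the test:
      headroom  - fraction of node memory that tasks are allowed to use
      expansion - in-memory footprint of a partition relative to its size
    """
    if min(dataset_bytes, node_memory_bytes, tasks_per_node) <= 0:
        raise ValueError("sizes and task count must be positive")
    # Memory available to each concurrently running task on a node.
    per_task_budget = node_memory_bytes * headroom / tasks_per_node
    # Largest partition whose expanded footprint still fits that budget.
    max_partition_bytes = per_task_budget / expansion
    return math.ceil(dataset_bytes / max_partition_bytes)


# Hypothetical cluster shape: 256 GB nodes running 16 tasks each.
print(min_partitions(100 * TB, 256 * GB, 16))
```

Under these assumed numbers, doubling the tasks per node halves the per-task budget and doubles the required partition count, which is the kind of trade-off that adjusting instance type or partition count would be tuning.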

can-anyscale commented 1 year ago

I marked the test as unstable so it won't create a release blocker for the 2.8 release: https://github.com/ray-project/ray/pull/40437

anyscalesam commented 1 year ago

@stephanie-wang are you still planning on investigating further and implementing some/all of the options above for ray29?

stephanie-wang commented 1 year ago

This issue may have been fixed already since we had some failures in related tests. Let me trigger the test now, and will try to investigate if it's still an issue.

anyscalesam commented 1 year ago

@stephanie-wang did the test pass?

can-anyscale commented 1 year ago

let's close to run the test in this weekly run; the issue will re-open itself if the test doesn't pass

can-anyscale commented 1 year ago

Re-opening issue as test is still failing. Latest run: https://buildkite.com/ray-project/release/builds/1661#018be237-2f08-41eb-b5c6-043d94533e36

stephanie-wang commented 12 months ago

Ah thanks. It looks like there is a real memory regression in Data but seems it was also failing in 2.8. I think we can ignore for now and address as p0 for 2.10.

can-anyscale commented 10 months ago

Re-opening issue as test is still failing. Latest run: https://buildkite.com/ray-project/release/builds/5289#018cde80-6077-4406-a258-14dd970f341e

stephanie-wang commented 10 months ago

> Re-opening issue as test is still failing. Latest run: https://buildkite.com/ray-project/release/builds/5289#018cde80-6077-4406-a258-14dd970f341e

Ah sorry, introduced a bug in the test script...

can-anyscale commented 10 months ago

Test passed on latest run: https://buildkite.com/ray-project/release/builds/5909#018d0243-4275-4ac6-8203-7bfc058255f5