ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
34.14k stars 5.8k forks

Release test rllib_learning_tests_ppo_new_api_stack_torch.aws failed #43720

Closed can-anyscale closed 2 months ago

can-anyscale commented 8 months ago

Release test rllib_learning_tests_ppo_new_api_stack_torch.aws failed. See https://buildkite.com/ray-project/release/builds/10431#018e0e4a-bed2-4fac-81d4-911f40a2d57b for more details.

Managed by OSS Test Policy

can-anyscale commented 8 months ago

Test has been failing for far too long. Jailing.

anyscalesam commented 8 months ago

@sven1977 any update on this? We need confirmation on whether or not this is a release blocker for Ray 2.10.

sven1977 commented 8 months ago

Sorry for the delay, looking into it rn ...

sven1977 commented 8 months ago

Ok, seems to be a wrong machine config on our end: 4 GPUs where 8 are needed. Will fix asap ...

sven1977 commented 8 months ago

Closing this issue. Fixed by https://github.com/ray-project/ray/pull/44001

can-anyscale commented 8 months ago

Re-opening issue as test is still failing. Latest run: https://buildkite.com/ray-project/release/builds/11364#018e40a9-0d08-4a0d-a61f-727e7786ca6a

khluu commented 8 months ago

@sven1977 This test still failed on the releases/2.10.0 branch even after the fix was cherry-picked in.

khluu commented 8 months ago

@sven1977 confirmed the test works when kicked off manually

can-anyscale commented 8 months ago

Re-opening issue as test is still failing. Latest run: https://buildkite.com/ray-project/release/builds/11543#018e5541-f276-4b5c-8b85-e42a93201d0e

can-anyscale commented 8 months ago

https://github.com/ray-project/ray/pull/44116 to mark this test as unstable

can-anyscale commented 8 months ago

This was not a blocker for 2.10.

sven1977 commented 2 months ago

Closing. This CI test has been stabilized