Closed: bveeramani closed this issue 1 year ago
It's failing in master as well - likely a dependency issue?
@bveeramani @rickyyx can we try the same test now without https://github.com/ray-project/ray/pull/37132, to verify if it is indeed a dependency issue?
Yes - I had a PR that reverts #37132 on the release branch, but the test is still failing: https://buildkite.com/ray-project/oss-ci-build-pr/builds/28487#01894c4a-3099-4138-86fa-055eac5a3a3f
@woshiyyya Let me know if you need me for anything.
So, the failed job:
Previous passing job:
This is probably the one ^
Good find @rickyyx
@matthewdeng @woshiyyya do you think we should pin pytorch-lightning?
Thank you @rickyyx and @zhe-thoughts. I've checked the changelog of pytorch-lightning 2.0.5: it indeed added some checks on the DeepSpeed configuration, which caused this issue. https://lightning.ai/docs/pytorch/stable/generated/CHANGELOG.html
The quick fix would be pinning the library version. But I'm still thinking about a long-term solution so that we can fix the root cause.
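For reference, a pin like this in the test requirements would hold the library below the release that introduced the stricter DeepSpeed config checks (the exact file and specifier here are illustrative, not taken from the fix PR):

```
# Illustrative requirements pin: stay below pytorch-lightning 2.0.5,
# the release whose new DeepSpeed configuration checks broke the test.
pytorch-lightning<2.0.5
```

This is only the short-term mitigation discussed above; the root cause would still need a code fix.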
I drafted a fix PR. Waiting for the CI to pass. https://github.com/ray-project/ray/pull/37387
CI is passing on the PR. Shall we merge?
python/ray/train/tests/test_lightning_deepspeed.py::test_deepspeed_stages[True-3]
has been consistently failing on the release branch (releases/2.6.0) since https://github.com/ray-project/ray/pull/37132 was cherry-picked: https://buildkite.com/ray-project/oss-ci-build-branch/builds/4892#0189410c-4cd8-499f-80f8-01050df7e1bf
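For anyone puzzling over the `[True-3]` suffix in the failing test name: pytest derives it from the test's parametrization, joining the string form of each parameter value with `-`. A minimal sketch of that ID construction (the parameter meanings, an offload flag and a DeepSpeed ZeRO stage, are guesses, not taken from the actual Ray test):

```python
# Sketch of how pytest builds the bracketed ID "[True-3]" for a
# parametrized test: simple values (bools, ints) are stringified and
# joined with "-" in parameter order.
def pytest_style_id(*params):
    """Mimic pytest's default ID for simple parameter values."""
    return "-".join(str(p) for p in params)


# test_deepspeed_stages[True-3] would correspond to parameters (True, 3),
# e.g. offloading enabled and ZeRO stage 3 in the real test.
print(pytest_style_id(True, 3))  # -> True-3
```

So `[True-3]` pinpoints a single parameter combination; the other combinations of the same test may still pass.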