ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.25k stars 5.62k forks source link

[ci] Rebuild dashboard client on PR so that build errors could be caught on PR early. #33638

Open rickyyx opened 1 year ago

rickyyx commented 1 year ago

What happened + What you expected to happen

When PR https://github.com/ray-project/ray/pull/33555 was merged, it broke the master build by erroring on the bootstrap jobs, preventing all subsequent tests to run. See issue https://github.com/ray-project/ray/issues/33634

However, the PR had successfully boostrapped job run: https://buildkite.com/ray-project/oss-ci-build-pr/builds/15834#_ . We should have caught this before the PR was merged into the master.

Versions / Dependencies

NA

Reproduction script

NA

Issue Severity

None

rickyyx commented 1 year ago

The current hypothesis is that on the PR build, the incorrect dashboard client code was cached in images built by previous commits, and thus not erroring on later commits when the PR was merged.

rickyyx commented 1 year ago

cc @can-anyscale as well.

rickyyx commented 1 year ago

Oh actually, I think the issue was caused due to concurrent commits: https://github.com/ray-project/ray/commit/29e72de8afdd1f6feebb34de40dca242b7f1b8a6 added the test which required the PR #33555 to make some updates, and since the PR #33555 did not have the change while merging, it thus broke the build after merged.

I guess this is hard to prevent unless we have some additional checks on merge time.

ollie-iterators commented 1 year ago

I think that one of the issues with merging pull requests causing issues with running the build is that since there are lots of flakey tests (593 shown in https://flakey-tests.ray.io/ right now), that it makes it easier to sometimes ignore when the red X shows up for a commit and assume that it is ready to be merged.

ollie-iterators commented 1 year ago

I think that a good way to fix this for the future would be to merge https://github.com/ray-project/ray/pull/29622, which is about turning on more linting for the code. Another important PR is https://github.com/ray-project/ray/pull/31219 which is about causing OSS CI to fail on pytest warnings.

ollie-iterators commented 1 year ago

Another good example for the need for merge checks:

This screen capture was taken from the PR https://github.com/ray-project/ray/pull/32808. It was from a "merge recent changes into ths PR" commit.

Screenshot 2023-03-24 at 4 53 48 PM

The error for the code says that: "Cinit() takes 21 positional arguments (19 given)"

ollie-iterators commented 1 year ago

It appears that https://github.com/ray-project/ray/pull/34220 causes the number of flakey tests to go up to 1,147