[air] release test `air_benchmark_xgboost_cpu_10` is failing

xwjiang2010 commented 1 year ago

https://buildkite.com/ray-project/release-tests-branch/builds/1569#01879143-c67d-4452-b53a-f233e3048feb

xwjiang2010 commented 1 year ago

RuntimeError: Batch prediction on XGBoost is taking 1243.001983792 seconds, which is longer than expected (450 seconds).

'prediction_time': 389.9208416849999

Prediction time is more than 3X longer (about 1243s vs. 390s).
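The two timings and the 450-second budget quoted above can be compared with a short calculation. This is only a sketch: `check_prediction_time` is a hypothetical helper written for illustration, and the actual release test may structure its check differently.

```python
# Hypothetical helper for illustration; not the release test's actual code.
EXPECTED_MAX_SECONDS = 450.0  # budget quoted in the RuntimeError above


def check_prediction_time(prediction_time: float,
                          limit: float = EXPECTED_MAX_SECONDS) -> None:
    """Raise if batch prediction exceeded the time budget."""
    if prediction_time > limit:
        raise RuntimeError(
            f"Batch prediction on XGBoost is taking {prediction_time} seconds, "
            f"which is longer than expected ({limit} seconds)."
        )


failing = 1243.001983792      # timing from the failing run
passing = 389.9208416849999   # timing from the passing run

# Ratio between the failing and passing runs.
slowdown = failing / passing
print(f"slowdown: {slowdown:.2f}x")

check_prediction_time(passing)  # within budget, does not raise
```

The passing run sits under the budget, while the failing run is roughly three times slower and well over it.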

xwjiang2010 commented 1 year ago

Running a bisect.

failing commit passing commit

xwjiang2010 commented 1 year ago

launching another one

xwjiang2010 commented 1 year ago

de1b16f8b5 [requirements] Add PyArrow to ray[tune] dependencies (#34397)  --> FAILED

d82aa574d9 [Dataset] Validate aggregation key in `Aggregate` LogicalOperator (#34292)
863928c4f1 [RLlib] DreamerV3: Catalog enhancements 04  - LSTM default models. (#34272)
26f391fa0b [CI][Bisect][1] Skeleton for automated bisect of release tests (#34329)
f7aa53c0fd [train] rename _base_dataset to _base_datastream (#34423)
cb59e59ac5 Add GCE variation for core release tests [3/n] (#34425)
bd8905b0ca Revert "[Metrics] Fix shared memory is not displayed properly  (#34301)" (#34407)
a333017dbf [RLlib] Add 2D box example for PPO RL Modules (#33840)
43920e2185 Remove python 3.6 support [1/n] (#34373)

0e0c150655 [Dataset] Reset row count when filtering on Dataset reading from Parquet (#34372)  --> FAIL

ffeedbf63e [data] Add usage tag for which block formats are used (#34384)  --> FAIL

d86624502b [Dataset] Validate sort key in `Sort` LogicalOperator (#34282)  --> SUCCESSFUL

6d69d79cfe [data] [streaming] [part 3/n] Rename Dataset => Datastream in internal files (#34340)
0100e64512 [Data] combine_chunks before chunking pyarrow.Table block into batches (#34352)

df1c744aed [air] DreamBooth example: Fix code for batch size > 1 (#34398)  --> SUCCESSFUL

b8bd720573 [tune] fix a typo in `tune/execution/checkpoint_manager` state serialization. (#34368)
1f7058ee40 add main for obod test (#34311)
242d7b4b96 [serve] Fix get endpoint when autoscaling config is set (#34377)
df53c234dc pull out shared deploy code into deploy utils (#34321)

4571f1c1a2 [RLlib] Check that results has learner info appo test (#34381)   --> SUCCESSFUL
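The narrowing above follows the usual bisect pattern: test a commit in the middle of the range, then discard the half that cannot contain the first failure. A minimal sketch, assuming the log is listed newest first and using only the six commits with a recorded result; `bisect_first_bad` and the `recorded` lookup are illustrative stand-ins for actually running the release test on each commit.

```python
from typing import Callable, Sequence


def bisect_first_bad(commits: Sequence[str],
                     is_bad: Callable[[str], bool]) -> str:
    """Return the first failing commit.

    `commits` is ordered oldest to newest; commits[0] is assumed to pass
    and commits[-1] is assumed to fail.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # first failure is at mid or earlier
        else:
            lo = mid  # first failure is after mid
    return commits[hi]


# Results recorded in the log above, reordered oldest to newest.
recorded = {
    "4571f1c1a2": "SUCCESSFUL",
    "df1c744aed": "SUCCESSFUL",
    "d86624502b": "SUCCESSFUL",
    "ffeedbf63e": "FAIL",
    "0e0c150655": "FAIL",
    "de1b16f8b5": "FAILED",
}
commits = list(recorded)

# Stand-in for running the release test on a commit.
first_bad = bisect_first_bad(commits, lambda c: recorded[c].startswith("FAIL"))
print(first_bad)
```

With these recorded results the search lands on ffeedbf63e, the first commit after the last passing one (d86624502b).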

xwjiang2010 commented 1 year ago

ffeedbf63e --> FAIL xgboost==1.3.3 xgboost-ray==0.1.15

d86624502b --> SUCCESS xgboost==1.3.3 xgboost-ray==0.1.15

suspecting https://github.com/ray-project/ray/pull/34384/files

xwjiang2010 commented 1 year ago

@ericl @c21 Could you take a look here?

The prediction part of this release test is seeing a huge regression in prediction time: from 390s to 1243s between the SUCCESS and FAIL runs. Bisecting points to https://github.com/ray-project/ray/pull/34384/. It looks like that PR only adds some metrics, which hopefully (?) should be lightweight.

xwjiang2010 commented 1 year ago

cc @krfricke for awareness as the next ml oncall.

can-anyscale commented 1 year ago

Just want to confirm that I ran another bisect after the fix (https://buildkite.com/ray-project/release-tests-bisect/builds/56#_), and it also pointed to ffeedbf63e as the blamed PR.

ericl commented 1 year ago

That's unfortunate. I think we should revert the PR. It's too bad we can't get the telemetry, but it's not critical.

c21 commented 1 year ago

I guess the overhead is coming from lock contention: https://github.com/ray-project/ray/blob/c9b6e6a39c44d6b19f4dcd8ba50151d46b2cc780/python/ray/data/_internal/usage.py#L66. I am considering two options: (1) go lock-free, recording only whether a block format is used rather than counting the number of uses; (2) reduce contention, e.g. with one lock per block format.

[air] release test `air_benchmark_xgboost_cpu_10` is failing #34509