Closed by xwjiang2010 1 year ago
RuntimeError: Batch prediction on XGBoost is taking 1243.001983792 seconds, which is longer than expected (450 seconds).
'prediction_time': 389.9208416849999
Prediction time is roughly 3X longer (1243s vs. 390s).
de1b16f8b5 [requirements] Add PyArrow to ray[tune] dependencies (#34397) --> FAILED
d82aa574d9 [Dataset] Validate aggregation key in `Aggregate` LogicalOperator (#34292)
863928c4f1 [RLlib] DreamerV3: Catalog enhancements 04 - LSTM default models. (#34272)
26f391fa0b [CI][Bisect][1] Skeleton for automated bisect of release tests (#34329)
f7aa53c0fd [train] rename _base_dataset to _base_datastream (#34423)
cb59e59ac5 Add GCE variation for core release tests [3/n] (#34425)
bd8905b0ca Revert "[Metrics] Fix shared memory is not displayed properly (#34301)" (#34407)
a333017dbf [RLlib] Add 2D box example for PPO RL Modules (#33840)
43920e2185 Remove python 3.6 support [1/n] (#34373)
0e0c150655 [Dataset] Reset row count when filtering on Dataset reading from Parquet (#34372) --> FAIL
ffeedbf63e [data] Add usage tag for which block formats are used (#34384). --> FAIL
d86624502b [Dataset] Validate sort key in `Sort` LogicalOperator (#34282) --> SUCCESSFUL
6d69d79cfe [data] [streaming] [part 3/n] Rename Dataset => Datastream in internal files (#34340)
0100e64512 [Data] combine_chunks before chunking pyarrow.Table block into batches (#34352)
df1c744aed [air] DreamBooth example: Fix code for batch size > 1 (#34398) --> SUCCESSFUL
b8bd720573 [tune] fix a typo in `tune/execution/checkpoint_manager` state serialization. (#34368)
1f7058ee40 add main for obod test (#34311)
242d7b4b96 [serve] Fix get endpoint when autoscaling config is set (#34377)
df53c234dc pull out shared deploy code into deploy utils (#34321)
4571f1c1a2 [RLlib] Check that results has learner info appo test (#34381) --> SUCCESSFUL
ffeedbf63e --> FAIL xgboost==1.3.3 xgboost-ray==0.1.15
d86624502b --> SUCCESS xgboost==1.3.3 xgboost-ray==0.1.15
Suspecting https://github.com/ray-project/ray/pull/34384/files
@ericl @c21 Could you take a look here?
The prediction part of this release test is seeing a huge regression in prediction time: from 390s to 1243s between the SUCCESS and FAIL commits. Bisecting points to https://github.com/ray-project/ray/pull/34384/. It looks like the PR only adds some metrics, which hopefully (?) should be lightweight.
cc @krfricke for awareness as the next ml oncall.
I also want to confirm that I ran another bisect after the fix (https://buildkite.com/ray-project/release-tests-bisect/builds/56#_), and it also pointed to ffeedbf63e as the culprit.
That's unfortunate. I think we should revert the PR. It's too bad we can't get the telemetry, but it's not critical.
I guess the overhead is coming from lock contention: https://github.com/ray-project/ray/blob/c9b6e6a39c44d6b19f4dcd8ba50151d46b2cc780/python/ray/data/_internal/usage.py#L66. I am considering two options: (1) lock-free: only record whether a block format is used, without counting the number of uses; (2) reduced contention: one lock per block format, etc.
https://buildkite.com/ray-project/release-tests-branch/builds/1569#01879143-c67d-4452-b53a-f233e3048feb