ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Release][AIR] Inconsistent batch prediction performance is causing `air_benchmark_xgboost` to be flaky #32843

Closed justinvyu closed 1 year ago

justinvyu commented 1 year ago

This seems to be a performance regression related to spilling to disk?

(Screenshot of benchmark run results, captured 2023-02-24)

Timing out job (2/21/23): https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_nltqmbvtuhjbiuu3senydsbedp?pjd-section=last-log

Succeeding job (2/22/23): https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_8ur7wqygilmzhdfw5idsec888u

See the error logs from the timing-out 2/21/23 run:

MapBatches(ScoringWrapper), 72 actors (18 pending) [15 locality hits, 151 misses]:   4%|▍         | 15/363 [00:18<08:10,  1.41s/it](raylet, ip=10.0.1.152) Spilled 18209 MiB, 44 objects, write throughput 585 MiB/s.
(raylet, ip=10.0.37.9) Spilled 17894 MiB, 83 objects, write throughput 614 MiB/s.
(autoscaler +14m5s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +14m5s) Warning: The following resource request cannot be scheduled right now: {'CPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
(raylet, ip=10.0.25.71) Spilled 17620 MiB, 39 objects, write throughput 609 MiB/s.
MapBatches(ScoringWrapper), 160 actors (40 pending) [24 locality hits, 331 misses]:   6%|▌         | 21/363 [00:38<14:06,  2.47s/it](raylet, ip=10.0.1.152) Spilled 25700 MiB, 81 objects, write throughput 567 MiB/s.
(raylet, ip=10.0.42.130) Spilled 25511 MiB, 76 objects, write throughput 564 MiB/s.
(raylet, ip=10.0.21.76) Spilled 25047 MiB, 79 objects, write throughput 568 MiB/s.
(raylet, ip=10.0.27.247) Spilled 25425 MiB, 76 objects, write throughput 547 MiB/s.
MapBatches(ScoringWrapper), 160 actors (40 pending) [25 locality hits, 332 misses]:   6%|▌         | 22/363 [00:42<14:49,  2.61s/it](raylet, ip=10.0.25.71) Spilled 25515 MiB, 76 objects, write throughput 663 MiB/s.
(raylet, ip=10.0.32.231) Spilled 25841 MiB, 77 objects, write throughput 560 MiB/s.
(raylet, ip=10.0.42.130) Spilled 25514 MiB, 77 objects, write throughput 564 MiB/s.
(raylet, ip=10.0.32.231) Spilled 25847 MiB, 79 objects, write throughput 560 MiB/s.
MapBatches(ScoringWrapper), 160 actors [26 locality hits, 337 misses]:   7%|▋         | 26/363 [00:52<14:23,  2.56s/it]             (raylet, ip=10.0.21.98) Spilled 25180 MiB, 74 objects, write throughput 557 MiB/s.
(raylet, ip=10.0.27.247) Spilled 25428 MiB, 77 objects, write throughput 547 MiB/s.
(raylet, ip=10.0.21.98) Spilled 25183 MiB, 75 objects, write throughput 557 MiB/s.
(raylet, ip=10.0.21.76) Spilled 25460 MiB, 83 objects, write throughput 569 MiB/s.
MapBatches(ScoringWrapper), 160 actors [26 locality hits, 337 misses]:   9%|▊         | 31/363 [01:14<18:28,  3.34s/it](raylet, ip=10.0.58.195) Spilled 25384 MiB, 81 objects, write throughput 579 MiB/s.
(raylet, ip=10.0.58.195) Spilled 25402 MiB, 88 objects, write throughput 579 MiB/s.
MapBatches(ScoringWrapper), 102 actors [26 locality hits, 337 misses]:  26%|██▌       | 93/363 [02:00<02:10,  2.07it/s](raylet, ip=10.0.28.63) Spilled 24912 MiB, 93 objects, write throughput 566 MiB/s.
MapBatches(ScoringWrapper), 81 actors [26 locality hits, 337 misses]:  32%|███▏      | 115/363 [02:03<00:42,  5.80it/s](raylet, ip=10.0.28.63) Spilled 24925 MiB, 99 objects, write throughput 566 MiB/s.
MapBatches(ScoringWrapper), 0 actors [26 locality hits, 337 misses]: 100%|██████████| 363/363 [02:10<00:00,  2.78it/s] 
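To quantify how much each node is spilling across a run, the raylet lines above can be aggregated with a small parser. This is an illustrative sketch, not part of the benchmark harness; the regex and function name are assumptions based on the log format shown:

```python
import re
from collections import defaultdict

# Matches raylet spill messages such as:
# (raylet, ip=10.0.1.152) Spilled 18209 MiB, 44 objects, write throughput 585 MiB/s.
SPILL_RE = re.compile(
    r"\(raylet, ip=(?P<ip>[\d.]+)\) Spilled (?P<mib>\d+) MiB, "
    r"(?P<objects>\d+) objects, write throughput (?P<tput>\d+) MiB/s"
)

def max_spilled_per_node(log_lines):
    """Return the peak cumulative MiB spilled reported by each raylet.

    Spill counters in these messages are cumulative per node, so the
    maximum observed value is the total spilled by that raylet.
    """
    peak = defaultdict(int)
    for line in log_lines:
        m = SPILL_RE.search(line)
        if m:
            ip = m.group("ip")
            peak[ip] = max(peak[ip], int(m.group("mib")))
    return dict(peak)

lines = [
    "(raylet, ip=10.0.1.152) Spilled 18209 MiB, 44 objects, write throughput 585 MiB/s.",
    "(raylet, ip=10.0.1.152) Spilled 25700 MiB, 81 objects, write throughput 567 MiB/s.",
]
print(max_spilled_per_node(lines))  # {'10.0.1.152': 25700}
```

Comparing these per-node totals between the timing-out and succeeding runs would show whether the slowdown tracks spill volume.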
run_xgboost_prediction takes 1544.04337004 seconds.
Results: {'training_time': 778.8808449190001, 'prediction_time': 1544.04337004}
Traceback (most recent call last):
  File "workloads/xgboost_benchmark.py", line 179, in <module>
    main(args)
  File "workloads/xgboost_benchmark.py", line 159, in main
    f"Batch prediction on XGBoost is taking {prediction_time} seconds, "
RuntimeError: Batch prediction on XGBoost is taking 1544.04337004 seconds, which is longer than expected (450 seconds).
Subprocess return code: 1
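The traceback corresponds to a simple wall-clock guard in `xgboost_benchmark.py`: the run fails if batch prediction exceeds a fixed budget. A minimal sketch of that check, assuming the 450-second threshold from the error message (the function name and signature here are illustrative, not the script's actual API):

```python
import time

# Threshold taken from the RuntimeError in the traceback above.
PREDICTION_TIME_LIMIT_S = 450

def run_with_guard(predict_fn, limit_s=PREDICTION_TIME_LIMIT_S):
    """Time a batch-prediction callable and raise if it runs too long."""
    start = time.monotonic()
    predict_fn()
    prediction_time = time.monotonic() - start
    if prediction_time > limit_s:
        raise RuntimeError(
            f"Batch prediction on XGBoost is taking {prediction_time} seconds, "
            f"which is longer than expected ({limit_s} seconds)."
        )
    return prediction_time
```

With this shape, the 1544 s run above would raise against the 450 s limit, while the ~480 s failure mentioned below only barely misses it, which is consistent with the test being flaky rather than uniformly broken.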

The failing run:

Past failures

This test has failed in the past for similar reasons (batch prediction taking a long time). Was that ever resolved?

(Screenshot of past failure history, captured 2023-02-24)

Buildkite for one of these past failures: https://buildkite.com/ray-project/release-tests-branch/builds/1314#0186042a-ebf4-4f7c-b5c0-ddc950c23766

justinvyu commented 1 year ago

Another failure (480s > 450s): https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_pw8rwtedwq9xwnisjqgzpbxx2x?pjd-section=last-log

justinvyu commented 1 year ago

@bveeramani I think this is still failing. Can you take a look?

amogkam commented 1 year ago

The last 4 runs have been successful.