ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Release][AIR] Inconsistent batch prediction performance is causing `air_benchmark_xgboost` to be flaky #32843

Closed justinvyu closed 1 year ago

justinvyu commented 1 year ago

This seems to be a performance regression related to spilling to disk?

(Screenshot of benchmark run results, captured 2023-02-24)

Timing out job (2/21/23): https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_nltqmbvtuhjbiuu3senydsbedp?pjd-section=last-log

Succeeding job (2/22/23): https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_8ur7wqygilmzhdfw5idsec888u

See the error logs from the timing-out 2/21/23 run:

MapBatches(ScoringWrapper), 72 actors (18 pending) [15 locality hits, 151 misses]:   4%|▍         | 15/363 [00:18<08:10,  1.41s/it](raylet, ip=10.0.1.152) Spilled 18209 MiB, 44 objects, write throughput 585 MiB/s.
(raylet, ip=10.0.37.9) Spilled 17894 MiB, 83 objects, write throughput 614 MiB/s.
(autoscaler +14m5s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +14m5s) Warning: The following resource request cannot be scheduled right now: {'CPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
(raylet, ip=10.0.25.71) Spilled 17620 MiB, 39 objects, write throughput 609 MiB/s.
MapBatches(ScoringWrapper), 160 actors (40 pending) [24 locality hits, 331 misses]:   6%|▌         | 21/363 [00:38<14:06,  2.47s/it](raylet, ip=10.0.1.152) Spilled 25700 MiB, 81 objects, write throughput 567 MiB/s.
(raylet, ip=10.0.42.130) Spilled 25511 MiB, 76 objects, write throughput 564 MiB/s.
(raylet, ip=10.0.21.76) Spilled 25047 MiB, 79 objects, write throughput 568 MiB/s.
(raylet, ip=10.0.27.247) Spilled 25425 MiB, 76 objects, write throughput 547 MiB/s.
MapBatches(ScoringWrapper), 160 actors (40 pending) [25 locality hits, 332 misses]:   6%|▌         | 22/363 [00:42<14:49,  2.61s/it](raylet, ip=10.0.25.71) Spilled 25515 MiB, 76 objects, write throughput 663 MiB/s.
(raylet, ip=10.0.32.231) Spilled 25841 MiB, 77 objects, write throughput 560 MiB/s.
(raylet, ip=10.0.42.130) Spilled 25514 MiB, 77 objects, write throughput 564 MiB/s.
(raylet, ip=10.0.32.231) Spilled 25847 MiB, 79 objects, write throughput 560 MiB/s.
MapBatches(ScoringWrapper), 160 actors [26 locality hits, 337 misses]:   7%|▋         | 26/363 [00:52<14:23,  2.56s/it]             (raylet, ip=10.0.21.98) Spilled 25180 MiB, 74 objects, write throughput 557 MiB/s.
(raylet, ip=10.0.27.247) Spilled 25428 MiB, 77 objects, write throughput 547 MiB/s.
(raylet, ip=10.0.21.98) Spilled 25183 MiB, 75 objects, write throughput 557 MiB/s.
(raylet, ip=10.0.21.76) Spilled 25460 MiB, 83 objects, write throughput 569 MiB/s.
MapBatches(ScoringWrapper), 160 actors [26 locality hits, 337 misses]:   9%|▊         | 31/363 [01:14<18:28,  3.34s/it](raylet, ip=10.0.58.195) Spilled 25384 MiB, 81 objects, write throughput 579 MiB/s.
(raylet, ip=10.0.58.195) Spilled 25402 MiB, 88 objects, write throughput 579 MiB/s.
MapBatches(ScoringWrapper), 102 actors [26 locality hits, 337 misses]:  26%|██▌       | 93/363 [02:00<02:10,  2.07it/s](raylet, ip=10.0.28.63) Spilled 24912 MiB, 93 objects, write throughput 566 MiB/s.
MapBatches(ScoringWrapper), 81 actors [26 locality hits, 337 misses]:  32%|███▏      | 115/363 [02:03<00:42,  5.80it/s](raylet, ip=10.0.28.63) Spilled 24925 MiB, 99 objects, write throughput 566 MiB/s.
MapBatches(ScoringWrapper), 0 actors [26 locality hits, 337 misses]: 100%|██████████| 363/363 [02:10<00:00,  2.78it/s] 
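To quantify how much each node is spilling across a run, the raylet lines above can be aggregated with a small parser. This is an illustrative sketch, not part of the benchmark harness; the regex and function name are assumptions based on the log format shown:

```python
import re
from collections import defaultdict

# Matches raylet spill messages such as:
# (raylet, ip=10.0.1.152) Spilled 18209 MiB, 44 objects, write throughput 585 MiB/s.
SPILL_RE = re.compile(
    r"\(raylet, ip=(?P<ip>[\d.]+)\) Spilled (?P<mib>\d+) MiB, "
    r"(?P<objects>\d+) objects, write throughput (?P<tput>\d+) MiB/s"
)

def max_spilled_per_node(log_lines):
    """Return the peak cumulative MiB spilled reported by each raylet.

    Spill counters in these messages are cumulative per node, so the
    maximum observed value is the total spilled by that raylet.
    """
    peak = defaultdict(int)
    for line in log_lines:
        m = SPILL_RE.search(line)
        if m:
            ip = m.group("ip")
            peak[ip] = max(peak[ip], int(m.group("mib")))
    return dict(peak)

lines = [
    "(raylet, ip=10.0.1.152) Spilled 18209 MiB, 44 objects, write throughput 585 MiB/s.",
    "(raylet, ip=10.0.1.152) Spilled 25700 MiB, 81 objects, write throughput 567 MiB/s.",
]
print(max_spilled_per_node(lines))  # {'10.0.1.152': 25700}
```

Comparing these per-node totals between the timing-out and succeeding runs would show whether the slowdown tracks spill volume.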
run_xgboost_prediction takes 1544.04337004 seconds.
Results: {'training_time': 778.8808449190001, 'prediction_time': 1544.04337004}
Traceback (most recent call last):
  File "workloads/xgboost_benchmark.py", line 179, in <module>
    main(args)
  File "workloads/xgboost_benchmark.py", line 159, in main
    f"Batch prediction on XGBoost is taking {prediction_time} seconds, "
RuntimeError: Batch prediction on XGBoost is taking 1544.04337004 seconds, which is longer than expected (450 seconds).
Subprocess return code: 1
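The traceback corresponds to a simple wall-clock guard in `xgboost_benchmark.py`: the run fails if batch prediction exceeds a fixed budget. A minimal sketch of that check, assuming the 450-second threshold from the error message (the function name and signature here are illustrative, not the script's actual API):

```python
import time

# Threshold taken from the RuntimeError in the traceback above.
PREDICTION_TIME_LIMIT_S = 450

def run_with_guard(predict_fn, limit_s=PREDICTION_TIME_LIMIT_S):
    """Time a batch-prediction callable and raise if it runs too long."""
    start = time.monotonic()
    predict_fn()
    prediction_time = time.monotonic() - start
    if prediction_time > limit_s:
        raise RuntimeError(
            f"Batch prediction on XGBoost is taking {prediction_time} seconds, "
            f"which is longer than expected ({limit_s} seconds)."
        )
    return prediction_time
```

With this shape, the 1544 s run above would raise against the 450 s limit, while the ~480 s failure mentioned below only barely misses it, which is consistent with the test being flaky rather than uniformly broken.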

The failing run:

Past failures

This test has failed in the past for similar reasons (batch prediction taking a long time). Was that ever resolved?

(Screenshot of past failure history, captured 2023-02-24)

Buildkite for one of these past failures: https://buildkite.com/ray-project/release-tests-branch/builds/1314#0186042a-ebf4-4f7c-b5c0-ddc950c23766

justinvyu commented 1 year ago

Another failure (480s > 450s): https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_pw8rwtedwq9xwnisjqgzpbxx2x?pjd-section=last-log

justinvyu commented 1 year ago

@bveeramani I think this is still failing. Can you take a look?

amogkam commented 1 year ago

The last 4 runs have been successful.