
[AIR] Significant data reading regression in Ray cluster from xgboost 100GB test #26995

Closed jiaodong closed 2 years ago

jiaodong commented 2 years ago

What happened + What you expected to happen

Issues observed

1. We're no longer able to evenly distribute read tasks across the cluster; significant skew leads to an imbalanced memory pattern.
2. Significantly increased memory usage on the head node, which easily leads to OOM.

The regression happened between Ray commits 8ecd928c34db0b23e4aa2a4ea0c8cff25c37b413 and aadd82dcbd6bb0a8083550ef3edf39c98bf08ce0, roughly within the past 3 days.

Good nightly release run

Link to good nightly release run: https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_mWECugke9RzMh79BZQqeykjN/clusters/ses_r2ZVKt7AzSsY1seXN5hqene4?command-history-section=command_history

Logs: https://gist.github.com/jiaodong/6fd5728e35b23e6f78c4c6049b754d09

Pip freeze: https://gist.github.com/jiaodong/7d84cc4a73eb80d2a7b40dc31b507138

Bad nightly release run

Link to bad nightly release run: https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_mWECugke9RzMh79BZQqeykjN/clusters/ses_egWTLrS2PYDUJsQYzeKjiqzP?command-history-section=head_start_up_log

Logs: https://gist.github.com/jiaodong/73d716ea47a9319aa11e629719a1d735

Pip freeze: https://gist.github.com/jiaodong/8ea91f79b7af6602479a970b751a8679

Good run memory usage metrics

Screen Shot 2022-07-25 at 5 15 02 PM

Bad run memory usage metrics

Screen Shot 2022-07-25 at 5 15 43 PM

Corresponding Ray dashboard stats

Good run ray dashboard

Screen Shot 2022-07-25 at 5 14 39 PM

Bad run ray dashboard

Screen Shot 2022-07-25 at 5 15 38 PM

Versions / Dependencies

On master

Reproduction script

Re-run xgboost_benchmark.py

https://sourcegraph.com/github.com/ray-project/ray/-/blob/release/air_tests/air_benchmarks/workloads/xgboost_benchmark.py

with 100GB of data on a 10-node Ray cluster, using the cluster config below (a minimal sketch of the read path follows the config).

max_workers: 9

head_node_type:
  name: head_node
  instance_type: m5.4xlarge

worker_node_types:
- name: worker_node
  instance_type: m5.4xlarge
  max_workers: 9
  min_workers: 9
  use_spot: false

aws:
  BlockDeviceMappings:
  - DeviceName: /dev/sda1
    Ebs:
      Iops: 5000
      Throughput: 1000
      VolumeSize: 1000
      VolumeType: gp3
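
For quick local context, a minimal sketch (not the benchmark script itself) of the data-loading step this workload exercises; the S3 path is a placeholder, not the actual benchmark location:

import ray

# Placeholder path: the real benchmark reads ~100GB of Parquet across the cluster.
ray.init(address="auto")
ds = ray.data.read_parquet("s3://<bucket>/xgboost-100GB/")
print(ds.count(), "rows,", ds.size_bytes(), "bytes")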

Issue Severity

High: It blocks me from completing my task.

jiaodong commented 2 years ago

I ran an experiment pinning pip3 install -U Werkzeug==2.1.2 on the latest run, in case Werkzeug==2.2.0 broke it, but the same issue remains.

Deps diff from pip freeze

Good                              Bad
importlib-resources==5.8.0        importlib-resources==5.9.0
jax==0.3.14                       jax==0.3.15
jaxlib==0.3.14                    jaxlib==0.3.15
lz4==4.0.1                        lz4==4.0.2
regex==2022.7.9                   regex==2022.7.24
torchmetrics==0.9.2               torchmetrics==0.9.3

(good) ray @ https://s3-us-west-2.amazonaws.com/ray-wheels/master/aadd82dcbd6bb0a8083550ef3edf39c98bf08ce0/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl

(bad) ray @ https://s3-us-west-2.amazonaws.com/ray-wheels/master/8ecd928c34db0b23e4aa2a4ea0c8cff25c37b413/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl

xwjiang2010 commented 2 years ago

The last working run was last Saturday at 3pm. The next commit after the last working commit: https://github.com/ray-project/ray/commit/da9581b7465c5e1e4903595b4107ed1fa601920f

Range: from da9581b7465c5e1e4903595b4107ed1fa601920f to 8ecd928c34db0b23e4aa2a4ea0c8cff25c37b413, both inclusive.

scv119 commented 2 years ago

git log --oneline da9581b7465c5e1e4903595b4107ed1fa601920f..8ecd928c34db0b23e4aa2a4ea0c8cff25c37b413

8ecd928c34 [Serve] Make the checkpoint and recover only from GCS (#26753)
193e824bc1 [AIR DOC] minor tweaks to checkpoint user guide for clarity and consistency subheadings (#26937)
1b06e7a83a [tune] Only sync down from cloud if needed (#26725)
4cc1ef1557 [Core] Refactoring Ray DAG object scanner (#26917)
df217d15e0 [air] Raise error on path-like access for Checkpoints (#26970)
5315f1e643 [AIR] Enable other notebooks previously marked with # REGRESSION (#26896)
5030a4c1d3 [RLlib] Simplify agent collector (#26803)
df638b3f0f [Datasets] Automatically cast tensor columns when building Pandas blocks. (#26924)
0e1b77d52a [Workflow] Fix flaky example(#26960)
e8222ff600 [dashboard] Update cluster_activities endpoint to use pydantic. (#26609)
aae0aaedbd [air] Un-revert "[air] remove unnecessary logs + improve repr for result" (#26942)
bf1d9971f1 [setup-dev] Add flag to skip symlink certain folders (#26899)
ec1995a662 [air/tune/docs] Cont. convert Tune examples to use Tuner.fit() (#26959)
de7bd015a4 [air/tune/docs] Change Tuner() occurences in rest of ray/tune (#26961)
3ea80f6aa1 [data] set iter_batches default batch_size (#26955)
b1594260ba [RLlib] Small SlateQ example fix. (#26948)
41c9ef709a [RLlib] Using PG when not doing microbatching kills A2C performance. (#26844)
794a81028b [ci] add repro-ci-requirements.txt (#26951)
bf97a6944b [Dashboard] Actor Table UI Optimize (#26785)
4d6cbb0fd4 [Java]More efficient getAllNodeInfo() (#26872)
abde2a5f97 [tune] Fix current best trial progress string for metric=0 (#26943)
4a1ad3e87a [Workflow] Support "retry_exceptions" of Ray tasks (#26913)
a012033033 [ci] pin werkzeug (#26950)
15b711ae6a [State Observability] Warn if callsite is disabled when ray list objects + raise exception on missing output (#26880)
1ac2a872e7 [docs] Editing pass over Dataset docs (#26935)
d01a80eb11 [core] runtime context resource ids getter (#26907)
acbab51d3e [Nightly] fix microbenchmark scripts (#26947)
0c16619475 [core] Make ray able to connect to redis without pip redis. (#25875)
8d7b865614 [air/tuner/docs] Update docs for Tuner() API 2a: Tune examples (non-docs) (#26931)
803c094534 [air/tuner/docs] Update docs for Tuner() API 2b: Tune examples (ipynb) (#26884)
008eecfbff [docs] Update the AIR data ingest guide (#26909)
e19cf164fd [Datasets] Use sampling to estimate in-memory data size for Parquet data source (#26868)
8fe439998e [air/tuner/docs] Update docs for Tuner() API 1: RSTs, docs, move reuse_actors (#26930)
c01bb831d4 [hotfix/data] Fix linter for test_split (#26944)
e9503dbe2b [RLlib] Push suggested changes from #25652 docs wording Parametric Models Action Masking. (#26793)
e9a8f7d9ae [RLlib] Unify gnorm mixin for tf and torch policies. (#26102)
c44d9ff397 [core] Fix the deadlock in submit task when actor failed. (#26898)
90cea203be Ray 2.0 API deprecation (#26116)
aaab4abad5 [Data][Split] stable version of split with hints (#26778)
37f4692aa8 [State Observability] Fix "No result for get crashing the formatting" and "Filtering not handled properly when key missing in the datum" #26881
d692a55018 [data] Make lazy mode non-experimental (#26934)
b32c784c7f [RLLib] RE3 exploration algorithm TF2 framework support (#25221)
bcec60d898 Revert "[data] set iter_batches default batch_size #26869 " (#26938)

xwjiang2010 commented 2 years ago

@scv119 I have a workspace coming up with a similar setup to the release test. Planning to poke at a few suspicious PRs. What would you recommend starting with? The dataset split ones?

xwjiang2010 commented 2 years ago

Also in the list is Jian's PR: https://github.com/ray-project/ray/pull/26902. Not sure why it was missed by git log.

scv119 commented 2 years ago

My impression is that the commit where it starts failing is not using split(locality_hints=...) by default; @matthewdeng to confirm. If that's the case, it's unlikely that [Data][Split] stable version of split with hints (https://github.com/ray-project/ray/pull/26778) caused the regression, since it's not enabled yet.
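
For reference, a minimal sketch of what locality-hint-based splitting looks like in Ray Data; the actors and split count here are illustrative, not the benchmark's actual setup:

import ray

@ray.remote
class Worker:
    def ready(self):
        return True

# Split a dataset into one shard per worker actor, asking Ray Data to
# co-locate each shard with the actor passed as its locality hint.
workers = [Worker.remote() for _ in range(4)]
ds = ray.data.range(1000)
shards = ds.split(len(workers), equal=True, locality_hints=workers)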

xwjiang2010 commented 2 years ago

Bisect 1: https://buildkite.com/ray-project/release-tests-branch/builds/823#01823b7b-15b0-4726-90b9-d4a9af468ff4 using https://github.com/alipay/ray/commit/cf57fce21ea13ae9f9f4a665de8161fa35617d4c

c21 commented 2 years ago

Thanks @xwjiang2010 for starting the bisect test.

If "https://github.com/ray-project/ray/commit/1ac2a872e7502e1263f136a7eb42bd1d3f88091f [docs] Editing pass over Dataset docs (https://github.com/ray-project/ray/pull/26935)" still has the issue, next bisect we can test the commit before "https://github.com/ray-project/ray/commit/e19cf164fd51c4f6bf730e999cba46b30c39ff83 [Datasets] Use sampling to estimate in-memory data size for Parquet data source (https://github.com/ray-project/ray/pull/26868)". Though from log, I couldn't find anything suspicious, but my change impacts Parquet read path, which is used in test.

xwjiang2010 commented 2 years ago

Still having the issue. Trying the cut-off at https://github.com/ray-project/ray/commit/e19cf164fd51c4f6bf730e999cba46b30c39ff83 (exclusive) now, as Cheng suggested.

Link: https://buildkite.com/ray-project/release-tests-branch/builds/825

matthewdeng commented 2 years ago

@xwjiang2010 looks like you ended up running it inclusive 😄

Kicking off one with the commit before (8fe439998ecd48b1da5216d882123e6bde3b8fb7): https://buildkite.com/ray-project/release-tests-branch/builds/826

matthewdeng commented 2 years ago

The above test passes, which indicates that this is indeed being caused by https://github.com/ray-project/ray/pull/26868 (https://github.com/ray-project/ray/commit/e19cf164fd51c4f6bf730e999cba46b30c39ff83).

image

@c21 could you take a deeper look at what's happening here?

ericl commented 2 years ago

That's interesting. Did the number of blocks increase? That could hit the known data imbalance issue from https://github.com/ray-project/ray/issues/26878
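
One quick way to check the block-count theory (the dataset path below is a placeholder for the benchmark's input) would be to compare this value on the good and bad wheels:

import ray

# If the sampling change altered the estimated in-memory size, the number of
# read blocks (and hence read tasks) can change with it.
ds = ray.data.read_parquet("s3://<bucket>/xgboost-100GB/")
print(ds.num_blocks())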

matthewdeng commented 2 years ago

Actually, how much data does the file sampling read? Even if the tasks are being spread, I'm wondering why there is a huge spike at the very beginning. I can play around with this and try to understand a bit more what's going on.

scv119 commented 2 years ago

So the theory is that after @c21's change we have an increased number of blocks, which leads to uneven spread?

clarkzinzow commented 2 years ago

Are we sure that the sampling reads are reading the expected small subset of data? The recent timing results for file sampling are much slower than I would expect: on the order of tens of seconds, while I'd expect sampling to be sub-second.
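
A simple way to sanity-check that, assuming a local copy of one benchmark file (file name below is a placeholder), is to time a small sample read and look at pyarrow's peak allocation:

import time

import pyarrow as pa
import pyarrow.parquet as pq

# Time a 5-row sample read from one fragment and report the peak memory-pool
# allocation, to see how heavy the "small" sample read actually is.
frag = pq.ParquetDataset("xgboost_0.parquet", use_legacy_dataset=False).fragments[0]
start = time.perf_counter()
frag.head(5)
print(f"{time.perf_counter() - start:.2f}s, peak {pa.default_memory_pool().max_memory()} bytes")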

c21 commented 2 years ago

After applying the corresponding fix below, I am able to get the nightly test to pass with sampling enabled:

Screen Shot 2022-07-26 at 10 15 29 PM

Root cause analysis:

After some digging, I found there are two issues:

Here's an example of reading one Parquet file: the reader allocated ~400MB of memory, but the actual in-memory file size is only 90MB.

>>> import pyarrow
>>> import pyarrow.parquet as pq
>>> 
>>> pq_ds = pq.ParquetDataset("xgboost_0.parquet", use_legacy_dataset=False)
>>> piece = pq_ds.pieces[0]
<stdin>:1: DeprecationWarning: 'ParquetDataset.pieces' attribute is deprecated as of pyarrow 5.0.0 and will be removed in a future version. Use the '.fragments' attribute instead
>>> piece.metadata
<pyarrow._parquet.FileMetaData object at 0x7f8b488ec130>
  created_by: parquet-cpp-arrow version 6.0.1
  num_columns: 43
  num_rows: 260000
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 21785
>>> num_rows = 5
>>> pyarrow.default_memory_pool().max_memory()
65536
>>> piece.head(num_rows, batch_size=num_rows).nbytes
1763
>>> pyarrow.default_memory_pool().max_memory()
390484736
>>> pyarrow.default_memory_pool().bytes_allocated()
0
>>> import ray
>>> ds = ray.data.read_parquet("/Users/chengsu/try/parquet/xgboost_0.parquet")
>>> ds.fully_executed().size_bytes()
90837500
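
For comparison, a hedged sketch of a metadata-only estimate; this just sums the uncompressed row-group sizes recorded in the Parquet footer and is not necessarily the fix that was applied:

import pyarrow.parquet as pq

# Sum the uncompressed row-group byte sizes from the footer instead of
# materializing sample rows, avoiding the large read-time allocation above.
meta = pq.read_metadata("xgboost_0.parquet")
footer_size = sum(meta.row_group(i).total_byte_size for i in range(meta.num_row_groups))
print(footer_size)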