
[AIR] Significant data reading regression in Ray cluster from xgboost 100GB test #26995

Closed jiaodong closed 2 years ago

jiaodong commented 2 years ago

What happened + What you expected to happen

Issues observed

1. We're no longer able to evenly distribute read tasks across the cluster; significant skew leads to an imbalanced memory pattern.
2. Significantly increased memory usage on the head node, which easily leads to OOM.

The regression happened between Ray commits 8ecd928c34db0b23e4aa2a4ea0c8cff25c37b413 and aadd82dcbd6bb0a8083550ef3edf39c98bf08ce0, roughly within the past 3 days.

Good nightly release run

Link to good nightly release run: https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_mWECugke9RzMh79BZQqeykjN/clusters/ses_r2ZVKt7AzSsY1seXN5hqene4?command-history-section=command_history

Logs: https://gist.github.com/jiaodong/6fd5728e35b23e6f78c4c6049b754d09

Pip freeze: https://gist.github.com/jiaodong/7d84cc4a73eb80d2a7b40dc31b507138

Bad nightly release run

Link to bad nightly release run: https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_mWECugke9RzMh79BZQqeykjN/clusters/ses_egWTLrS2PYDUJsQYzeKjiqzP?command-history-section=head_start_up_log

Logs: https://gist.github.com/jiaodong/73d716ea47a9319aa11e629719a1d735

Pip freeze: https://gist.github.com/jiaodong/8ea91f79b7af6602479a970b751a8679

Good run memory usage metrics

Screen Shot 2022-07-25 at 5 15 02 PM

Bad run memory usage metrics

Screen Shot 2022-07-25 at 5 15 43 PM

Corresponding Ray dashboard stats

Good run ray dashboard

Screen Shot 2022-07-25 at 5 14 39 PM

Bad run ray dashboard

Screen Shot 2022-07-25 at 5 15 38 PM

Versions / Dependencies

On master

Reproduction script

Re-run xgboost_benchmark.py

https://sourcegraph.com/github.com/ray-project/ray/-/blob/release/air_tests/air_benchmarks/workloads/xgboost_benchmark.py

with 100GB of data on a 10-node Ray cluster, using the cluster config below (a minimal sketch of the read path follows the config).

max_workers: 9

head_node_type:
  name: head_node
  instance_type: m5.4xlarge

worker_node_types:
- name: worker_node
  instance_type: m5.4xlarge
  max_workers: 9
  min_workers: 9
  use_spot: false

aws:
  BlockDeviceMappings:
  - DeviceName: /dev/sda1
    Ebs:
      Iops: 5000
      Throughput: 1000
      VolumeSize: 1000
      VolumeType: gp3
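
For quick local context, a minimal sketch (not the benchmark script itself) of the data-loading step this workload exercises; the S3 path is a placeholder, not the actual benchmark location:

import ray

# Placeholder path: the real benchmark reads ~100GB of Parquet across the cluster.
ray.init(address="auto")
ds = ray.data.read_parquet("s3://<bucket>/xgboost-100GB/")
print(ds.count(), "rows,", ds.size_bytes(), "bytes")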

Issue Severity

High: It blocks me from completing my task.

jiaodong commented 2 years ago

I ran an experiment pinning pip3 install -U Werkzeug==2.1.2 on the latest run, in case Werkzeug==2.2.0 broke it, but the same issue remains.

Deps diff from pip freeze

Good                              Bad
importlib-resources==5.8.0        importlib-resources==5.9.0
jax==0.3.14                       jax==0.3.15
jaxlib==0.3.14                    jaxlib==0.3.15
lz4==4.0.1                        lz4==4.0.2
regex==2022.7.9                   regex==2022.7.24
torchmetrics==0.9.2               torchmetrics==0.9.3

(good) ray @ https://s3-us-west-2.amazonaws.com/ray-wheels/master/aadd82dcbd6bb0a8083550ef3edf39c98bf08ce0/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl

(bad) ray @ https://s3-us-west-2.amazonaws.com/ray-wheels/master/8ecd928c34db0b23e4aa2a4ea0c8cff25c37b413/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl

xwjiang2010 commented 2 years ago

The last working run was last Saturday at 3pm. The next commit after the last working commit: https://github.com/ray-project/ray/commit/da9581b7465c5e1e4903595b4107ed1fa601920f

Range: from da9581b7465c5e1e4903595b4107ed1fa601920f to 8ecd928c34db0b23e4aa2a4ea0c8cff25c37b413, both inclusive.

scv119 commented 2 years ago

git log --oneline da9581b7465c5e1e4903595b4107ed1fa601920f..8ecd928c34db0b23e4aa2a4ea0c8cff25c37b413

8ecd928c34 [Serve] Make the checkpoint and recover only from GCS (#26753)
193e824bc1 [AIR DOC] minor tweaks to checkpoint user guide for clarity and consistency subheadings (#26937)
1b06e7a83a [tune] Only sync down from cloud if needed (#26725)
4cc1ef1557 [Core] Refactoring Ray DAG object scanner (#26917)
df217d15e0 [air] Raise error on path-like access for Checkpoints (#26970)
5315f1e643 [AIR] Enable other notebooks previously marked with # REGRESSION (#26896)
5030a4c1d3 [RLlib] Simplify agent collector (#26803)
df638b3f0f [Datasets] Automatically cast tensor columns when building Pandas blocks. (#26924)
0e1b77d52a [Workflow] Fix flaky example(#26960)
e8222ff600 [dashboard] Update cluster_activities endpoint to use pydantic. (#26609)
aae0aaedbd [air] Un-revert "[air] remove unnecessary logs + improve repr for result" (#26942)
bf1d9971f1 [setup-dev] Add flag to skip symlink certain folders (#26899)
ec1995a662 [air/tune/docs] Cont. convert Tune examples to use Tuner.fit() (#26959)
de7bd015a4 [air/tune/docs] Change Tuner() occurences in rest of ray/tune (#26961)
3ea80f6aa1 [data] set iter_batches default batch_size (#26955)
b1594260ba [RLlib] Small SlateQ example fix. (#26948)
41c9ef709a [RLlib] Using PG when not doing microbatching kills A2C performance. (#26844)
794a81028b [ci] add repro-ci-requirements.txt (#26951)
bf97a6944b [Dashboard] Actor Table UI Optimize (#26785)
4d6cbb0fd4 [Java]More efficient getAllNodeInfo() (#26872)
abde2a5f97 [tune] Fix current best trial progress string for metric=0 (#26943)
4a1ad3e87a [Workflow] Support "retry_exceptions" of Ray tasks (#26913)
a012033033 [ci] pin werkzeug (#26950)
15b711ae6a [State Observability] Warn if callsite is disabled when ray list objects + raise exception on missing output (#26880)
1ac2a872e7 [docs] Editing pass over Dataset docs (#26935)
d01a80eb11 [core] runtime context resource ids getter (#26907)
acbab51d3e [Nightly] fix microbenchmark scripts (#26947)
0c16619475 [core] Make ray able to connect to redis without pip redis. (#25875)
8d7b865614 [air/tuner/docs] Update docs for Tuner() API 2a: Tune examples (non-docs) (#26931)
803c094534 [air/tuner/docs] Update docs for Tuner() API 2b: Tune examples (ipynb) (#26884)
008eecfbff [docs] Update the AIR data ingest guide (#26909)
e19cf164fd [Datasets] Use sampling to estimate in-memory data size for Parquet data source (#26868)
8fe439998e [air/tuner/docs] Update docs for Tuner() API 1: RSTs, docs, move reuse_actors (#26930)
c01bb831d4 [hotfix/data] Fix linter for test_split (#26944)
e9503dbe2b [RLlib] Push suggested changes from #25652 docs wording Parametric Models Action Masking. (#26793)
e9a8f7d9ae [RLlib] Unify gnorm mixin for tf and torch policies. (#26102)
c44d9ff397 [core] Fix the deadlock in submit task when actor failed. (#26898)
90cea203be Ray 2.0 API deprecation (#26116)
aaab4abad5 [Data][Split] stable version of split with hints (#26778)
37f4692aa8 [State Observability] Fix "No result for get crashing the formatting" and "Filtering not handled properly when key missing in the datum" #26881
d692a55018 [data] Make lazy mode non-experimental (#26934)
b32c784c7f [RLLib] RE3 exploration algorithm TF2 framework support (#25221)
bcec60d898 Revert "[data] set iter_batches default batch_size #26869 " (#26938)

xwjiang2010 commented 2 years ago

@scv119 I have a workspace coming up with a similar setup to the release test. Planning to poke at a few suspicious PRs. What would you recommend starting with? The dataset split ones?

xwjiang2010 commented 2 years ago

Also in the list is Jian's PR: https://github.com/ray-project/ray/pull/26902. Not sure why it was missed by git log.

scv119 commented 2 years ago

My impression is that the commit where it starts failing is not using split(locality_hints=...) by default; @matthewdeng to confirm. If that's the case, it's unlikely that [Data][Split] stable version of split with hints (https://github.com/ray-project/ray/pull/26778) caused the regression, since it's not enabled yet.
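
For reference, a minimal sketch of what locality-hint-based splitting looks like in Ray Data; the actors and split count here are illustrative, not the benchmark's actual setup:

import ray

@ray.remote
class Worker:
    def ready(self):
        return True

# Split a dataset into one shard per worker actor, asking Ray Data to
# co-locate each shard with the actor passed as its locality hint.
workers = [Worker.remote() for _ in range(4)]
ds = ray.data.range(1000)
shards = ds.split(len(workers), equal=True, locality_hints=workers)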

xwjiang2010 commented 2 years ago

Bisect 1: https://buildkite.com/ray-project/release-tests-branch/builds/823#01823b7b-15b0-4726-90b9-d4a9af468ff4 using https://github.com/alipay/ray/commit/cf57fce21ea13ae9f9f4a665de8161fa35617d4c

c21 commented 2 years ago

Thanks @xwjiang2010 for starting the bisect test.

If "https://github.com/ray-project/ray/commit/1ac2a872e7502e1263f136a7eb42bd1d3f88091f [docs] Editing pass over Dataset docs (https://github.com/ray-project/ray/pull/26935)" still has the issue, next bisect we can test the commit before "https://github.com/ray-project/ray/commit/e19cf164fd51c4f6bf730e999cba46b30c39ff83 [Datasets] Use sampling to estimate in-memory data size for Parquet data source (https://github.com/ray-project/ray/pull/26868)". Though from log, I couldn't find anything suspicious, but my change impacts Parquet read path, which is used in test.

xwjiang2010 commented 2 years ago

Still having the issue. Trying the cut-off at https://github.com/ray-project/ray/commit/e19cf164fd51c4f6bf730e999cba46b30c39ff83 (exclusive) now, as Cheng suggested.

Link: https://buildkite.com/ray-project/release-tests-branch/builds/825

matthewdeng commented 2 years ago

@xwjiang2010 looks like you ended up running it inclusive 😄

Kicking off one with the commit before (8fe439998ecd48b1da5216d882123e6bde3b8fb7): https://buildkite.com/ray-project/release-tests-branch/builds/826

matthewdeng commented 2 years ago

The above test passes, which indicates that this is indeed being caused by https://github.com/ray-project/ray/pull/26868 (https://github.com/ray-project/ray/commit/e19cf164fd51c4f6bf730e999cba46b30c39ff83).

image

@c21 could you take a deeper look at what's happening here?

ericl commented 2 years ago

That's interesting. Did the number of blocks increase? That could hit the known data imbalance issue from https://github.com/ray-project/ray/issues/26878
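
One quick way to check the block-count theory (the dataset path below is a placeholder for the benchmark's input) would be to compare this value on the good and bad wheels:

import ray

# If the sampling change altered the estimated in-memory size, the number of
# read blocks (and hence read tasks) can change with it.
ds = ray.data.read_parquet("s3://<bucket>/xgboost-100GB/")
print(ds.num_blocks())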

matthewdeng commented 2 years ago

Actually, how much data does the file sampling read? Even if the tasks are being spread, I'm wondering why there is a huge spike at the very beginning. I can play around with this and try to understand a bit more what's going on.

scv119 commented 2 years ago

So the theory is that after @c21's change we have an increased number of blocks, which leads to uneven spread?

clarkzinzow commented 2 years ago

Are we sure that the sampling reads are reading the expected small subset of data? The recent timing results for file sampling are much slower than I would expect: on the order of tens of seconds, while I'd expect sampling to be sub-second.
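
A simple way to sanity-check that, assuming a local copy of one benchmark file (file name below is a placeholder), is to time a small sample read and look at pyarrow's peak allocation:

import time

import pyarrow as pa
import pyarrow.parquet as pq

# Time a 5-row sample read from one fragment and report the peak memory-pool
# allocation, to see how heavy the "small" sample read actually is.
frag = pq.ParquetDataset("xgboost_0.parquet", use_legacy_dataset=False).fragments[0]
start = time.perf_counter()
frag.head(5)
print(f"{time.perf_counter() - start:.2f}s, peak {pa.default_memory_pool().max_memory()} bytes")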

c21 commented 2 years ago

After applying the corresponding fix below, I am able to get the nightly test to pass with sampling enabled:

Screen Shot 2022-07-26 at 10 15 29 PM

Root cause analysis:

After some digging, I found there are two issues:

Here's an example of reading one Parquet file: the reader allocated ~400MB of memory, but the actual in-memory file size is only 90MB.

>>> import pyarrow
>>> import pyarrow.parquet as pq
>>> 
>>> pq_ds = pq.ParquetDataset("xgboost_0.parquet", use_legacy_dataset=False)
>>> piece = pq_ds.pieces[0]
<stdin>:1: DeprecationWarning: 'ParquetDataset.pieces' attribute is deprecated as of pyarrow 5.0.0 and will be removed in a future version. Use the '.fragments' attribute instead
>>> piece.metadata
<pyarrow._parquet.FileMetaData object at 0x7f8b488ec130>
  created_by: parquet-cpp-arrow version 6.0.1
  num_columns: 43
  num_rows: 260000
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 21785
>>> num_rows = 5
>>> pyarrow.default_memory_pool().max_memory()
65536
>>> piece.head(num_rows, batch_size=num_rows).nbytes
1763
>>> pyarrow.default_memory_pool().max_memory()
390484736
>>> pyarrow.default_memory_pool().bytes_allocated()
0
>>> import ray
>>> ds = ray.data.read_parquet("/Users/chengsu/try/parquet/xgboost_0.parquet")
>>> ds.fully_executed().size_bytes()
90837500
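
For comparison, a hedged sketch of a metadata-only estimate; this just sums the uncompressed row-group sizes recorded in the Parquet footer and is not necessarily the fix that was applied:

import pyarrow.parquet as pq

# Sum the uncompressed row-group byte sizes from the footer instead of
# materializing sample rows, avoiding the large read-time allocation above.
meta = pq.read_metadata("xgboost_0.parquet")
footer_size = sum(meta.row_group(i).total_byte_size for i in range(meta.num_row_groups))
print(footer_size)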