jiaodong closed this issue 2 years ago.
I ran an experiment pinning `pip3 install -U Werkzeug==2.1.2` on the latest run, in case Werkzeug==2.2.0 broke it, but the same issue remains.
Deps diff from `pip freeze`:

| Good | Bad |
| --- | --- |
| importlib-resources==5.8.0 | importlib-resources==5.9.0 |
| jax==0.3.14 | jax==0.3.15 |
| jaxlib==0.3.14 | jaxlib==0.3.15 |
| lz4==4.0.1 | lz4==4.0.2 |
| regex==2022.7.9 | regex==2022.7.24 |
| torchmetrics==0.9.2 | torchmetrics==0.9.3 |
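For reference, a dependency diff like the one above can be produced mechanically from two `pip freeze` captures. A minimal stdlib-only sketch (function names are mine, not from any tool):

```python
def freeze_to_dict(text):
    """Parse `pkg==version` lines from a `pip freeze` capture into a dict."""
    pins = {}
    for line in text.splitlines():
        if "==" in line:
            name, _, version = line.strip().partition("==")
            pins[name] = version
    return pins

def diff_freezes(good_text, bad_text):
    """Return {package: (good_version, bad_version)} for every changed pin."""
    good, bad = freeze_to_dict(good_text), freeze_to_dict(bad_text)
    return {pkg: (good.get(pkg), bad.get(pkg))
            for pkg in sorted(set(good) | set(bad))
            if good.get(pkg) != bad.get(pkg)}
```

Feeding it the two gists' `pip freeze` outputs would reproduce the table above, including packages added or removed between the runs (reported with `None` on the missing side).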
The last working run was last Saturday at 3pm. The first commit after the last working commit: https://github.com/ray-project/ray/commit/da9581b7465c5e1e4903595b4107ed1fa601920f
Suspect range: from da9581b7465c5e1e4903595b4107ed1fa601920f to 8ecd928c34db0b23e4aa2a4ea0c8cff25c37b413, both inclusive (note that `git log A..B` itself excludes `A`):
```
git log --oneline da9581b7465c5e1e4903595b4107ed1fa601920f..8ecd928c34db0b23e4aa2a4ea0c8cff25c37b413
8ecd928c34 [Serve] Make the checkpoint and recover only from GCS (#26753)
193e824bc1 [AIR DOC] minor tweaks to checkpoint user guide for clarity and consistency subheadings (#26937)
1b06e7a83a [tune] Only sync down from cloud if needed (#26725)
4cc1ef1557 [Core] Refactoring Ray DAG object scanner (#26917)
df217d15e0 [air] Raise error on path-like access for Checkpoints (#26970)
5315f1e643 [AIR] Enable other notebooks previously marked with # REGRESSION (#26896)
5030a4c1d3 [RLlib] Simplify agent collector (#26803)
df638b3f0f [Datasets] Automatically cast tensor columns when building Pandas blocks. (#26924)
0e1b77d52a [Workflow] Fix flaky example(#26960)
e8222ff600 [dashboard] Update cluster_activities endpoint to use pydantic. (#26609)
aae0aaedbd [air] Un-revert "[air] remove unnecessary logs + improve repr for result" (#26942)
bf1d9971f1 [setup-dev] Add flag to skip symlink certain folders (#26899)
ec1995a662 [air/tune/docs] Cont. convert Tune examples to use Tuner.fit() (#26959)
de7bd015a4 [air/tune/docs] Change Tuner() occurences in rest of ray/tune (#26961)
3ea80f6aa1 [data] set iter_batches default batch_size (#26955)
b1594260ba [RLlib] Small SlateQ example fix. (#26948)
41c9ef709a [RLlib] Using PG when not doing microbatching kills A2C performance. (#26844)
794a81028b [ci] add repro-ci-requirements.txt (#26951)
bf97a6944b [Dashboard] Actor Table UI Optimize (#26785)
4d6cbb0fd4 [Java]More efficient getAllNodeInfo() (#26872)
abde2a5f97 [tune] Fix current best trial progress string for metric=0 (#26943)
4a1ad3e87a [Workflow] Support "retry_exceptions" of Ray tasks (#26913)
a012033033 [ci] pin werkzeug (#26950)
15b711ae6a [State Observability] Warn if callsite is disabled when ray list objects + raise exception on missing output (#26880)
1ac2a872e7 [docs] Editing pass over Dataset docs (#26935)
d01a80eb11 [core] runtime context resource ids getter (#26907)
acbab51d3e [Nightly] fix microbenchmark scripts (#26947)
0c16619475 [core] Make ray able to connect to redis without pip redis. (#25875)
8d7b865614 [air/tuner/docs] Update docs for Tuner() API 2a: Tune examples (non-docs) (#26931)
803c094534 [air/tuner/docs] Update docs for Tuner() API 2b: Tune examples (ipynb) (#26884)
008eecfbff [docs] Update the AIR data ingest guide (#26909)
e19cf164fd [Datasets] Use sampling to estimate in-memory data size for Parquet data source (#26868)
8fe439998e [air/tuner/docs] Update docs for Tuner() API 1: RSTs, docs, move reuse_actors (#26930)
c01bb831d4 [hotfix/data] Fix linter for test_split (#26944)
e9503dbe2b [RLlib] Push suggested changes from #25652 docs wording Parametric Models Action Masking. (#26793)
e9a8f7d9ae [RLlib] Unify gnorm mixin for tf and torch policies. (#26102)
c44d9ff397 [core] Fix the deadlock in submit task when actor failed. (#26898)
90cea203be Ray 2.0 API deprecation (#26116)
aaab4abad5 [Data][Split] stable version of split with hints (#26778)
37f4692aa8 [State Observability] Fix "No result for get crashing the formatting" and "Filtering not handled properly when key missing in the datum" #26881
d692a55018 [data] Make lazy mode non-experimental (#26934)
b32c784c7f [RLLib] RE3 exploration algorithm TF2 framework support (#25221)
bcec60d898 Revert "[data] set iter_batches default batch_size #26869 " (#26938)
```
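Bisecting a commit range like this is mechanical once there is a pass/fail oracle (here, one nightly release run per probe). A minimal sketch of the search, assuming the commit just before the range is good, the last commit is bad, and everything after the first bad commit stays bad (`is_bad` is a hypothetical callable wrapping a release-test run):

```python
def first_bad_commit(commits, is_bad):
    """Binary-search a commit list ordered oldest -> newest for the first
    bad commit. Assumes commits[0] is good and commits[-1] is bad."""
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # first bad commit is at mid or earlier
        else:
            lo = mid  # first bad commit is after mid
    return commits[hi]
```

With ~44 candidate commits this converges in about six probes, which is why cutting the range roughly in half each time (as done in this thread) is the right move even when each probe is an expensive nightly run.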
@scv119 I have a workspace coming up with a similar setup to the release test. Planning to poke at a few suspicious PRs. What would you recommend starting with? The dataset split ones?
Also in the range is Jian's PR: https://github.com/ray-project/ray/pull/26902
Not sure why it's missing from the git log output.
My impression is that the commit where it starts failing is not using split(locality_hints=...) by default — @matthewdeng to confirm. If that's the case, it's unlikely that
[Data][Split] stable version of split with hints (https://github.com/ray-project/ray/pull/26778)
caused the regression, as it's not enabled yet.
Thanks @xwjiang2010 for starting the bisect test.
If "https://github.com/ray-project/ray/commit/1ac2a872e7502e1263f136a7eb42bd1d3f88091f [docs] Editing pass over Dataset docs (https://github.com/ray-project/ray/pull/26935)" still has the issue, for the next bisect we can test the commit before "https://github.com/ray-project/ray/commit/e19cf164fd51c4f6bf730e999cba46b30c39ff83 [Datasets] Use sampling to estimate in-memory data size for Parquet data source (https://github.com/ray-project/ray/pull/26868)". I couldn't find anything suspicious in the logs, but my change affects the Parquet read path, which is used in this test.
Still having the issue. Trying a cut-off at https://github.com/ray-project/ray/commit/e19cf164fd51c4f6bf730e999cba46b30c39ff83 (exclusive) now, as Cheng suggested.
Link: https://buildkite.com/ray-project/release-tests-branch/builds/825
@xwjiang2010 looks like you ended up running it inclusive 😄
Kicking off one with the commit before (8fe439998ecd48b1da5216d882123e6bde3b8fb7): https://buildkite.com/ray-project/release-tests-branch/builds/826
The above test passes, which indicates that this is indeed caused by https://github.com/ray-project/ray/pull/26868 (https://github.com/ray-project/ray/commit/e19cf164fd51c4f6bf730e999cba46b30c39ff83)
@c21 could you take a deeper look at what's happening here?
That's interesting. Did the number of blocks increase? That could hit the known data imbalance issue from https://github.com/ray-project/ray/issues/26878
Actually, how much data does the file sampling read? Even if the tasks are being spread, I'm wondering why there is a huge spike at the very beginning. I can play around with this and try to understand more about what's going on.
So the theory is that after @c21's change we have an increased number of blocks, which leads to uneven spread?
Are we sure that the sampling reads are reading the expected small subset of data? The recent timing results for file sampling are much slower than I would expect: on the order of tens of seconds, while I'd expect sampling to be sub-second.
After applying the corresponding fix below, I am able to get the nightly test to pass with sampling enabled.
Root cause analysis:
After some digging, I found there are two issues.
Here's an example of reading one Parquet file: the reader allocated ~400MB of memory, but the actual in-memory size of the file is just ~90MB.
```python
>>> import pyarrow
>>> import pyarrow.parquet as pq
>>> import ray
>>>
>>> pq_ds = pq.ParquetDataset("xgboost_0.parquet", use_legacy_dataset=False)
>>> piece = pq_ds.pieces[0]
<stdin>:1: DeprecationWarning: 'ParquetDataset.pieces' attribute is deprecated as of pyarrow 5.0.0 and will be removed in a future version. Use the '.fragments' attribute instead
>>> piece.metadata
<pyarrow._parquet.FileMetaData object at 0x7f8b488ec130>
  created_by: parquet-cpp-arrow version 6.0.1
  num_columns: 43
  num_rows: 260000
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 21785
>>> num_rows = 5
>>> pyarrow.default_memory_pool().max_memory()
65536
>>> piece.head(num_rows, batch_size=num_rows).nbytes
1763
>>> pyarrow.default_memory_pool().max_memory()
390484736
>>> pyarrow.default_memory_pool().bytes_allocated()
0
>>> ds = ray.data.read_parquet("/Users/chengsu/try/parquet/xgboost_0.parquet")
>>> ds.fully_executed().size_bytes()
90837500
```
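The numbers in that session are consistent with a simple row-scaled estimate: scaling the 1763-byte, 5-row sample up to 260000 rows gives 91676000 bytes (~91.7MB), close to the measured 90837500 bytes. A minimal sketch of that estimator (the function name is mine, not Ray's; integer arithmetic avoids float truncation):

```python
def estimate_in_memory_size(sample_nbytes, sample_rows, total_rows):
    """Scale the in-memory size of a small row sample up to the whole file."""
    if sample_rows == 0:
        return 0
    return sample_nbytes * total_rows // sample_rows

# Using the figures from the session above:
# 1763 bytes for 5 rows, 260000 rows total -> 91676000 bytes (~91.7MB),
# versus the measured 90837500 bytes (~90.8MB).
```

So the estimate itself is sound; the problem shown by `max_memory()` is the cost of *obtaining* the sample, where reading 5 rows transiently allocated ~390MB.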
What happened + What you expected to happen
Issues observed
1) We're no longer able to evenly distribute read tasks across the cluster; the significant skew leads to an imbalanced memory pattern.
2) Significantly increased memory cost on the head node, which easily leads to OOM.
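Issue (1) can be made concrete with a simple skew metric over per-node block counts; a toy sketch (the metric name is mine, not Ray's):

```python
def max_to_mean_skew(blocks_per_node):
    """Ratio of the most-loaded node's block count to the mean load.
    1.0 means a perfectly even spread; higher means more skew."""
    mean = sum(blocks_per_node) / len(blocks_per_node)
    return max(blocks_per_node) / mean if mean else float("inf")
```

On a 10-node cluster, an even spread scores 1.0, while all read tasks landing on one node scores 10.0, which is the memory-pressure pattern described in the dashboard screenshots.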
Regression happened between Ray commits 8ecd928c34db0b23e4aa2a4ea0c8cff25c37b413 and aadd82dcbd6bb0a8083550ef3edf39c98bf08ce0, roughly in the past 3 days from now.
Good nightly release run
Link to good nightly release run: https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_mWECugke9RzMh79BZQqeykjN/clusters/ses_r2ZVKt7AzSsY1seXN5hqene4?command-history-section=command_history
Logs: https://gist.github.com/jiaodong/6fd5728e35b23e6f78c4c6049b754d09
Pip freeze: https://gist.github.com/jiaodong/7d84cc4a73eb80d2a7b40dc31b507138
Bad nightly release run
Link to bad nightly release run: https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_mWECugke9RzMh79BZQqeykjN/clusters/ses_egWTLrS2PYDUJsQYzeKjiqzP?command-history-section=head_start_up_log
Logs: https://gist.github.com/jiaodong/73d716ea47a9319aa11e629719a1d735
Pip freeze: https://gist.github.com/jiaodong/8ea91f79b7af6602479a970b751a8679
Good run memory usage metrics
Bad run memory usage metrics
Corresponding Ray dashboard stats
Good run ray dashboard
Bad run ray dashboard
Versions / Dependencies
On master
Reproduction script
Re-run xgboost_benchmark.py (https://sourcegraph.com/github.com/ray-project/ray/-/blob/release/air_tests/air_benchmarks/workloads/xgboost_benchmark.py) with 100GB of data on a 10-node Ray cluster.
Issue Severity
High: It blocks me from completing my task.