vitsai closed this issue 1 year ago
@vitsai is this blocking the Ray 2.8 release?
I believe all failed release tests on the release branch are blockers, yes.
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_e9jfg33ubd76vpgh6vjtuctfft
Logs for the OOM, which hit after a good 6h of running the many_actor_tasks.py workload:
Memory on the node (IP: 10.0.30.108, ID: 5a2a082d5ae843d5f7a75365957cbc63ea1ee5c18b112b7e568553b8) where the task (actor ID: 7610140d413102328133bef401000000, name=Actor.__init__, pid=1900, memory used=0.06GB) was running was 27.57GB / 28.80GB (0.95745), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: a05567e1855b961abd91ad70da14b4c14774c76da94cd9bdcbd8b556) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 10.0.30.108`. To see the logs of the worker, use `ray logs worker-a05567e1855b961abd91ad70da14b4c14774c76da94cd9bdcbd8b556*out -ip 10.0.30.108. Top 10 memory users:
PID MEM(GB) COMMAND
912 22.01 /home/ray/anaconda3/lib/python3.8/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/s...
897 0.23 python workloads/many_actor_tasks.py
179 0.18 /home/ray/anaconda3/lib/python3.8/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/s...
288 0.08 /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/agen...
85 0.07 /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/anyscale session web_terminal_server --deploy...
290 0.07 /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/runti...
230 0.06 /home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/dashboa...
836 0.06 ray::JobSupervisor
73 0.06 /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/jupyter-lab --allow-root --ip=127.0.0.1 --no-...
1899 0.06 ray::Actor
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
Unexpected error occurred: Task was killed due to the node running low on memory.
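As a sanity check on the numbers in the log above, here is a minimal sketch that parses the "Top 10 memory users" rows and recomputes the usage ratio. The names (`ProcessRow`, `parse_rows`, the constants) are made up for illustration and are not part of Ray. The GB figures are the rounded values printed in the log, so the recomputed ratio will not match the logged 0.95745 exactly (presumably Ray computes it from unrounded byte counts; that is an assumption).

```python
from typing import List, NamedTuple

# Rows copied verbatim from the OOM message above (commands are truncated in the log).
TOP_MEMORY_USERS = """\
912 22.01 /home/ray/anaconda3/lib/python3.8/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/s...
897 0.23 python workloads/many_actor_tasks.py
179 0.18 /home/ray/anaconda3/lib/python3.8/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/s...
288 0.08 /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/agen...
85 0.07 /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/anyscale session web_terminal_server --deploy...
290 0.07 /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/runti...
230 0.06 /home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/dashboa...
836 0.06 ray::JobSupervisor
73 0.06 /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/jupyter-lab --allow-root --ip=127.0.0.1 --no-...
1899 0.06 ray::Actor
"""

# Node-level figures from the same message (rounded as printed).
NODE_USED_GB = 27.57
NODE_TOTAL_GB = 28.80
KILL_THRESHOLD = 0.95


class ProcessRow(NamedTuple):
    pid: int
    mem_gb: float
    command: str


def parse_rows(table: str) -> List[ProcessRow]:
    """Split each 'PID MEM(GB) COMMAND' row into its three fields."""
    rows = []
    for line in table.strip().splitlines():
        pid, mem_gb, command = line.split(maxsplit=2)
        rows.append(ProcessRow(int(pid), float(mem_gb), command))
    return rows


rows = parse_rows(TOP_MEMORY_USERS)
top10_total_gb = sum(r.mem_gb for r in rows)
largest = max(rows, key=lambda r: r.mem_gb)
usage_ratio = NODE_USED_GB / NODE_TOTAL_GB

print(f"top-10 processes total: {top10_total_gb:.2f} GB")
print(f"largest: pid {largest.pid}, {largest.mem_gb:.2f} GB "
      f"({largest.mem_gb / NODE_USED_GB:.0%} of node usage)")
print(f"node usage ratio: {usage_ratio:.4f}, threshold {KILL_THRESHOLD}")
print("over threshold:", usage_ratio > KILL_THRESHOLD)
```

The per-process figure (22.01GB for pid 912) and the node-level figure (27.57GB) measure different things, so they are not expected to be equal; the kill decision is based on the node-level ratio.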
The gcs_server eats 22GB of memory, which OOMs the whole system. Looking at the dashboard:
The node had a steady 15.6GB of memory usage until the very last minute (2023-10-21 10:38), when usage jumped to 27.8GB and Ray hit the OOM.
Q:
Note: this grafana screenshot is for the unit test control plane session, not the real workload session.
@rynewang is it possible to run this test on 2.7 and see whether it shows similar memory usage?
I think the risk is that it is a memory leak introduced in 2.8.
Running the workload side by side: 2.7.1 optimized vs. 3.0 (a random commit on master).
Btw, regarding the questions:
why is OOM killer reporting 22GB vs 27.8GB?
27.57GB / 28.80GB (0.95745),
-> actually this seems correct.
why is gcs server eating a lot more mem at that time, since the workload is steady (infinite loop of making hundreds of 1MB method calls)?
Maybe you can also post the log of gcs_server.out when this happens?
Was there any failure or something like that from the actors? When actors only run tasks, they should not touch the GCS at all.
Here is the full range of commits, with this being the most likely culprit: 3e8278dc88 [core][dashboard] Task backend GC policy - GCS refactor [2/3] (#38792)
25b57d04c7 add dask version (#40537)
0fc52f7a6e Change version numbers in 2.8 release (#40515)
dfdd43cf40 pick of #40525
e6c67051b5 pick of #40525
96efc33a66 make sure tests run serially within each docker (#40509)
add95611b7 Increase timeout for test basic 4 (#40492)
3e8278dc88 [core][dashboard] Task backend GC policy - GCS refactor [2/3] (#38792)
cfdc6e0e0b [ml] remove alpa release tests (#40510)
66cfec94d2 Revert "[core] Fix placement groups scheduling when no resources are specified (#39946)" (#40506)
a1ac74f7b4 [ci] Change the owner of cluster launcher related tests to clusters team. (#40424)
2d726098cf [Doc] [KubeRay] [RayJob] Add info about submitter pod template (#40158)
3c0476aa9c [ci] support gpu core assignment per test shard
af332f41e9 [core] Fix placement groups scheduling when no resources are specified (#39946)
b5ef0ae7a2 [serve] Add microbenchmarks for streaming HTTP and DeploymentHandle calls (#40498)
f9de8555ca [Data] Move _fetch_metadata_parallel to file_meta_provider (#40295)
318fd579c1 [Data] Fix bug where _StatsActor errors with PandasBlock (#40481)
d8f2527d5f [ci] move train/serve/default minimal tests to civ2 (#40454)
1fb6147a26 [data] add dataset name (#40430)
f3bc522d76 [Data] Remove deprecated do_write (#40422)
7c44833720 [RLlib] Issue 39586: Fix dict space restoration from serialized (ordered dict vs normal dict provided by user). (#39627)
7fa1c28acd [serve] Migrate workflow tests using v1 api (#40472)
f4b5f6b673 [owners] remove code owners that are no longer active. (#40476)
b83b591605 [Cluster launcher] [vSphere] avoid to fetch private ip (#40204)
9ba85ae83e [Serve] Get rid of ray cluster setup for test_schema test (#40469)
bdc9f83ff2 [RLlib] Add on_checkpoint_loaded callback AND also store eval workers' policy_mapping_fn in algo state. (#40350)
4f6c28f543 [Serve] [Docs] Update Serve docs to use the dashboard head instead of the agent (#40474)
1845f1c9f2 [Serve] Support arg builders with non-Pydantic type hints (#40471)
a6bc5ac0ff [RLlib] Issue 40312: Better documentation on how to do inference with DreamerV3 (once trained). (#40448)
d28f6453f7 [core] Error check Redis get requests (#40333)
56f6adcc97 [core] Fix placement group invariant of PlacementResources being superset of Resources (#40038)
149536d0f4 migrate rllib gpu tests to civ2 (#40439)
f077834217 [ci] support debug builds (#40466)
1a08c16813 [Train][Templates] Add LoRA support to Llama-2 finetuning example (#37794)
d6baf12425 [core] Fix session key check (#40468)
f905171540 [Doc] Fix streaming generator doc code #40447
bf1c581e84 [Data][Docs] Add Dataset.write_sql to API reference (#40473)
779c08a26d [Train] Update Lightning RayDDPStrategy docstring (#40376)
819733f0d6 [Data] Improve error message when reading HTTP files (#40462)
8e86f25118 [serve] Fix linkcheck + remove deprecated rest api (#40464)
4d93e37c01 [Data] Deflake Data CI test suites: test_stats, test_streaming_executor, test_object_gc (#40457)
b60c1723be [Core] [runtime env] Fix get_wheel_filename being out of date (#39965)
d80fd1d1a2 [data] link dataset ids in constructor, return correct metrics id for materialize (#40413)
7c5b27516d [ci] move ray on spark test to civ2 (#40438)
bfe026f461 [civ2][gpu/4] migrate rllib multi-gpu tests to civ2 (#40379)
c6347d56a9 [serve] Remove custom FastAPI encoders (#40449)
b5eae24692 minimal (#40433)
3052a8d116 migrate ml tests to civ02 (#40440)
c44765ab34 [ci] mark dataset_shuffle_push_based_random_shuffle_100tb.aws as unstable (#40437)
3e13e7c17a [serve] Remove extra comment (#40441)
8f5cd610b1 [data] ray data dashboard config (#40195)
2da60b7f30 [runtime_env]: Remove hypen from profiler config (#40395)
5f832b3346 [core][streaming][python] Fix asyncio.wait coroutines args deprecated warnings #40292
8a7d674662 [ci] migrate debug + asan core builds to civ2 (#40418)
512e6adb36 [civ2][gpu/3] create rllib gpu builds (#40364)
6a5215ccad [unjail] test_redis_tls (#40423)
6ba659f920 [Data] Cap op concurrency with exponential ramp-up (#40275)
929b445d0a [tune] remove test_client.py (#40415)
b3c14249e1 move serve release tests to civ2 (#40414)
5205a2d121 [ci] change to oss tags (#40428)
f10c25914d Revert "[docker image] use buildkit to build ray image (#40365)" (#40427)
11fb194050 [Data] Move BlockWritePathProvider to separate file (#40302)
56b72a542a [Data] Remove out-of-date Data examples (#40127)
c91ee0fecc [tune] remove TuneRichReporter (#40169)
4d0c05b00e release test infra still relies on python 3.7 (#40407)
9d9e7c37c1 [serve] Clean up test_metrics.py::test_queued_queries_disconnect (#40410)
58b26141d8 [serve] Migrate v1 api release tests (#40372)
ff58667da5 [serve] Outdated API cleanup in docs (#40404)
c16082fa01 [ci] add special tag for ray and ray-ml image steps (#40394)
bc80271c5e Update CODEOWNERS (#40268)
ac73b159d6 [serve] Fix flaky test_autoscaling_policy on windows (#40411)
c4ad3d5a2a [jobs] Fix recovery race condition in JobManager (#40068)
7bd1d3a0b2 [Data] Deprecate extraneous Dataset parameters and methods (#40385)
405e82ab64 [RLlib] Issue deprecation warnings for all rllib_contrib algos. (#40147)
e28e2a65a6 [serve] Deprecate DAG API (#40290)
f787843b7f [Doc][KubeRay] Add a section for Redis cleanup (#40308)
71e893c59e [doc] Add vSphere version requirement in user guide (#40284)
67ec4476e3 [Doc] Logging: Add Fluent Bit DaemonSet and CloudWatch for persistence (#39895)
bde327fcdc [Jobs] append error trace to job driver logs (#40380)
a8b24dae32 [Serve] Fix the benchmark import error (#40381)
574eb54bee [serve] Migrate v1 api tests (#40363)
16da48491e [Core] Introduce AcceleratorManager interface (#40286)
56affb7e4b [RLlib] Fix BC release test failure. (#40371)
dc944fe7d9 [Dependencies] Remove pickle5 backport (#40338)
58cd807bf5 [dashboard] Remove /api/snapshot endpoint (#40269)
a2ef28db16 [Core] Bugfix/runtime agent binding (#40092) (#40311)
8fa1565053 [deflakey] Deflakey test_redis_tls (#40378)
820aad1836 [civ2][gpu/2] migrate ml gpu tests to civ2 (#40362)
ad7e1fc2ee [Train] Deprecate TransformersTrainer (#40277)
89eb6da181 [Train] Update checkpoint path for RayTrainReportCallbacks. (#40174)
dd6eb71fdd [Dependencies] Remove typing_extensions (#40336)
4113ab42bc [runtime env]: Integrating Nsight to Ray worker process (#39998)
c6baff26d7 [Train] Fix lightning 2.0 import path (#40266)
199b6cacdf [RLlib][Docs] Add mobile-env to RLlib community examples (#37641)
1a286fd255 [Train] Deprecate AccelerateTrainer (#40274)
b3c7af543b fix (#40374)
40275a944e [KubeRay][Autoscaler] Make KubeRay CRD version configurable (#40357)
5c3f100dc3 [serve] Deprecate single app config (#40329)
0c06bb9894 [data] store ray dashboard metrics in _StatsActor (#40118)
941ac71e43 [serve] Fix deploy config edge case bug (#40326)
3d2d4fe816 [data] Allow setting target max block size per-op instead of per-Dataset and reduce for streaming maps (#39710)
da5046e76a [docker image] use buildkit to build ray image (#40365)
d9e24f2d59 [civ2][gpu/1] create ml gpu builds (#40322)
563a9bf32a Jail //python/ray/tests:test_redis_tls (#40366)
09d4f0ab72 [Data] Fix return type and docstring for iter APIs (#40361)
c49b8ed244 [Data] Fix documentation link for local shuffle (#40291)
8310ce11df [Data] Remove BulkExecutor code path (#40200)
56337e04b7 [data] Add function arg params to map and flat_map (#40010)
ba581a3d60 [serve] Initial pydantic>=2.0 compatibility (#40222)
b31a5aaf0d [serve] Remove v1 api (#40218)
4ab0ba0823 [Data] Remove FileMetadataShuffler (#40341)
306c71438c [Doc] Streaming generator alpha doc (#39914)
8d286f03ce [RLlib-contrib] Dreamer(V1) (won't be moved into rllib_contrib, b/c we now have DreamerV3). (#36621)
f097cd4512 [RLlib] Remove some deprecation warnings that should not be there. (#39984)
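With a range this long and a test that takes hours to fail, bisecting is the practical way to narrow it down. Below is a minimal sketch of the idea, not the tooling actually used here. `first_bad_commit` and `simulated_is_bad` are hypothetical names; the predicate is a stand-in for "run the release test at this commit and see whether it OOMs", hard-coded to the outcome reported in this thread. It also assumes the list above is newest first, so the slice is reversed to oldest first.

```python
from typing import Callable, List, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary-search for the earliest bad commit.

    Assumes commits are ordered oldest to newest, the newest one is bad,
    and every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid        # culprit is at mid or earlier
        else:
            lo = mid + 1    # culprit is strictly after mid
    return commits[lo]


# Newest end of the range above, reversed to oldest -> newest.
COMMITS: List[str] = [
    "af332f41e9", "3c0476aa9c", "2d726098cf", "a1ac74f7b4", "66cfec94d2",
    "cfdc6e0e0b", "3e8278dc88", "add95611b7", "96efc33a66", "e6c67051b5",
    "dfdd43cf40", "0fc52f7a6e", "25b57d04c7",
]

tested: List[str] = []


def simulated_is_bad(commit: str) -> bool:
    """Stand-in for a real test run; hard-coded to the suspected culprit."""
    tested.append(commit)
    return COMMITS.index(commit) >= COMMITS.index("3e8278dc88")


culprit = first_bad_commit(COMMITS, simulated_is_bad)
print("first bad commit:", culprit)
print("commits tested:", tested)
```

Each probe here would be one long test run, so the point of bisecting is that the number of runs grows with the logarithm of the range length rather than with the length itself.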
So it's confirmed https://github.com/ray-project/ray/commit/3e8278dc8809274a7c324797897223b8a5b8bc5b is the root cause.
Reason:
The only unknown is why, on the metrics page, this doesn't show up as a slow increase but as a burst. But the bisection is kind of conclusive:
This is merged into the release branch; can we close it soon?
https://buildkite.com/ray-project/release-tests-branch/builds/2282#018b5007-4248-48e1-b2af-40f41b7ba51f