ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
34.29k stars 5.83k forks

Release test long_running_many_actor_tasks failed #40568

Closed (vitsai closed 1 year ago)

vitsai commented 1 year ago

https://buildkite.com/ray-project/release-tests-branch/builds/2282#018b5007-4248-48e1-b2af-40f41b7ba51f

anyscalesam commented 1 year ago

@vitsai is this blocking ray28 release?

vitsai commented 1 year ago

I believe all failed release tests on the release branch are blockers, yes.

rynewang commented 1 year ago

ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.

https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_e9jfg33ubd76vpgh6vjtuctfft

rynewang commented 1 year ago

Logs for the OOM, after a good 6h of running the many_actor_tasks.py workload:

Memory on the node (IP: 10.0.30.108, ID: 5a2a082d5ae843d5f7a75365957cbc63ea1ee5c18b112b7e568553b8) where the task (actor ID: 7610140d413102328133bef401000000, name=Actor.__init__, pid=1900, memory used=0.06GB) was running was 27.57GB / 28.80GB (0.95745), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: a05567e1855b961abd91ad70da14b4c14774c76da94cd9bdcbd8b556) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 10.0.30.108`. To see the logs of the worker, use `ray logs worker-a05567e1855b961abd91ad70da14b4c14774c76da94cd9bdcbd8b556*out -ip 10.0.30.108. Top 10 memory users:
PID     MEM(GB) COMMAND
912     22.01   /home/ray/anaconda3/lib/python3.8/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/s...
897     0.23    python workloads/many_actor_tasks.py
179     0.18    /home/ray/anaconda3/lib/python3.8/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/s...
288     0.08    /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/agen...
85      0.07    /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/anyscale session web_terminal_server --deploy...
290     0.07    /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/runti...
230     0.06    /home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/dashboa...
836     0.06    ray::JobSupervisor
73      0.06    /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/jupyter-lab --allow-root --ip=127.0.0.1 --no-...
1899    0.06    ray::Actor
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
Unexpected error occurred: Task was killed due to the node running low on memory.
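
To pick out the heaviest process from a message like this without reading the table by eye, the "Top 10 memory users" rows can be parsed with a few lines of Python. This is a minimal sketch: `parse_memory_users` and the regular expression are illustrative helpers, not part of Ray, and the table is an excerpt of the rows above (commands stay truncated exactly as logged).

```python
import re

# Excerpt of the "Top 10 memory users" table from the OOM message above.
OOM_TABLE = """\
PID     MEM(GB) COMMAND
912     22.01   /home/ray/anaconda3/lib/python3.8/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/s...
897     0.23    python workloads/many_actor_tasks.py
179     0.18    /home/ray/anaconda3/lib/python3.8/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/s...
1899    0.06    ray::Actor
"""

# A data row is: integer PID, decimal GB value, then the rest as the command.
ROW = re.compile(r"^(\d+)\s+(\d+(?:\.\d+)?)\s+(.+)$")


def parse_memory_users(text):
    """Return a list of (pid, mem_gb, command) tuples, one per data row."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:  # the header line is skipped because it does not start with digits
            rows.append((int(match.group(1)), float(match.group(2)), match.group(3)))
    return rows


rows = parse_memory_users(OOM_TABLE)
top = max(rows, key=lambda row: row[1])  # row with the largest MEM(GB)
print(f"top consumer: pid={top[0]} mem={top[1]}GB cmd={top[2]}")
```

On the excerpt above, the largest entry is the gcs_server process with PID 912.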

The gcs_server eats 22GB of memory, which OOMs the whole system. Looking at the dashboard:

image

The node has a steady 15.6GB memory usage until the very last minute (2023-10-21 10:38), when memory jumped to 27.8GB and Ray OOMed.

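Q:

why is OOM killer reporting 22GB vs 27.8GB?

why is gcs server eating a lot more mem at that time, since the workload is steady (infinite loop of making hundreds of 1MB method calls)?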

Note: this grafana screenshot is for the unit test control plane session, not the real workload session.

rkooo567 commented 1 year ago

@rynewang is it possible to run this test on 2.7 and check whether it shows a similar mem usage?

rkooo567 commented 1 year ago

I think the risk is that it is a memory leak introduced in 2.8

rynewang commented 1 year ago

Running the workload side by side: 2.7.1 optimized vs 3.0 (a random commit on master)

rkooo567 commented 1 year ago

Btw, for the answers:

why is OOM killer reporting 22GB vs 27.8GB?

27.57GB / 28.80GB (0.95745), -> actually this seems correct.

why is gcs server eating a lot more mem at that time, since the workload is steady (infinite loop of making hundreds of 1MB method calls)?

Maybe you can also post the log of gcs_server.out when this happens?
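
The ratio quoted above can be re-derived from the numbers in the OOM message. This is a minimal sketch that uses only values copied from the log; the variable names are illustrative. Because the log rounds the GB figures to two decimals, the recomputed ratio differs slightly from the logged 0.95745.

```python
# Values copied from the OOM message in this thread.
used_gb, total_gb = 27.57, 28.80
threshold = 0.95  # "memory usage threshold of 0.95" in the message

fraction = used_gb / total_gb
exceeds = fraction > threshold

# MEM(GB) column of the "Top 10 memory users" table in the same message.
top10_gb = [22.01, 0.23, 0.18, 0.08, 0.07, 0.07, 0.06, 0.06, 0.06, 0.06]
top10_sum = round(sum(top10_gb), 2)
not_listed_gb = round(used_gb - top10_sum, 2)

print(f"usage fraction: {fraction:.4f}, exceeds threshold: {exceeds}")
print(f"top-10 total: {top10_sum} GB, not in the list: {not_listed_gb} GB")
```

Read this way, 22GB is the gcs_server process alone, while 27.57GB is the node-wide figure; the difference is memory that the top-10 list does not attribute to any of the listed processes.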

rkooo567 commented 1 year ago

Was there any failure or something like that from the actors? When actors only run tasks, they should not touch GCS at all.

rickyyx commented 1 year ago

Here is the full range of commits, with this being the most likely culprit: 3e8278dc88 [core][dashboard] Task backend GC policy - GCS refactor [2/3] (#38792)

25b57d04c7 add dask version (#40537)
0fc52f7a6e Change version numbers in 2.8 release (#40515)
dfdd43cf40 pick of #40525
e6c67051b5 pick of #40525
96efc33a66 make sure tests run serially within each docker (#40509)
add95611b7 Increase timeout for test basic 4 (#40492)
3e8278dc88 [core][dashboard] Task backend GC policy - GCS refactor [2/3] (#38792)
cfdc6e0e0b [ml] remove alpa release tests (#40510)
66cfec94d2 Revert "[core] Fix placement groups scheduling when no resources are specified (#39946)" (#40506)
a1ac74f7b4 [ci] Change the owner of cluster launcher related tests to clusters team. (#40424)
2d726098cf [Doc] [KubeRay] [RayJob] Add info about submitter pod template (#40158)
3c0476aa9c [ci] support gpu core assignment per test shard
af332f41e9 [core] Fix placement groups scheduling when no resources are specified (#39946)
b5ef0ae7a2 [serve] Add microbenchmarks for streaming HTTP and DeploymentHandle calls (#40498)
f9de8555ca [Data] Move _fetch_metadata_parallel to file_meta_provider (#40295)
318fd579c1 [Data] Fix bug where _StatsActor errors with PandasBlock (#40481)
d8f2527d5f [ci] move train/serve/default minimal tests to civ2 (#40454)
1fb6147a26 [data] add dataset name (#40430)
f3bc522d76 [Data] Remove deprecated do_write (#40422)
7c44833720 [RLlib] Issue 39586: Fix dict space restoration from serialized (ordered dict vs normal dict provided by user). (#39627)
7fa1c28acd [serve] Migrate workflow tests using v1 api (#40472)
f4b5f6b673 [owners] remove code owners that are no longer active. (#40476)
b83b591605 [Cluster launcher] [vSphere] avoid to fetch private ip (#40204)
9ba85ae83e [Serve] Get rid of ray cluster setup for test_schema test (#40469)
bdc9f83ff2 [RLlib] Add on_checkpoint_loaded callback AND also store eval workers' policy_mapping_fn in algo state. (#40350)
4f6c28f543 [Serve] [Docs] Update Serve docs to use the dashboard head instead of the agent (#40474)
1845f1c9f2 [Serve] Support arg builders with non-Pydantic type hints (#40471)
a6bc5ac0ff [RLlib] Issue 40312: Better documentation on how to do inference with DreamerV3 (once trained). (#40448)
d28f6453f7 [core] Error check Redis get requests (#40333)
56f6adcc97 [core] Fix placement group invariant of PlacementResources being superset of Resources (#40038)
149536d0f4 migrate rllib gpu tests to civ2 (#40439)
f077834217 [ci] support debug builds (#40466)
1a08c16813 [Train][Templates] Add LoRA support to Llama-2 finetuning example (#37794)
d6baf12425 [core] Fix session key check (#40468)
f905171540 [Doc] Fix streaming generator doc code #40447
bf1c581e84 [Data][Docs] Add Dataset.write_sql to API reference (#40473)
779c08a26d [Train] Update Lightning RayDDPStrategy docstring (#40376)
819733f0d6 [Data] Improve error message when reading HTTP files (#40462)
8e86f25118 [serve] Fix linkcheck + remove deprecated rest api (#40464)
4d93e37c01 [Data] Deflake Data CI test suites: test_stats, test_streaming_executor, test_object_gc (#40457)
b60c1723be [Core] [runtime env] Fix get_wheel_filename being out of date (#39965)
d80fd1d1a2 [data] link dataset ids in constructor, return correct metrics id for materialize (#40413)
7c5b27516d [ci] move ray on spark test to civ2 (#40438)
bfe026f461 [civ2][gpu/4] migrate rllib multi-gpu tests to civ2 (#40379)
c6347d56a9 [serve] Remove custom FastAPI encoders (#40449)
b5eae24692 minimal (#40433)
3052a8d116 migrate ml tests to civ02 (#40440)
c44765ab34 [ci] mark dataset_shuffle_push_based_random_shuffle_100tb.aws as unstable (#40437)
3e13e7c17a [serve] Remove extra comment (#40441)
8f5cd610b1 [data] ray data dashboard config (#40195)
2da60b7f30 [runtime_env]: Remove hypen from profiler config (#40395)
5f832b3346 [core][streaming][python] Fix asyncio.wait coroutines args deprecated warnings #40292
8a7d674662 [ci] migrate debug + asan core builds to civ2 (#40418)
512e6adb36 [civ2][gpu/3] create rllib gpu builds (#40364)
6a5215ccad [unjail] test_redis_tls (#40423)
6ba659f920 [Data] Cap op concurrency with exponential ramp-up (#40275)
929b445d0a [tune] remove test_client.py (#40415)
b3c14249e1 move serve release tests to civ2 (#40414)
5205a2d121 [ci] change to oss tags (#40428)
f10c25914d Revert "[docker image] use buildkit to build ray image (#40365)" (#40427)
11fb194050 [Data] Move BlockWritePathProvider to separate file (#40302)
56b72a542a [Data] Remove out-of-date Data examples (#40127)
c91ee0fecc [tune] remove TuneRichReporter (#40169)
4d0c05b00e release test infra still relies on python 3.7 (#40407)
9d9e7c37c1 [serve] Clean up test_metrics.py::test_queued_queries_disconnect (#40410)
58b26141d8 [serve] Migrate v1 api release tests (#40372)
ff58667da5 [serve] Outdated API cleanup in docs (#40404)
c16082fa01 [ci] add special tag for ray and ray-ml image steps (#40394)
bc80271c5e Update CODEOWNERS (#40268)
ac73b159d6 [serve] Fix flaky test_autoscaling_policy on windows (#40411)
c4ad3d5a2a [jobs] Fix recovery race condition in JobManager (#40068)
7bd1d3a0b2 [Data] Deprecate extraneous Dataset parameters and methods (#40385)
405e82ab64 [RLlib] Issue deprecation warnings for all rllib_contrib algos. (#40147)
e28e2a65a6 [serve] Deprecate DAG API (#40290)
f787843b7f [Doc][KubeRay] Add a section for Redis cleanup (#40308)
71e893c59e [doc] Add vSphere version requirement in user guide (#40284)
67ec4476e3 [Doc] Logging: Add Fluent Bit DaemonSet and CloudWatch for persistence (#39895)
bde327fcdc [Jobs] append error trace to job driver logs (#40380)
a8b24dae32 [Serve] Fix the benchmark import error (#40381)
574eb54bee [serve] Migrate v1 api tests (#40363)
16da48491e [Core] Introduce AcceleratorManager interface (#40286)
56affb7e4b [RLlib] Fix BC release test failure. (#40371)
dc944fe7d9 [Dependencies] Remove pickle5 backport (#40338)
58cd807bf5 [dashboard] Remove /api/snapshot endpoint (#40269)
a2ef28db16 [Core] Bugfix/runtime agent binding (#40092) (#40311)
8fa1565053 [deflakey] Deflakey test_redis_tls (#40378)
820aad1836 [civ2][gpu/2] migrate ml gpu tests to civ2 (#40362)
ad7e1fc2ee [Train] Deprecate TransformersTrainer (#40277)
89eb6da181 [Train] Update checkpoint path for RayTrainReportCallbacks. (#40174)
dd6eb71fdd [Dependencies] Remove typing_extensions (#40336)
4113ab42bc [runtime env]: Integrating Nsight to Ray worker process (#39998)
c6baff26d7 [Train] Fix lightning 2.0 import path (#40266)
199b6cacdf [RLlib][Docs] Add mobile-env to RLlib community examples (#37641)
1a286fd255 [Train] Deprecate AccelerateTrainer (#40274)
b3c7af543b fix (#40374)
40275a944e [KubeRay][Autoscaler] Make KubeRay CRD version configurable (#40357)
5c3f100dc3 [serve] Deprecate single app config (#40329)
0c06bb9894 [data] store ray dashboard metrics in _StatsActor (#40118)
941ac71e43 [serve] Fix deploy config edge case bug (#40326)
3d2d4fe816 [data] Allow setting target max block size per-op instead of per-Dataset and reduce for streaming maps (#39710)
da5046e76a [docker image] use buildkit to build ray image (#40365)
d9e24f2d59 [civ2][gpu/1] create ml gpu builds (#40322)
563a9bf32a Jail //python/ray/tests:test_redis_tls (#40366)
09d4f0ab72 [Data] Fix return type and docstring for iter APIs (#40361)
c49b8ed244 [Data] Fix documentation link for local shuffle (#40291)
8310ce11df [Data] Remove BulkExecutor code path (#40200)
56337e04b7 [data] Add function arg params to map and flat_map (#40010)
ba581a3d60 [serve] Initial pydantic>=2.0 compatibility (#40222)
b31a5aaf0d [serve] Remove v1 api (#40218)
4ab0ba0823 [Data] Remove FileMetadataShuffler (#40341)
306c71438c [Doc] Streaming generator alpha doc (#39914)
8d286f03ce [RLlib-contrib] Dreamer(V1) (won't be moved into rllib_contrib, b/c we now have DreamerV3). (#36621)
f097cd4512 [RLlib] Remove some deprecation warnings that should not be there. (#39984)

image

rickyyx commented 1 year ago

So it's confirmed https://github.com/ray-project/ray/commit/3e8278dc8809274a7c324797897223b8a5b8bc5b is the root cause.

Reason:

The only unknown is why, on the metrics page, this doesn't show up as a slow increase but as a burst. But the bisection is kind of conclusive.
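
For readers unfamiliar with the technique, bisecting a commit range is a binary search for the first commit at which a check starts failing. The sketch below is an illustration only, not the procedure actually used here: `first_bad` and `simulated_is_bad` are hypothetical names, the five hashes are a small sample from the list in this thread (assumed to be listed newest first, so they are reversed to oldest first), and the predicate is simulated rather than a real release-test run.

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad(commit) is True.

    Assumes commits are ordered oldest to newest, that the newest commit
    is bad, and that every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid       # first bad commit is at mid or earlier
        else:
            lo = mid + 1   # first bad commit is after mid
    return commits[lo]


# Sample of hashes from the thread, assumed oldest to newest.
commits = ["66cfec94d2", "cfdc6e0e0b", "3e8278dc88", "add95611b7", "96efc33a66"]

# Simulated outcome: pretend every commit from 3e8278dc88 onward fails.
# A real predicate would run the long-running test on a build of each commit.
culprit_index = commits.index("3e8278dc88")


def simulated_is_bad(commit):
    return commits.index(commit) >= culprit_index


culprit = first_bad(commits, simulated_is_bad)
print("first bad commit:", culprit)
```

Each step halves the remaining range, so a range of roughly a hundred commits needs only about seven test runs, which matters when one run takes hours.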

vitsai commented 1 year ago

This is merged into release branch, can we close it soon?