Release test long_running_many_actor_tasks failed

vitsai commented 1 year ago

https://buildkite.com/ray-project/release-tests-branch/builds/2282#018b5007-4248-48e1-b2af-40f41b7ba51f

anyscalesam commented 1 year ago

@vitsai is this blocking ray28 release?

vitsai commented 1 year ago

I believe all failed release tests on the release branch are blockers, yes.

rynewang commented 1 year ago

ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.

https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_e9jfg33ubd76vpgh6vjtuctfft

rynewang commented 1 year ago

Logs for the OOM, which hit after a good 6h of running the many_actor_tasks.py workload:

Memory on the node (IP: 10.0.30.108, ID: 5a2a082d5ae843d5f7a75365957cbc63ea1ee5c18b112b7e568553b8) where the task (actor ID: 7610140d413102328133bef401000000, name=Actor.__init__, pid=1900, memory used=0.06GB) was running was 27.57GB / 28.80GB (0.95745), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: a05567e1855b961abd91ad70da14b4c14774c76da94cd9bdcbd8b556) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 10.0.30.108`. To see the logs of the worker, use `ray logs worker-a05567e1855b961abd91ad70da14b4c14774c76da94cd9bdcbd8b556*out -ip 10.0.30.108. Top 10 memory users:
PID     MEM(GB) COMMAND
912     22.01   /home/ray/anaconda3/lib/python3.8/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/s...
897     0.23    python workloads/many_actor_tasks.py
179     0.18    /home/ray/anaconda3/lib/python3.8/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/s...
288     0.08    /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/agen...
85      0.07    /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/anyscale session web_terminal_server --deploy...
290     0.07    /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/runti...
230     0.06    /home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/dashboa...
836     0.06    ray::JobSupervisor
73      0.06    /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/jupyter-lab --allow-root --ip=127.0.0.1 --no-...
1899    0.06    ray::Actor
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
Unexpected error occurred: Task was killed due to the node running low on memory.

The gcs_server eats 22GB of memory, which OOMs the whole system. Looking at the dashboard:

The node holds steady at 15.6GB of memory usage until the very last minute (2023-10-21 10:38), when memory jumps to 27.8GB and Ray OOMs.

Q:

Why is the OOM killer reporting 22GB vs 27.8GB?
Why is the gcs_server eating so much more memory at that point, given that the workload is steady (an infinite loop making hundreds of 1MB method calls)?

Note: this grafana screenshot is for the unit test control plane session, not the real workload session.

rkooo567 commented 1 year ago

@rynewang is it possible to run this test on 2.7 and see whether it shows similar memory usage?

rkooo567 commented 1 year ago

I think the risk is that it is a memory leak introduced in 2.8

rynewang commented 1 year ago

Running the workload side by side: 2.7.1 optimized vs 3.0 (a random commit on master).

rkooo567 commented 1 year ago

Btw, for the answers:

> why is OOM killer reporting 22GB vs 27.8GB?

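27.57GB / 28.80GB (0.95745) -> actually this seems correct. The 27.57GB is the total memory used on the node, while the 22GB is the gcs_server process (PID 912) alone.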
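As a quick sanity check on those numbers, here is a minimal Python sketch that uses only figures copied from the OOM log earlier in this thread. The variable names are made up for illustration, and the ratio differs slightly from the logged 0.95745 because the log prints rounded GB values.

```python
# Figures copied from the OOM log earlier in this thread.
node_used_gb = 27.57    # total memory used on the node
node_total_gb = 28.80   # total memory available on the node
threshold = 0.95        # memory usage threshold reported in the log
gcs_server_gb = 22.01   # PID 912 in the "Top 10 memory users" table

# Node-level ratio that the memory monitor compares against the threshold.
usage_ratio = node_used_gb / node_total_gb
over_threshold = usage_ratio > threshold

# Memory on the node that is not attributed to that one gcs_server process.
other_gb = round(node_used_gb - gcs_server_gb, 2)

print(f"node usage ratio: {usage_ratio:.4f} (threshold {threshold})")
print(f"over threshold: {over_threshold}")
print(f"used by everything except that gcs_server process: {other_gb} GB")
```

So the two figures measure different things: one is node-wide usage, the other is a single process.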

> why is gcs server eating a lot more mem at that time, since the workload is steady (infinite loop of making hundreds of 1MB method calls)?

Maybe you can also post the log of gcs_server.out when this happens?

rkooo567 commented 1 year ago

Was there any failure or something like that from the actors? When actors only run tasks, they should not touch GCS at all.

rickyyx commented 1 year ago

Here is the full range of commits, with this being the most likely culprit: 3e8278dc88 [core][dashboard] Task backend GC policy - GCS refactor [2/3] (#38792)

25b57d04c7 add dask version (#40537)
0fc52f7a6e Change version numbers in 2.8 release (#40515)
dfdd43cf40 pick of #40525
e6c67051b5 pick of #40525
96efc33a66 make sure tests run serially within each docker (#40509)
add95611b7 Increase timeout for test basic 4 (#40492)
3e8278dc88 [core][dashboard] Task backend GC policy - GCS refactor [2/3] (#38792)
cfdc6e0e0b [ml] remove alpa release tests (#40510)
66cfec94d2 Revert "[core] Fix placement groups scheduling when no resources are specified (#39946)" (#40506)
a1ac74f7b4 [ci] Change the owner of cluster launcher related tests to clusters team. (#40424)
2d726098cf [Doc] [KubeRay] [RayJob] Add info about submitter pod template (#40158)
3c0476aa9c [ci] support gpu core assignment per test shard
af332f41e9 [core] Fix placement groups scheduling when no resources are specified (#39946)
b5ef0ae7a2 [serve] Add microbenchmarks for streaming HTTP and DeploymentHandle calls (#40498)
f9de8555ca [Data] Move _fetch_metadata_parallel to file_meta_provider (#40295)
318fd579c1 [Data] Fix bug where _StatsActor errors with PandasBlock (#40481)
d8f2527d5f [ci] move train/serve/default minimal tests to civ2 (#40454)
1fb6147a26 [data] add dataset name (#40430)
f3bc522d76 [Data] Remove deprecated do_write (#40422)
7c44833720 [RLlib] Issue 39586: Fix dict space restoration from serialized (ordered dict vs normal dict provided by user). (#39627)
7fa1c28acd [serve] Migrate workflow tests using v1 api (#40472)
f4b5f6b673 [owners] remove code owners that are no longer active. (#40476)
b83b591605 [Cluster launcher] [vSphere] avoid to fetch private ip (#40204)
9ba85ae83e [Serve] Get rid of ray cluster setup for test_schema test (#40469)
bdc9f83ff2 [RLlib] Add on_checkpoint_loaded callback AND also store eval workers' policy_mapping_fn in algo state. (#40350)
4f6c28f543 [Serve] [Docs] Update Serve docs to use the dashboard head instead of the agent (#40474)
1845f1c9f2 [Serve] Support arg builders with non-Pydantic type hints (#40471)
a6bc5ac0ff [RLlib] Issue 40312: Better documentation on how to do inference with DreamerV3 (once trained). (#40448)
d28f6453f7 [core] Error check Redis get requests (#40333)
56f6adcc97 [core] Fix placement group invariant of PlacementResources being superset of Resources (#40038)
149536d0f4 migrate rllib gpu tests to civ2 (#40439)
f077834217 [ci] support debug builds (#40466)
1a08c16813 [Train][Templates] Add LoRA support to Llama-2 finetuning example (#37794)
d6baf12425 [core] Fix session key check (#40468)
f905171540 [Doc] Fix streaming generator doc code #40447
bf1c581e84 [Data][Docs] Add Dataset.write_sql to API reference (#40473)
779c08a26d [Train] Update Lightning RayDDPStrategy docstring (#40376)
819733f0d6 [Data] Improve error message when reading HTTP files (#40462)
8e86f25118 [serve] Fix linkcheck + remove deprecated rest api (#40464)
4d93e37c01 [Data] Deflake Data CI test suites: test_stats, test_streaming_executor, test_object_gc (#40457)
b60c1723be [Core] [runtime env] Fix get_wheel_filename being out of date (#39965)
d80fd1d1a2 [data] link dataset ids in constructor, return correct metrics id for materialize (#40413)
7c5b27516d [ci] move ray on spark test to civ2 (#40438)
bfe026f461 [civ2][gpu/4] migrate rllib multi-gpu tests to civ2 (#40379)
c6347d56a9 [serve] Remove custom FastAPI encoders (#40449)
b5eae24692 minimal (#40433)
3052a8d116 migrate ml tests to civ02 (#40440)
c44765ab34 [ci] mark dataset_shuffle_push_based_random_shuffle_100tb.aws as unstable (#40437)
3e13e7c17a [serve] Remove extra comment (#40441)
8f5cd610b1 [data] ray data dashboard config (#40195)
2da60b7f30 [runtime_env]: Remove hypen from profiler config (#40395)
5f832b3346 [core][streaming][python] Fix asyncio.wait coroutines args deprecated warnings #40292
8a7d674662 [ci] migrate debug + asan core builds to civ2 (#40418)
512e6adb36 [civ2][gpu/3] create rllib gpu builds (#40364)
6a5215ccad [unjail] test_redis_tls (#40423)
6ba659f920 [Data] Cap op concurrency with exponential ramp-up (#40275)
929b445d0a [tune] remove test_client.py (#40415)
b3c14249e1 move serve release tests to civ2 (#40414)
5205a2d121 [ci] change to oss tags (#40428)
f10c25914d Revert "[docker image] use buildkit to build ray image (#40365)" (#40427)
11fb194050 [Data] Move BlockWritePathProvider to separate file (#40302)
56b72a542a [Data] Remove out-of-date Data examples (#40127)
c91ee0fecc [tune] remove TuneRichReporter (#40169)
4d0c05b00e release test infra still relies on python 3.7 (#40407)
9d9e7c37c1 [serve] Clean up test_metrics.py::test_queued_queries_disconnect (#40410)
58b26141d8 [serve] Migrate v1 api release tests (#40372)
ff58667da5 [serve] Outdated API cleanup in docs (#40404)
c16082fa01 [ci] add special tag for ray and ray-ml image steps (#40394)
bc80271c5e Update CODEOWNERS (#40268)
ac73b159d6 [serve] Fix flaky test_autoscaling_policy on windows (#40411)
c4ad3d5a2a [jobs] Fix recovery race condition in JobManager (#40068)
7bd1d3a0b2 [Data] Deprecate extraneous Dataset parameters and methods (#40385)
405e82ab64 [RLlib] Issue deprecation warnings for all rllib_contrib algos. (#40147)
e28e2a65a6 [serve] Deprecate DAG API (#40290)
f787843b7f [Doc][KubeRay] Add a section for Redis cleanup (#40308)
71e893c59e [doc] Add vSphere version requirement in user guide (#40284)
67ec4476e3 [Doc] Logging: Add Fluent Bit DaemonSet and CloudWatch for persistence (#39895)
bde327fcdc [Jobs] append error trace to job driver logs (#40380)
a8b24dae32 [Serve] Fix the benchmark import error (#40381)
574eb54bee [serve] Migrate v1 api tests (#40363)
16da48491e [Core] Introduce AcceleratorManager interface (#40286)
56affb7e4b [RLlib] Fix BC release test failure. (#40371)
dc944fe7d9 [Dependencies] Remove pickle5 backport (#40338)
58cd807bf5 [dashboard] Remove /api/snapshot endpoint (#40269)
a2ef28db16 [Core] Bugfix/runtime agent binding (#40092) (#40311)
8fa1565053 [deflakey] Deflakey test_redis_tls (#40378)
820aad1836 [civ2][gpu/2] migrate ml gpu tests to civ2 (#40362)
ad7e1fc2ee [Train] Deprecate TransformersTrainer (#40277)
89eb6da181 [Train] Update checkpoint path for RayTrainReportCallbacks. (#40174)
dd6eb71fdd [Dependencies] Remove typing_extensions (#40336)
4113ab42bc [runtime env]: Integrating Nsight to Ray worker process (#39998)
c6baff26d7 [Train] Fix lightning 2.0 import path (#40266)
199b6cacdf [RLlib][Docs] Add mobile-env to RLlib community examples (#37641)
1a286fd255 [Train] Deprecate AccelerateTrainer (#40274)
b3c7af543b fix (#40374)
40275a944e [KubeRay][Autoscaler] Make KubeRay CRD version configurable (#40357)
5c3f100dc3 [serve] Deprecate single app config (#40329)
0c06bb9894 [data] store ray dashboard metrics in _StatsActor (#40118)
941ac71e43 [serve] Fix deploy config edge case bug (#40326)
3d2d4fe816 [data] Allow setting target max block size per-op instead of per-Dataset and reduce for streaming maps (#39710)
da5046e76a [docker image] use buildkit to build ray image (#40365)
d9e24f2d59 [civ2][gpu/1] create ml gpu builds (#40322)
563a9bf32a Jail //python/ray/tests:test_redis_tls (#40366)
09d4f0ab72 [Data] Fix return type and docstring for iter APIs (#40361)
c49b8ed244 [Data] Fix documentation link for local shuffle (#40291)
8310ce11df [Data] Remove BulkExecutor code path (#40200)
56337e04b7 [data] Add function arg params to map and flat_map (#40010)
ba581a3d60 [serve] Initial pydantic>=2.0 compatibility (#40222)
b31a5aaf0d [serve] Remove v1 api (#40218)
4ab0ba0823 [Data] Remove FileMetadataShuffler (#40341)
306c71438c [Doc] Streaming generator alpha doc (#39914)
8d286f03ce [RLlib-contrib] Dreamer(V1) (won't be moved into rllib_contrib, b/c we now have DreamerV3). (#36621)
f097cd4512 [RLlib] Remove some deprecation warnings that should not be there. (#39984)

rickyyx commented 1 year ago

So it's confirmed https://github.com/ray-project/ray/commit/3e8278dc8809274a7c324797897223b8a5b8bc5b is the root cause.

Reason:

In that PR we started tracking all lost tasks in GCS (so that we could report data loss at task-level granularity).
With more than 100M actor tasks generated over the lifetime of the job, this tracked state grows larger and larger.
We do GC this info, but only when the job finishes.
As a result, a long-running job that generates many tasks gradually blows up GCS memory.
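The growth pattern described above can be illustrated with a toy model. This is a sketch under stated assumptions, not Ray's GCS code: `LostTaskTracker`, `simulate`, and the `max_per_job` cap are hypothetical names invented here, and the cap is shown only for contrast, not as the fix that was merged.

```python
from collections import deque


class LostTaskTracker:
    """Toy model of per-job task bookkeeping that is freed only at job end.

    Hypothetical illustration; this is not Ray's GCS implementation.
    """

    def __init__(self, max_per_job=None):
        # max_per_job=None means no cap, which models the behavior described above.
        self.max_per_job = max_per_job
        self.tracked = {}  # job_id -> deque of task ids

    def record(self, job_id, task_id):
        if job_id not in self.tracked:
            # deque(maxlen=None) is unbounded; with a maxlen it drops the oldest entry.
            self.tracked[job_id] = deque(maxlen=self.max_per_job)
        self.tracked[job_id].append(task_id)

    def on_job_finished(self, job_id):
        # The only point at which tracked entries for a job are released.
        self.tracked.pop(job_id, None)

    def size(self):
        return sum(len(entries) for entries in self.tracked.values())


def simulate(tracker, num_tasks, job_id="long_running_job"):
    """Record num_tasks entries for a job that never finishes; return entries held."""
    for task_id in range(num_tasks):
        tracker.record(job_id, task_id)
    return tracker.size()


unbounded = LostTaskTracker()
capped = LostTaskTracker(max_per_job=1_000)  # hypothetical cap, for contrast only

unbounded_size = simulate(unbounded, 100_000)
capped_size = simulate(capped, 100_000)
print("unbounded entries:", unbounded_size)
print("capped entries:", capped_size)

# Memory is only released once the job finishes, which a long-running job never does.
unbounded.on_job_finished("long_running_job")
print("after job finishes:", unbounded.size())
```

In the uncapped tracker the number of held entries scales with the number of tasks recorded, which matches the explanation that a job which never finishes never triggers the GC.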

The only unknown is why, on the metrics page, this shows up as a burst rather than a slow increase. But the bisection is fairly conclusive:

Pass before this commit: https://buildkite.com/ray-project/release-tests-branch/builds/2326
Fails at this commit with OOM: https://buildkite.com/ray-project/release-tests-branch/builds/2327
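For readers unfamiliar with bisection, the idea can be sketched in a few lines of Python. Everything here is illustrative: the thread does not say how the bisection was actually run, `first_bad` and `is_bad` are hypothetical names, the commit hashes are the eight newest entries from the list above reversed into oldest-first order (assuming that list is in `git log` order), and `is_bad` is a stub standing in for "the release test OOMs at this commit".

```python
def first_bad(commits, is_bad):
    """Return the first bad commit in an oldest-first list.

    Assumes all commits before the culprit are good, all commits from the
    culprit onward are bad, and the last commit in the list is bad.
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid        # culprit is at mid or earlier
        else:
            lo = mid + 1    # culprit is after mid
    return commits[lo]


# Oldest first; a small slice of the commit range posted earlier in the thread.
commits = [
    "cfdc6e0e0b",
    "3e8278dc88",
    "add95611b7",
    "96efc33a66",
    "e6c67051b5",
    "dfdd43cf40",
    "0fc52f7a6e",
    "25b57d04c7",
]

tested = []


def is_bad(commit):
    # Stub for illustration: in practice this would mean running the release
    # test at that commit and checking whether it OOMs.
    tested.append(commit)
    return commits.index(commit) >= commits.index("3e8278dc88")


culprit = first_bad(commits, is_bad)
print("first bad commit:", culprit)
print("commits tested:", len(tested))
```

Each step halves the candidate range, which matters when a single run of the test takes hours.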

vitsai commented 1 year ago

This is merged into the release branch; can we close it soon?

ray-project / ray

Release test long_running_many_actor_tasks failed #40568