ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[weekly release] many_ppo weekly has memory leak causing gRPC deadline exceeds #19309

Closed xwjiang2010 closed 2 years ago

xwjiang2010 commented 3 years ago

Search before asking

Ray Component

Ray Core

What happened + What you expected to happen

Weekly many_ppo is failing due to grpc DEADLINE_EXCEEDED error. https://buildkite.com/ray-project/periodic-ci/builds/1207#69e7d975-4806-497c-bd14-0eaeba2fde30 Session log: https://beta.anyscale.com/o/anyscale-internal/projects/prj_2xR6uT6t7jJuu1aCwWMsle/clusters/ses_SkryRYXzKpiLTR6BeGs9is7B?command-history-section=command_history

Is my understanding correct that Ray Core uses gRPC for inter-node communication? Is this failure more a problem of Ray Core not recovering from inter-node failures, or is this something that applications should handle?

This is causing our weekly release test to be red, potentially delaying 1.8. Please help!

Reproduction script

NA

Anything else

No response

Are you willing to submit a PR?

xwjiang2010 commented 3 years ago

cc @matthewdeng @krfricke

rkooo567 commented 3 years ago

This must be the root cause:

(pid=772620) 2021-10-10 17:05:27,811    ERROR worker.py:425 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=772620, ip=172.31.87.239)
(pid=772620) ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ip-172-31-87-239 is used (29.42 / 30.58 GB). The top 10 memory consumers are:
(pid=772620) 
(pid=772620) PID        MEM     COMMAND
(pid=772620) 144        6.54GiB /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server --redis_address=172.
(pid=772620) 1105       6.41GiB python workloads/many_ppo.py
(pid=772620) 160        6.08GiB /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/dash
(pid=772620) 139        2.74GiB /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *
(pid=772620) 134        0.58GiB /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *
(pid=772620) 772556     0.52GiB ray::PPO.__init__()
(pid=772620) 772611     0.5GiB  ray::IDLE
(pid=772620) 772614     0.5GiB  ray::IDLE
(pid=772620) 772609     0.5GiB  ray::IDLE
(pid=772620) 772610     0.5GiB  ray::IDLE

Related: https://github.com/ray-project/ray/issues/18541#issuecomment-931470575

My guess is that the high memory usage causes some components to fail, which causes the autoscaler to crash (deadline exceeded can happen when a component is terminated).

rkooo567 commented 3 years ago

cc @gjoliver I remember you guys mentioned there was a memory leak issue in RLlib. Do you think it is the same issue here?

gjoliver commented 3 years ago

We found a memory leak, which made things a lot better, but it was inside the tf2 eager compute graph and was leaking only a small amount every iteration (many_ppo only runs 1 iter / stack). The fix is already merged, so if this is still happening, it's probably not related to the memory leak from RLlib.

scv119 commented 3 years ago

@xwjiang2010 looks like an RLlib memory leak issue? ^ If that's the case, feel free to assign it to the relevant owner.

xwjiang2010 commented 3 years ago

Hmm, as @gjoliver mentioned, I believe the RLlib memory leak issue was already fixed by this PR. This weekly release test runs at master commit 635010d460ba266b56fc857e56af8272ae08df8c, which is after Sven's fix for the memory leak landed. Maybe there are some other memory leaks?

rkooo567 commented 3 years ago

Hmm, in this case it is possible the issue is from Ray Core. I can take a look next sprint if that's the case.

scv119 commented 3 years ago

Reading the issue in #18541

1083    8.64GiB python workloads/many_ppo.py
155 5.71GiB /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/dash
141 4.46GiB /home/ray/anaconda3/lib/python3.8/site-packages/ray/core/src/ray/gcs/gcs_server --redis_address=172.
136 3.29GiB /home/ray/anaconda3/lib/python3.8/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *
131 0.77GiB /home/ray/anaconda3/lib/python3.8/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *
540673  0.51GiB ray::PPO
540789  0.48GiB ray::RolloutWorker
540784  0.48GiB ray::RolloutWorker
540792  0.48GiB ray::RolloutWorker
540793  0.48GiB ray::RolloutWorker

What looks similar is that many_ppo.py and the dashboard both had a lot of memory usage. @gjoliver did you have a chance to figure out the memory allocation of many_ppo.py? dashboard/dashboard.py looks pretty bad too. gcs_server also used a lot of memory; this looks abnormal to me @rkooo567

Also, @xwjiang2010: is this a regression or a new test? Our best bet is to reproduce this with a memory profiler.

xwjiang2010 commented 3 years ago

@krfricke Do you know if we have other long-running weekly tests at a similar scale that we could compare the memory consumption with? @sven1977 Do you have suggestions, as you have recently debugged (and found :) ) a tricky memory leak?

gjoliver commented 3 years ago

I took a look, and it didn't seem like a leak in RLlib code. What many_ppo does is basically run the same PyTorch stack 10K times, training only 1 step each time. @xwjiang2010, do you actually know if it's possible for Tune to hold onto all 10K trials and not release them?

That being said, @sven1977 has been using a secret tool really effectively to find the leak in our tf2 eager graph. I am sure he will be able to help take a look.

rkooo567 commented 3 years ago

So, for the gcs server: given that it holds the data it stores to Redis in memory as well, the memory size probably makes sense (+1 GB compared to Redis). I'm not sure the Redis memory usage itself makes sense, and I will do profiling for this.

Dashboard memory is a well-known issue.

many_ppo could be the core worker OR the application. Since Jun already addressed the RLlib issue, I will try profiling as well, and I think it will be more obvious with the profiling result.

rkooo567 commented 3 years ago

@scv119 will run the memory profiling, and he will let us know more details about the memory usage!

xwjiang2010 commented 3 years ago

@gjoliver I wouldn't be surprised if that's the case. @scv119 Can you grab me when you do it? Would love to learn the flow.

scv119 commented 3 years ago

cc @xwjiang2010 https://docs.google.com/document/d/1o6n3vDltJnekY4GAfm_pScHovwxC71VlJ19gU0hw5FM/edit#heading=h.wvgzvmfxc4e6

scv119 commented 3 years ago

I also captured a memory profile at some point last night: profile.pdf

gjoliver commented 3 years ago

how do you read it? 98% in jemalloc/internal/tsd.h?

scv119 commented 3 years ago

Yeah, it's not very obvious what's leaking. What I can tell is that most of the allocated memory is allocated by Python, not C++. Let me report back once I have a large heap (like 5-6 GiB).

rkooo567 commented 3 years ago

If the leak is from Python, we can just wrap the many_ppo script with the Python memory profiler.

rkooo567 commented 3 years ago

https://github.com/pythonprofilers/memory_profiler

rkooo567 commented 3 years ago

This might provide a better insight?
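
A minimal sketch of what that wrapping could look like; run_one_trial here is a hypothetical stand-in for whatever function in workloads/many_ppo.py drives a single trial:

from memory_profiler import profile  # pip install memory_profiler

@profile  # prints a line-by-line memory-increment report each time the function runs
def run_one_trial(i):
    # Hypothetical stand-in for the per-trial work in many_ppo.py.
    data = [x * x for x in range(100_000)]
    return sum(data)

if __name__ == "__main__":
    for i in range(5):
        run_one_trial(i)

# Alternatively, `mprof run workloads/many_ppo.py` samples the process RSS
# over time without any code changes, and `mprof plot` graphs it.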

rkooo567 commented 3 years ago

@krfricke

def update_last_result(self, result, terminate=False):
        if self.experiment_tag:
            result.update(experiment_tag=self.experiment_tag)

        self.set_location(Location(result.get(NODE_IP), result.get(PID)))
        self.last_result = result
        self.last_update_time = time.time()

        for metric, value in flatten_dict(result).items():
            if isinstance(value, Number):
                if metric not in self.metric_analysis:
                    self.metric_analysis[metric] = {
                        "max": value,
                        "min": value,
                        "avg": value,
                        "last": value
                    }
                    self.metric_n_steps[metric] = {}
                    for n in self.n_steps:
                        key = "last-{:d}-avg".format(n)
                        self.metric_analysis[metric][key] = value
                        # Store n as string for correct restore.
                        self.metric_n_steps[metric][str(n)] = deque(
                            [value], maxlen=n)

Is this

                        self.metric_n_steps[metric][str(n)] = deque(
                            [value], maxlen=n)

accumulated over time?
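
(For context, a small self-contained illustration, not Tune code, of the scaling question here: each deque is capped by maxlen=n, so a single metric stays bounded within a trial, but a separate dict of these deques is kept per trial for the lifetime of the run.)

from collections import deque

# Illustration only (not Tune code). Each deque is bounded by maxlen=n, so a
# single metric never holds more than n values. What can grow is the number
# of (trial, metric) entries kept alive on the driver: one dict of deques per
# trial, for every flattened metric in the result.
n_steps = [5, 10]

def make_metric_n_steps(num_metrics):
    return {
        "metric_{}".format(m): {str(n): deque([0.0], maxlen=n) for n in n_steps}
        for m in range(num_metrics)
    }

# A PPO result flattens to roughly 150 metrics (see the dump later in this
# thread), and one such dict is retained per trial.
retained = [make_metric_n_steps(150) for _ in range(1000)]
print(len(retained), "trials retained,", len(retained[0]), "metrics each")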

rkooo567 commented 3 years ago

@scv119 It also seems like the gcs server memory usage is only around ~100 MB at first.

I believe this workload doesn't "scale up"; it just repeats the same workload (cc @gjoliver, can you confirm?).

That said, the gcs server also seems to have some sort of leak.

scv119 commented 3 years ago

Another profile I got after running for 10+ hours. Looks like mostly just Python objects? profile.pdf

rkooo567 commented 3 years ago

+1 on @scv119's read; I think it is an application leak. I am trying to run the test with https://docs.python.org/3/library/tracemalloc.html#compute-differences and see if it can catch some memory issues (not sure if this is the best way to do Python memory profiling).
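
A minimal standalone sketch of that kind of snapshot comparison (not the actual release-test harness):

import tracemalloc

tracemalloc.start(25)            # keep up to 25 frames per allocation traceback
before = tracemalloc.take_snapshot()

# ... run a batch of trials here; a throwaway allocation stands in for the real work ...
stand_in = [list(range(1000)) for _ in range(1000)]

after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, "lineno")[:30]:
    print(stat)                  # size diff, count diff, and the allocating source line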

This is the first run, with a mistake (I took the snapshot at the end of the test), after a 1h 30m run (1000 trials): https://beta.anyscale.com/o/anyscale-internal/projects/prj_3qqS82y6R2UUTWG4oHeMpF/clusters/ses_1qyvtEPcT2MBeghFwRMptAz2?command-history-section=command_history

/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial.py:679: size=202 MiB, count=1013467, average=209 B
/home/ray/anaconda3/lib/python3.7/json/encoder.py:202: size=123 MiB, count=1000, average=126 KiB
/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/registry.py:168: size=90.6 MiB, count=919105, average=103 B
<frozen importlib._bootstrap_external>:525: size=90.1 MiB, count=733708, average=129 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial.py:676: size=39.5 MiB, count=140000, average=296 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/serialization.py:179: size=30.5 MiB, count=432186, average=74 B
/home/ray/anaconda3/lib/python3.7/abc.py:126: size=20.8 MiB, count=97324, average=224 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/serialization.py:41: size=20.2 MiB, count=206284, average=103 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial.py:675: size=15.9 MiB, count=280000, average=60 B
/home/ray/anaconda3/lib/python3.7/json/decoder.py:353: size=14.6 MiB, count=165678, average=92 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial.py:673: size=12.4 MiB, count=105636, average=123 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py:523: size=11.2 MiB, count=109515, average=107 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/ml_utils/dict.py:105: size=9260 KiB, count=121000, average=78 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/serialization.py:159: size=8778 KiB, count=91699, average=98 B
/home/ray/anaconda3/lib/python3.7/copy.py:275: size=6775 KiB, count=83188, average=83 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py:444: size=6708 KiB, count=47699, average=144 B
<frozen importlib._bootstrap>:219: size=5278 KiB, count=48427, average=112 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/serialization.py:19: size=4690 KiB, count=46017, average=104 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial.py:671: size=4524 KiB, count=1001, average=4627 B
/home/ray/anaconda3/lib/python3.7/copy.py:238: size=4221 KiB, count=27173, average=159 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/signature.py:71: size=3746 KiB, count=55364, average=69 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py:439: size=3590 KiB, count=46817, average=79 B
/home/ray/anaconda3/lib/python3.7/types.py:70: size=2794 KiB, count=15097, average=190 B
/home/ray/anaconda3/lib/python3.7/functools.py:60: size=2639 KiB, count=25716, average=105 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py:574: size=2129 KiB, count=30730, average=71 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/debug.py:24: size=2048 KiB, count=1, average=2048 KiB
/home/ray/anaconda3/lib/python3.7/inspect.py:2511: size=1962 KiB, count=16480, average=122 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/actor.py:217: size=1937 KiB, count=2569, average=772 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py:525: size=1837 KiB, count=857, average=2195 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/actor.py:244: size=1836 KiB, count=855, average=2199 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/actor.py:238: size=1836 KiB, count=855, average=2199 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle.py:707: size=1785 KiB, count=28559, average=64 B
<frozen importlib._bootstrap_external>:59: size=1605 KiB, count=10991, average=150 B
/home/ray/anaconda3/lib/python3.7/site-packages/yaml/nodes.py:29: size=1527 KiB, count=19545, average=80 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py:472: size=1519 KiB, count=2700, average=576 B
/home/ray/anaconda3/lib/python3.7/inspect.py:2154: size=1380 KiB, count=17509, average=81 B
/home/ray/anaconda3/lib/python3.7/inspect.py:2859: size=1379 KiB, count=16101, average=88 B
/home/ray/anaconda3/lib/python3.7/types.py:107: size=1290 KiB, count=16291, average=81 B
/home/ray/anaconda3/lib/python3.7/site-packages/torch/__init__.py:537: size=1282 KiB, count=6681, average=197 B
/home/ray/anaconda3/lib/python3.7/inspect.py:2797: size=1271 KiB, count=30819, average=42 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py:450: size=1239 KiB, count=3341, average=380 B
/home/ray/anaconda3/lib/python3.7/weakref.py:409: size=1204 KiB, count=10605, average=116 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/actor.py:378: size=1173 KiB, count=4812, average=250 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/actor.py:1041: size=1156 KiB, count=4240, average=279 B
/home/ray/anaconda3/lib/python3.7/inspect.py:2760: size=1113 KiB, count=7907, average=144 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/signature.py:77: size=1045 KiB, count=47699, average=22 B
/home/ray/anaconda3/lib/python3.7/collections/__init__.py:397: size=1021 KiB, count=9762, average=107 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py:666: size=1015 KiB, count=16240, average=64 B
/home/ray/anaconda3/lib/python3.7/linecache.py:137: size=994 KiB, count=9761, average=104 B
/home/ray/anaconda3/lib/python3.7/weakref.py:337: size=986 KiB, count=10522, average=96 B
/home/ray/anaconda3/lib/python3.7/collections/__init__.py:466: size=928 KiB, count=3781, average=251 B
/home/ray/anaconda3/lib/python3.7/abc.py:127: size=927 KiB, count=13275, average=71 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py:559: size=883 KiB, count=6273, average=144 B
/home/ray/anaconda3/lib/python3.7/site-packages/yaml/constructor.py:411: size=812 KiB, count=9861, average=84 B
/home/ray/anaconda3/lib/python3.7/types.py:74: size=769 KiB, count=10933, average=72 B
<string>:1: size=755 KiB, count=6891, average=112 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py:580: size=736 KiB, count=7840, average=96 B
/home/ray/anaconda3/lib/python3.7/site-packages/yaml/error.py:7: size=714 KiB, count=9136, average=80 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial.py:254: size=663 KiB, count=999, average=680 B
/home/ray/anaconda3/lib/python3.7/site-packages/yaml/events.py:67: size=625 KiB, count=8001, average=80 B
<frozen importlib._bootstrap_external>:916: size=611 KiB, count=5721, average=109 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py:517: size=607 KiB, count=7773, average=80 B
/home/ray/anaconda3/lib/python3.7/site-packages/tensorflow/python/util/tf_export.py:346: size=607 KiB, count=3962, average=157 B
/home/ray/anaconda3/lib/python3.7/inspect.py:2161: size=596 KiB, count=7578, average=81 B
/home/ray/anaconda3/lib/python3.7/inspect.py:2800: size=561 KiB, count=7468, average=77 B
/home/ray/anaconda3/lib/python3.7/site-packages/scipy/_lib/doccer.py:66: size=559 KiB, count=187, average=3063 B
/home/ray/anaconda3/lib/python3.7/site-packages/jax/_src/numpy/util.py:150: size=552 KiB, count=504, average=1121 B
/home/ray/anaconda3/lib/python3.7/site-packages/yaml/tokens.py:4: size=546 KiB, count=6995, average=80 B
<frozen importlib._bootstrap_external>:606: size=541 KiB, count=7833, average=71 B
<frozen importlib._bootstrap>:371: size=537 KiB, count=6873, average=80 B
/home/ray/anaconda3/lib/python3.7/types.py:120: size=514 KiB, count=3258, average=162 B
/home/ray/anaconda3/lib/python3.7/typing.py:708: size=504 KiB, count=2526, average=204 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/state.py:217: size=497 KiB, count=21174, average=24 B
/home/ray/anaconda3/lib/python3.7/inspect.py:2856: size=492 KiB, count=7873, average=64 B
/home/ray/anaconda3/lib/python3.7/inspect.py:2802: size=486 KiB, count=8891, average=56 B
<frozen importlib._bootstrap_external>:1416: size=485 KiB, count=907, average=547 B
<frozen importlib._bootstrap_external>:1408: size=451 KiB, count=7133, average=65 B
/home/ray/anaconda3/lib/python3.7/multiprocessing/queues.py:89: size=433 KiB, count=839, average=528 B
<frozen importlib._bootstrap>:36: size=424 KiB, count=4955, average=88 B
/home/ray/anaconda3/lib/python3.7/site-packages/botocore/client.py:369: size=420 KiB, count=1454, average=296 B
/home/ray/anaconda3/lib/python3.7/site-packages/google/protobuf/message.py:385: size=409 KiB, count=5232, average=80 B
/home/ray/anaconda3/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py:701: size=406 KiB, count=4211, average=99 B
/home/ray/anaconda3/lib/python3.7/threading.py:348: size=400 KiB, count=776, average=528 B
/home/ray/anaconda3/lib/python3.7/site-packages/tensorboardX/record_writer.py:61: size=361 KiB, count=3919, average=94 B
/home/ray/anaconda3/lib/python3.7/site-packages/tensorflow/python/util/decorator_utils.py:114: size=359 KiB, count=330, average=1114 B
/home/ray/anaconda3/lib/python3.7/site-packages/cryptography/hazmat/bindings/openssl/binding.py:102: size=341 KiB, count=2524, average=138 B
/home/ray/anaconda3/lib/python3.7/site-packages/yaml/events.py:24: size=332 KiB, count=4252, average=80 B
/home/ray/anaconda3/lib/python3.7/posixpath.py:92: size=326 KiB, count=3056, average=109 B
/home/ray/anaconda3/lib/python3.7/weakref.py:433: size=315 KiB, count=4003, average=81 B
<frozen importlib._bootstrap_external>:1352: size=314 KiB, count=5018, average=64 B
<frozen importlib._bootstrap>:971: size=308 KiB, count=1606, average=196 B
/home/ray/anaconda3/lib/python3.7/site-packages/botocore/docs/docstring.py:40: size=306 KiB, count=4861, average=64 B
/home/ray/anaconda3/lib/python3.7/site-packages/anyscale/sdk/anyscale_client/models/cluster_computes_query.py:21: size=290 KiB, count=13, average=22.3 KiB
<frozen importlib._bootstrap>:316: size=288 KiB, count=1, average=288 KiB
/home/ray/anaconda3/lib/python3.7/weakref.py:285: size=288 KiB, count=1, average=288 KiB
/home/ray/anaconda3/lib/python3.7/dataclasses.py:387: size=288 KiB, count=3194, average=92 B
<frozen importlib._bootstrap>:420: size=284 KiB, count=4124, average=71 B
/home/ray/anaconda3/lib/python3.7/site-packages/botocore/hooks.py:581: size=283 KiB, count=2396, average=121 B
<frozen importlib._bootstrap_external>:800: size=277 KiB, count=3544, average=80 B
<frozen importlib._bootstrap_external>:887: size=275 KiB, count=5940, average=47 B

I am running another round that compares snapshots from before and after the run (1000 trials, which take about 1+ hour): https://beta.anyscale.com/o/anyscale-internal/projects/prj_3qqS82y6R2UUTWG4oHeMpF/clusters/ses_dsZtr3J6LxNNjEvGGeLHSChg

rkooo567 commented 3 years ago

(Actually, the best way might be to take the snapshot in the middle of the run, not at the end, e.g., doing the comparison within the callback?) Maybe @krfricke or @gjoliver can help here?

rkooo567 commented 3 years ago

I am running the test with a tracemalloc snapshot comparison inside the Tune callback. (In the command output, you can search for "Top 30".) https://beta.anyscale.com/o/anyscale-internal/projects/prj_3qqS82y6R2UUTWG4oHeMpF/clusters/ses_xkSXQ7QaTGdStETQLjZ4Ge7b?command-history-section=command_history

(But I am not sure if this captures the memory usage correctly, because the driver memory usage seems to be 3+ GB while the top memory usage reported from Tune is only ~200 MB.)
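
For reference, a rough sketch of that kind of callback (assuming the Ray 1.x ray.tune.Callback interface; the actual instrumentation used in the run above may differ):

import tracemalloc
from ray.tune import Callback

class TracemallocCallback(Callback):
    """Print the top allocation diffs every `interval` completed trials."""

    def __init__(self, interval=100):
        tracemalloc.start(10)
        self._baseline = tracemalloc.take_snapshot()
        self._interval = interval
        self._done = 0

    def on_trial_complete(self, iteration, trials, trial, **info):
        self._done += 1
        if self._done % self._interval == 0:
            diffs = tracemalloc.take_snapshot().compare_to(self._baseline, "lineno")
            print("Top 30 after {} trials".format(self._done))
            for stat in diffs[:30]:
                print(stat)

# Passed to the experiment via tune.run(..., callbacks=[TracemallocCallback()]).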

(Screenshot: Screen Shot 2021-10-14 at 8 28 40 AM)

xwjiang2010 commented 3 years ago

Some updates:

  1. The PPO job was launched on a node with 2X memory. We want to see whether, with slightly relaxed memory constraints, we can run the job to completion (24h).
  2. On profiling: I did some profiling using tracemalloc. Overall I got the same result as @rkooo567. The top offenders are mainly from trial.py:679/676, registry.py:168, json/encoder.py:202, and serialization.py.

After running ~1300 or so trials, the top ~17 offenders accumulated ~720 MB, which translates to ~5.5 GB if we scale to 10,000 trials. This is roughly consistent with @rkooo567's result (~6 GB consumed by the driver process at the time of the crash).
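
(For the scaling arithmetic: 720 MB / ~1300 trials ≈ 0.55 MB per trial, so 10,000 trials comes out to roughly 5.5 GB.)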

xwjiang2010 commented 3 years ago

The many_ppo test passed on a 2X machine last night: https://buildkite.com/ray-project/periodic-ci/builds/1265

xwjiang2010 commented 3 years ago

For every trial, we are maintaining most notably two dictionaries: trial.metric_n_steps and trial.metric_analysis.

These are the content from the PPO run:

{'episode_reward_max': {'5': deque([87.0], maxlen=5), '10': deque([87.0], maxlen=10)}, 'episode_reward_min': {'5': deque([8.0], maxlen=5), '10': deque([8.0], maxlen=10)}, 'episode_reward_mean': {'5': deque([23.108187134502923], maxlen=5), '10': deque([23.108187134502923], maxlen=10)}, 'episode_len_mean': {'5': deque([23.108187134502923], maxlen=5), '10': deque([23.108187134502923], maxlen=10)}, 'episodes_this_iter': {'5': deque([342], maxlen=5), '10': deque([342], maxlen=10)}, 'num_healthy_workers': {'5': deque([7], maxlen=5), '10': deque([7], maxlen=10)}, 'timesteps_total': {'5': deque([7994], maxlen=5), '10': deque([7994], maxlen=10)}, 'timesteps_this_iter': {'5': deque([0], maxlen=5), '10': deque([0], maxlen=10)}, 'agent_timesteps_total': {'5': deque([7994], maxlen=5), '10': deque([7994], maxlen=10)}, 'done': {'5': deque([True], maxlen=5), '10': deque([True], maxlen=10)}, 'episodes_total': {'5': deque([342], maxlen=5), '10': deque([342], maxlen=10)}, 'training_iteration': {'5': deque([1], maxlen=5), '10': deque([1], maxlen=10)}, 'timestamp': {'5': deque([1634746872], maxlen=5), '10': deque([1634746872], maxlen=10)}, 'time_this_iter_s': {'5': deque([2.7724995613098145], maxlen=5), '10': deque([2.7724995613098145], maxlen=10)}, 'time_total_s': {'5': deque([2.7724995613098145], maxlen=5), '10': deque([2.7724995613098145], maxlen=10)}, 'pid': {'5': deque([399], maxlen=5), '10': deque([399], maxlen=10)}, 'time_since_restore': {'5': deque([2.7724995613098145], maxlen=5), '10': deque([2.7724995613098145], maxlen=10)}, 'timesteps_since_restore': {'5': deque([0], maxlen=5), '10': deque([0], maxlen=10)}, 'iterations_since_restore': {'5': deque([1], maxlen=5), '10': deque([1], maxlen=10)}, 'sampler_perf/mean_raw_obs_processing_ms': {'5': deque([0.14265826358197845], maxlen=5), '10': deque([0.14265826358197845], maxlen=10)}, 'sampler_perf/mean_inference_ms': {'5': deque([1.5106409359515482], maxlen=5), '10': deque([1.5106409359515482], maxlen=10)}, 'sampler_perf/mean_action_processing_ms': {'5': deque([0.06872722420235718], maxlen=5), '10': deque([0.06872722420235718], maxlen=10)}, 'sampler_perf/mean_env_wait_ms': {'5': deque([0.08027102820580613], maxlen=5), '10': deque([0.08027102820580613], maxlen=10)}, 'sampler_perf/mean_env_render_ms': {'5': deque([0.0], maxlen=5), '10': deque([0.0], maxlen=10)}, 'timers/sample_time_ms': {'5': deque([2407.739], maxlen=5), '10': deque([2407.739], maxlen=10)}, 'timers/sample_throughput': {'5': deque([3320.128], maxlen=5), '10': deque([3320.128], maxlen=10)}, 'timers/load_time_ms': {'5': deque([0.453], maxlen=5), '10': deque([0.453], maxlen=10)}, 'timers/load_throughput': {'5': deque([17637699.198], maxlen=5), '10': deque([17637699.198], maxlen=10)}, 'timers/learn_time_ms': {'5': deque([362.371], maxlen=5), '10': deque([362.371], maxlen=10)}, 'timers/learn_throughput': {'5': deque([22060.237], maxlen=5), '10': deque([22060.237], maxlen=10)}, 'timers/update_time_ms': {'5': deque([1.931], maxlen=5), '10': deque([1.931], maxlen=10)}, 'info/num_steps_sampled': {'5': deque([7994], maxlen=5), '10': deque([7994], maxlen=10)}, 'info/num_agent_steps_sampled': {'5': deque([7994], maxlen=5), '10': deque([7994], maxlen=10)}, 'info/num_steps_trained': {'5': deque([7994], maxlen=5), '10': deque([7994], maxlen=10)}, 'info/num_agent_steps_trained': {'5': deque([7994], maxlen=5), '10': deque([7994], maxlen=10)}, 'config/num_workers': {'5': deque([7], maxlen=5), '10': deque([7], maxlen=10)}, 'config/num_envs_per_worker': {'5': deque([1], maxlen=5), '10': deque([1], maxlen=10)}, 
'config/create_env_on_driver': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/rollout_fragment_length': {'5': deque([571], maxlen=5), '10': deque([571], maxlen=10)}, 'config/gamma': {'5': deque([0.99], maxlen=5), '10': deque([0.99], maxlen=10)}, 'config/lr': {'5': deque([5e-05], maxlen=5), '10': deque([5e-05], maxlen=10)}, 'config/train_batch_size': {'5': deque([4000], maxlen=5), '10': deque([4000], maxlen=10)}, 'config/soft_horizon': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/no_done_at_end': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/remote_worker_envs': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/remote_env_batch_wait_ms': {'5': deque([0], maxlen=5), '10': deque([0], maxlen=10)}, 'config/render_env': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/record_env': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/normalize_actions': {'5': deque([True], maxlen=5), '10': deque([True], maxlen=10)}, 'config/clip_actions': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/ignore_worker_failures': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/log_sys_usage': {'5': deque([True], maxlen=5), '10': deque([True], maxlen=10)}, 'config/fake_sampler': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/eager_tracing': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/explore': {'5': deque([True], maxlen=5), '10': deque([True], maxlen=10)}, 'config/evaluation_num_episodes': {'5': deque([10], maxlen=5), '10': deque([10], maxlen=10)}, 'config/evaluation_parallel_to_training': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/in_evaluation': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/evaluation_num_workers': {'5': deque([0], maxlen=5), '10': deque([0], maxlen=10)}, 'config/sample_async': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/synchronize_filters': {'5': deque([True], maxlen=5), '10': deque([True], maxlen=10)}, 'config/compress_observations': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/collect_metrics_timeout': {'5': deque([180], maxlen=5), '10': deque([180], maxlen=10)}, 'config/metrics_smoothing_episodes': {'5': deque([100], maxlen=5), '10': deque([100], maxlen=10)}, 'config/min_iter_time_s': {'5': deque([0], maxlen=5), '10': deque([0], maxlen=10)}, 'config/timesteps_per_iteration': {'5': deque([0], maxlen=5), '10': deque([0], maxlen=10)}, 'config/num_gpus': {'5': deque([0], maxlen=5), '10': deque([0], maxlen=10)}, 'config/_fake_gpus': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/num_cpus_per_worker': {'5': deque([1], maxlen=5), '10': deque([1], maxlen=10)}, 'config/num_gpus_per_worker': {'5': deque([0], maxlen=5), '10': deque([0], maxlen=10)}, 'config/num_cpus_for_driver': {'5': deque([1], maxlen=5), '10': deque([1], maxlen=10)}, 'config/actions_in_input_normalized': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/postprocess_inputs': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/shuffle_buffer_size': {'5': deque([0], maxlen=5), '10': deque([0], maxlen=10)}, 'config/output_max_file_size': {'5': deque([67108864], maxlen=5), '10': deque([67108864], maxlen=10)}, 'config/_tf_policy_handles_more_than_one_loss': {'5': deque([False], maxlen=5), '10': 
deque([False], maxlen=10)}, 'config/_disable_preprocessor_api': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/simple_optimizer': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/monitor': {'5': deque([-1], maxlen=5), '10': deque([-1], maxlen=10)}, 'config/use_critic': {'5': deque([True], maxlen=5), '10': deque([True], maxlen=10)}, 'config/use_gae': {'5': deque([True], maxlen=5), '10': deque([True], maxlen=10)}, 'config/lambda': {'5': deque([1.0], maxlen=5), '10': deque([1.0], maxlen=10)}, 'config/kl_coeff': {'5': deque([0.2], maxlen=5), '10': deque([0.2], maxlen=10)}, 'config/sgd_minibatch_size': {'5': deque([128], maxlen=5), '10': deque([128], maxlen=10)}, 'config/shuffle_sequences': {'5': deque([True], maxlen=5), '10': deque([True], maxlen=10)}, 'config/num_sgd_iter': {'5': deque([1], maxlen=5), '10': deque([1], maxlen=10)}, 'config/vf_loss_coeff': {'5': deque([1.0], maxlen=5), '10': deque([1.0], maxlen=10)}, 'config/entropy_coeff': {'5': deque([0.0], maxlen=5), '10': deque([0.0], maxlen=10)}, 'config/clip_param': {'5': deque([0.3], maxlen=5), '10': deque([0.3], maxlen=10)}, 'config/vf_clip_param': {'5': deque([10.0], maxlen=5), '10': deque([10.0], maxlen=10)}, 'config/kl_target': {'5': deque([0.01], maxlen=5), '10': deque([0.01], maxlen=10)}, 'config/vf_share_layers': {'5': deque([-1], maxlen=5), '10': deque([-1], maxlen=10)}, 'perf/cpu_util_percent': {'5': deque([74.125], maxlen=5), '10': deque([74.125], maxlen=10)}, 'perf/ram_util_percent': {'5': deque([17.4], maxlen=5), '10': deque([17.4], maxlen=10)}, 'config/model/_use_default_native_models': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/model/_disable_preprocessor_api': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/model/free_log_std': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/model/no_final_linear': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/model/vf_share_layers': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/model/use_lstm': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/model/max_seq_len': {'5': deque([20], maxlen=5), '10': deque([20], maxlen=10)}, 'config/model/lstm_cell_size': {'5': deque([256], maxlen=5), '10': deque([256], maxlen=10)}, 'config/model/lstm_use_prev_action': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/model/lstm_use_prev_reward': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/model/_time_major': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/model/use_attention': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/model/attention_num_transformer_units': {'5': deque([1], maxlen=5), '10': deque([1], maxlen=10)}, 'config/model/attention_dim': {'5': deque([64], maxlen=5), '10': deque([64], maxlen=10)}, 'config/model/attention_num_heads': {'5': deque([1], maxlen=5), '10': deque([1], maxlen=10)}, 'config/model/attention_head_dim': {'5': deque([32], maxlen=5), '10': deque([32], maxlen=10)}, 'config/model/attention_memory_inference': {'5': deque([50], maxlen=5), '10': deque([50], maxlen=10)}, 'config/model/attention_memory_training': {'5': deque([50], maxlen=5), '10': deque([50], maxlen=10)}, 'config/model/attention_position_wise_mlp_dim': {'5': deque([32], maxlen=5), '10': deque([32], maxlen=10)}, 'config/model/attention_init_gru_gate_bias': {'5': deque([2.0], maxlen=5), '10': deque([2.0], 
maxlen=10)}, 'config/model/attention_use_n_prev_actions': {'5': deque([0], maxlen=5), '10': deque([0], maxlen=10)}, 'config/model/attention_use_n_prev_rewards': {'5': deque([0], maxlen=5), '10': deque([0], maxlen=10)}, 'config/model/framestack': {'5': deque([True], maxlen=5), '10': deque([True], maxlen=10)}, 'config/model/dim': {'5': deque([84], maxlen=5), '10': deque([84], maxlen=10)}, 'config/model/grayscale': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/model/zero_mean': {'5': deque([True], maxlen=5), '10': deque([True], maxlen=10)}, 'config/model/lstm_use_prev_action_reward': {'5': deque([-1], maxlen=5), '10': deque([-1], maxlen=10)}, 'config/tf_session_args/intra_op_parallelism_threads': {'5': deque([2], maxlen=5), '10': deque([2], maxlen=10)}, 'config/tf_session_args/inter_op_parallelism_threads': {'5': deque([2], maxlen=5), '10': deque([2], maxlen=10)}, 'config/tf_session_args/log_device_placement': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/tf_session_args/allow_soft_placement': {'5': deque([True], maxlen=5), '10': deque([True], maxlen=10)}, 'config/local_tf_session_args/intra_op_parallelism_threads': {'5': deque([8], maxlen=5), '10': deque([8], maxlen=10)}, 'config/local_tf_session_args/inter_op_parallelism_threads': {'5': deque([8], maxlen=5), '10': deque([8], maxlen=10)}, 'config/multiagent/policy_map_capacity': {'5': deque([100], maxlen=5), '10': deque([100], maxlen=10)}, 'config/tf_session_args/gpu_options/allow_growth': {'5': deque([True], maxlen=5), '10': deque([True], maxlen=10)}, 'config/tf_session_args/device_count/CPU': {'5': deque([1], maxlen=5), '10': deque([1], maxlen=10)}, 'info/learner/default_policy/learner_stats/allreduce_latency': {'5': deque([0.0], maxlen=5), '10': deque([0.0], maxlen=10)}, 'info/learner/default_policy/learner_stats/cur_kl_coeff': {'5': deque([0.19999999999999993], maxlen=5), '10': deque([0.19999999999999993], maxlen=10)}, 'info/learner/default_policy/learner_stats/cur_lr': {'5': deque([5.0000000000000016e-05], maxlen=5), '10': deque([5.0000000000000016e-05], maxlen=10)}, 'info/learner/default_policy/learner_stats/total_loss': {'5': deque([289.1911126413653], maxlen=5), '10': deque([289.1911126413653], maxlen=10)}, 'info/learner/default_policy/learner_stats/policy_loss': {'5': deque([-0.014920547394262205], maxlen=5), '10': deque([-0.014920547394262205], maxlen=10)}, 'info/learner/default_policy/learner_stats/vf_loss': {'5': deque([289.2044420549947], maxlen=5), '10': deque([289.2044420549947], maxlen=10)}, 'info/learner/default_policy/learner_stats/vf_explained_var': {'5': deque([-3.1133813242758476e-05], maxlen=5), '10': deque([-3.1133813242758476e-05], maxlen=10)}, 'info/learner/default_policy/learner_stats/kl': {'5': deque([0.007963308195468869], maxlen=5), '10': deque([0.007963308195468869], maxlen=10)}, 'info/learner/default_policy/learner_stats/entropy': {'5': deque([0.6850801167949554], maxlen=5), '10': deque([0.6850801167949554], maxlen=10)}, 'info/learner/default_policy/learner_stats/entropy_coeff': {'5': deque([0.0], maxlen=5), '10': deque([0.0], maxlen=10)}}

and

{'episode_reward_max': {'max': 87.0, 'min': 87.0, 'avg': 87.0, 'last': 87.0, 'last-5-avg': 87.0, 'last-10-avg': 87.0}, 'episode_reward_min': {'max': 8.0, 'min': 8.0, 'avg': 8.0, 'last': 8.0, 'last-5-avg': 8.0, 'last-10-avg': 8.0}, 'episode_reward_mean': {'max': 23.108187134502923, 'min': 23.108187134502923, 'avg': 23.108187134502923, 'last': 23.108187134502923, 'last-5-avg': 23.108187134502923, 'last-10-avg': 23.108187134502923}, 'episode_len_mean': {'max': 23.108187134502923, 'min': 23.108187134502923, 'avg': 23.108187134502923, 'last': 23.108187134502923, 'last-5-avg': 23.108187134502923, 'last-10-avg': 23.108187134502923}, 'episodes_this_iter': {'max': 342, 'min': 342, 'avg': 342, 'last': 342, 'last-5-avg': 342, 'last-10-avg': 342}, 'num_healthy_workers': {'max': 7, 'min': 7, 'avg': 7, 'last': 7, 'last-5-avg': 7, 'last-10-avg': 7}, 'timesteps_total': {'max': 7994, 'min': 7994, 'avg': 7994, 'last': 7994, 'last-5-avg': 7994, 'last-10-avg': 7994}, 'timesteps_this_iter': {'max': 0, 'min': 0, 'avg': 0, 'last': 0, 'last-5-avg': 0, 'last-10-avg': 0}, 'agent_timesteps_total': {'max': 7994, 'min': 7994, 'avg': 7994, 'last': 7994, 'last-5-avg': 7994, 'last-10-avg': 7994}, 'done': {'max': True, 'min': True, 'avg': True, 'last': True, 'last-5-avg': True, 'last-10-avg': True}, 'episodes_total': {'max': 342, 'min': 342, 'avg': 342, 'last': 342, 'last-5-avg': 342, 'last-10-avg': 342}, 'training_iteration': {'max': 1, 'min': 1, 'avg': 1, 'last': 1, 'last-5-avg': 1, 'last-10-avg': 1}, 'timestamp': {'max': 1634746872, 'min': 1634746872, 'avg': 1634746872, 'last': 1634746872, 'last-5-avg': 1634746872, 'last-10-avg': 1634746872}, 'time_this_iter_s': {'max': 2.7724995613098145, 'min': 2.7724995613098145, 'avg': 2.7724995613098145, 'last': 2.7724995613098145, 'last-5-avg': 2.7724995613098145, 'last-10-avg': 2.7724995613098145}, 'time_total_s': {'max': 2.7724995613098145, 'min': 2.7724995613098145, 'avg': 2.7724995613098145, 'last': 2.7724995613098145, 'last-5-avg': 2.7724995613098145, 'last-10-avg': 2.7724995613098145}, 'pid': {'max': 399, 'min': 399, 'avg': 399, 'last': 399, 'last-5-avg': 399, 'last-10-avg': 399}, 'time_since_restore': {'max': 2.7724995613098145, 'min': 2.7724995613098145, 'avg': 2.7724995613098145, 'last': 2.7724995613098145, 'last-5-avg': 2.7724995613098145, 'last-10-avg': 2.7724995613098145}, 'timesteps_since_restore': {'max': 0, 'min': 0, 'avg': 0, 'last': 0, 'last-5-avg': 0, 'last-10-avg': 0}, 'iterations_since_restore': {'max': 1, 'min': 1, 'avg': 1, 'last': 1, 'last-5-avg': 1, 'last-10-avg': 1}, 'sampler_perf/mean_raw_obs_processing_ms': {'max': 0.14265826358197845, 'min': 0.14265826358197845, 'avg': 0.14265826358197845, 'last': 0.14265826358197845, 'last-5-avg': 0.14265826358197845, 'last-10-avg': 0.14265826358197845}, 'sampler_perf/mean_inference_ms': {'max': 1.5106409359515482, 'min': 1.5106409359515482, 'avg': 1.5106409359515482, 'last': 1.5106409359515482, 'last-5-avg': 1.5106409359515482, 'last-10-avg': 1.5106409359515482}, 'sampler_perf/mean_action_processing_ms': {'max': 0.06872722420235718, 'min': 0.06872722420235718, 'avg': 0.06872722420235718, 'last': 0.06872722420235718, 'last-5-avg': 0.06872722420235718, 'last-10-avg': 0.06872722420235718}, 'sampler_perf/mean_env_wait_ms': {'max': 0.08027102820580613, 'min': 0.08027102820580613, 'avg': 0.08027102820580613, 'last': 0.08027102820580613, 'last-5-avg': 0.08027102820580613, 'last-10-avg': 0.08027102820580613}, 'sampler_perf/mean_env_render_ms': {'max': 0.0, 'min': 0.0, 'avg': 0.0, 'last': 0.0, 'last-5-avg': 0.0, 
'last-10-avg': 0.0}, 'timers/sample_time_ms': {'max': 2407.739, 'min': 2407.739, 'avg': 2407.739, 'last': 2407.739, 'last-5-avg': 2407.739, 'last-10-avg': 2407.739}, 'timers/sample_throughput': {'max': 3320.128, 'min': 3320.128, 'avg': 3320.128, 'last': 3320.128, 'last-5-avg': 3320.128, 'last-10-avg': 3320.128}, 'timers/load_time_ms': {'max': 0.453, 'min': 0.453, 'avg': 0.453, 'last': 0.453, 'last-5-avg': 0.453, 'last-10-avg': 0.453}, 'timers/load_throughput': {'max': 17637699.198, 'min': 17637699.198, 'avg': 17637699.198, 'last': 17637699.198, 'last-5-avg': 17637699.198, 'last-10-avg': 17637699.198}, 'timers/learn_time_ms': {'max': 362.371, 'min': 362.371, 'avg': 362.371, 'last': 362.371, 'last-5-avg': 362.371, 'last-10-avg': 362.371}, 'timers/learn_throughput': {'max': 22060.237, 'min': 22060.237, 'avg': 22060.237, 'last': 22060.237, 'last-5-avg': 22060.237, 'last-10-avg': 22060.237}, 'timers/update_time_ms': {'max': 1.931, 'min': 1.931, 'avg': 1.931, 'last': 1.931, 'last-5-avg': 1.931, 'last-10-avg': 1.931}, 'info/num_steps_sampled': {'max': 7994, 'min': 7994, 'avg': 7994, 'last': 7994, 'last-5-avg': 7994, 'last-10-avg': 7994}, 'info/num_agent_steps_sampled': {'max': 7994, 'min': 7994, 'avg': 7994, 'last': 7994, 'last-5-avg': 7994, 'last-10-avg': 7994}, 'info/num_steps_trained': {'max': 7994, 'min': 7994, 'avg': 7994, 'last': 7994, 'last-5-avg': 7994, 'last-10-avg': 7994}, 'info/num_agent_steps_trained': {'max': 7994, 'min': 7994, 'avg': 7994, 'last': 7994, 'last-5-avg': 7994, 'last-10-avg': 7994}, 'config/num_workers': {'max': 7, 'min': 7, 'avg': 7, 'last': 7, 'last-5-avg': 7, 'last-10-avg': 7}, 'config/num_envs_per_worker': {'max': 1, 'min': 1, 'avg': 1, 'last': 1, 'last-5-avg': 1, 'last-10-avg': 1}, 'config/create_env_on_driver': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/rollout_fragment_length': {'max': 571, 'min': 571, 'avg': 571, 'last': 571, 'last-5-avg': 571, 'last-10-avg': 571}, 'config/gamma': {'max': 0.99, 'min': 0.99, 'avg': 0.99, 'last': 0.99, 'last-5-avg': 0.99, 'last-10-avg': 0.99}, 'config/lr': {'max': 5e-05, 'min': 5e-05, 'avg': 5e-05, 'last': 5e-05, 'last-5-avg': 5e-05, 'last-10-avg': 5e-05}, 'config/train_batch_size': {'max': 4000, 'min': 4000, 'avg': 4000, 'last': 4000, 'last-5-avg': 4000, 'last-10-avg': 4000}, 'config/soft_horizon': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/no_done_at_end': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/remote_worker_envs': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/remote_env_batch_wait_ms': {'max': 0, 'min': 0, 'avg': 0, 'last': 0, 'last-5-avg': 0, 'last-10-avg': 0}, 'config/render_env': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/record_env': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/normalize_actions': {'max': True, 'min': True, 'avg': True, 'last': True, 'last-5-avg': True, 'last-10-avg': True}, 'config/clip_actions': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/ignore_worker_failures': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/log_sys_usage': {'max': True, 'min': True, 'avg': True, 'last': True, 
'last-5-avg': True, 'last-10-avg': True}, 'config/fake_sampler': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/eager_tracing': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/explore': {'max': True, 'min': True, 'avg': True, 'last': True, 'last-5-avg': True, 'last-10-avg': True}, 'config/evaluation_num_episodes': {'max': 10, 'min': 10, 'avg': 10, 'last': 10, 'last-5-avg': 10, 'last-10-avg': 10}, 'config/evaluation_parallel_to_training': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/in_evaluation': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/evaluation_num_workers': {'max': 0, 'min': 0, 'avg': 0, 'last': 0, 'last-5-avg': 0, 'last-10-avg': 0}, 'config/sample_async': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/synchronize_filters': {'max': True, 'min': True, 'avg': True, 'last': True, 'last-5-avg': True, 'last-10-avg': True}, 'config/compress_observations': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/collect_metrics_timeout': {'max': 180, 'min': 180, 'avg': 180, 'last': 180, 'last-5-avg': 180, 'last-10-avg': 180}, 'config/metrics_smoothing_episodes': {'max': 100, 'min': 100, 'avg': 100, 'last': 100, 'last-5-avg': 100, 'last-10-avg': 100}, 'config/min_iter_time_s': {'max': 0, 'min': 0, 'avg': 0, 'last': 0, 'last-5-avg': 0, 'last-10-avg': 0}, 'config/timesteps_per_iteration': {'max': 0, 'min': 0, 'avg': 0, 'last': 0, 'last-5-avg': 0, 'last-10-avg': 0}, 'config/num_gpus': {'max': 0, 'min': 0, 'avg': 0, 'last': 0, 'last-5-avg': 0, 'last-10-avg': 0}, 'config/_fake_gpus': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/num_cpus_per_worker': {'max': 1, 'min': 1, 'avg': 1, 'last': 1, 'last-5-avg': 1, 'last-10-avg': 1}, 'config/num_gpus_per_worker': {'max': 0, 'min': 0, 'avg': 0, 'last': 0, 'last-5-avg': 0, 'last-10-avg': 0}, 'config/num_cpus_for_driver': {'max': 1, 'min': 1, 'avg': 1, 'last': 1, 'last-5-avg': 1, 'last-10-avg': 1}, 'config/actions_in_input_normalized': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/postprocess_inputs': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/shuffle_buffer_size': {'max': 0, 'min': 0, 'avg': 0, 'last': 0, 'last-5-avg': 0, 'last-10-avg': 0}, 'config/output_max_file_size': {'max': 67108864, 'min': 67108864, 'avg': 67108864, 'last': 67108864, 'last-5-avg': 67108864, 'last-10-avg': 67108864}, 'config/_tf_policy_handles_more_than_one_loss': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/_disable_preprocessor_api': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/simple_optimizer': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/monitor': {'max': -1, 'min': -1, 'avg': -1, 'last': -1, 'last-5-avg': -1, 'last-10-avg': -1}, 'config/use_critic': {'max': True, 'min': True, 'avg': True, 'last': True, 'last-5-avg': True, 'last-10-avg': True}, 'config/use_gae': {'max': True, 'min': True, 'avg': True, 'last': True, 'last-5-avg': 
True, 'last-10-avg': True}, 'config/lambda': {'max': 1.0, 'min': 1.0, 'avg': 1.0, 'last': 1.0, 'last-5-avg': 1.0, 'last-10-avg': 1.0}, 'config/kl_coeff': {'max': 0.2, 'min': 0.2, 'avg': 0.2, 'last': 0.2, 'last-5-avg': 0.2, 'last-10-avg': 0.2}, 'config/sgd_minibatch_size': {'max': 128, 'min': 128, 'avg': 128, 'last': 128, 'last-5-avg': 128, 'last-10-avg': 128}, 'config/shuffle_sequences': {'max': True, 'min': True, 'avg': True, 'last': True, 'last-5-avg': True, 'last-10-avg': True}, 'config/num_sgd_iter': {'max': 1, 'min': 1, 'avg': 1, 'last': 1, 'last-5-avg': 1, 'last-10-avg': 1}, 'config/vf_loss_coeff': {'max': 1.0, 'min': 1.0, 'avg': 1.0, 'last': 1.0, 'last-5-avg': 1.0, 'last-10-avg': 1.0}, 'config/entropy_coeff': {'max': 0.0, 'min': 0.0, 'avg': 0.0, 'last': 0.0, 'last-5-avg': 0.0, 'last-10-avg': 0.0}, 'config/clip_param': {'max': 0.3, 'min': 0.3, 'avg': 0.3, 'last': 0.3, 'last-5-avg': 0.3, 'last-10-avg': 0.3}, 'config/vf_clip_param': {'max': 10.0, 'min': 10.0, 'avg': 10.0, 'last': 10.0, 'last-5-avg': 10.0, 'last-10-avg': 10.0}, 'config/kl_target': {'max': 0.01, 'min': 0.01, 'avg': 0.01, 'last': 0.01, 'last-5-avg': 0.01, 'last-10-avg': 0.01}, 'config/vf_share_layers': {'max': -1, 'min': -1, 'avg': -1, 'last': -1, 'last-5-avg': -1, 'last-10-avg': -1}, 'perf/cpu_util_percent': {'max': 74.125, 'min': 74.125, 'avg': 74.125, 'last': 74.125, 'last-5-avg': 74.125, 'last-10-avg': 74.125}, 'perf/ram_util_percent': {'max': 17.4, 'min': 17.4, 'avg': 17.4, 'last': 17.4, 'last-5-avg': 17.4, 'last-10-avg': 17.4}, 'config/model/_use_default_native_models': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/model/_disable_preprocessor_api': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/model/free_log_std': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/model/no_final_linear': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/model/vf_share_layers': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/model/use_lstm': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/model/max_seq_len': {'max': 20, 'min': 20, 'avg': 20, 'last': 20, 'last-5-avg': 20, 'last-10-avg': 20}, 'config/model/lstm_cell_size': {'max': 256, 'min': 256, 'avg': 256, 'last': 256, 'last-5-avg': 256, 'last-10-avg': 256}, 'config/model/lstm_use_prev_action': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/model/lstm_use_prev_reward': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/model/_time_major': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/model/use_attention': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/model/attention_num_transformer_units': {'max': 1, 'min': 1, 'avg': 1, 'last': 1, 'last-5-avg': 1, 'last-10-avg': 1}, 'config/model/attention_dim': {'max': 64, 'min': 64, 'avg': 64, 'last': 64, 'last-5-avg': 64, 'last-10-avg': 64}, 'config/model/attention_num_heads': {'max': 1, 'min': 1, 'avg': 1, 'last': 1, 'last-5-avg': 1, 'last-10-avg': 1}, 'config/model/attention_head_dim': {'max': 32, 'min': 32, 'avg': 32, 'last': 32, 
'last-5-avg': 32, 'last-10-avg': 32}, 'config/model/attention_memory_inference': {'max': 50, 'min': 50, 'avg': 50, 'last': 50, 'last-5-avg': 50, 'last-10-avg': 50}, 'config/model/attention_memory_training': {'max': 50, 'min': 50, 'avg': 50, 'last': 50, 'last-5-avg': 50, 'last-10-avg': 50}, 'config/model/attention_position_wise_mlp_dim': {'max': 32, 'min': 32, 'avg': 32, 'last': 32, 'last-5-avg': 32, 'last-10-avg': 32}, 'config/model/attention_init_gru_gate_bias': {'max': 2.0, 'min': 2.0, 'avg': 2.0, 'last': 2.0, 'last-5-avg': 2.0, 'last-10-avg': 2.0}, 'config/model/attention_use_n_prev_actions': {'max': 0, 'min': 0, 'avg': 0, 'last': 0, 'last-5-avg': 0, 'last-10-avg': 0}, 'config/model/attention_use_n_prev_rewards': {'max': 0, 'min': 0, 'avg': 0, 'last': 0, 'last-5-avg': 0, 'last-10-avg': 0}, 'config/model/framestack': {'max': True, 'min': True, 'avg': True, 'last': True, 'last-5-avg': True, 'last-10-avg': True}, 'config/model/dim': {'max': 84, 'min': 84, 'avg': 84, 'last': 84, 'last-5-avg': 84, 'last-10-avg': 84}, 'config/model/grayscale': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/model/zero_mean': {'max': True, 'min': True, 'avg': True, 'last': True, 'last-5-avg': True, 'last-10-avg': True}, 'config/model/lstm_use_prev_action_reward': {'max': -1, 'min': -1, 'avg': -1, 'last': -1, 'last-5-avg': -1, 'last-10-avg': -1}, 'config/tf_session_args/intra_op_parallelism_threads': {'max': 2, 'min': 2, 'avg': 2, 'last': 2, 'last-5-avg': 2, 'last-10-avg': 2}, 'config/tf_session_args/inter_op_parallelism_threads': {'max': 2, 'min': 2, 'avg': 2, 'last': 2, 'last-5-avg': 2, 'last-10-avg': 2}, 'config/tf_session_args/log_device_placement': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/tf_session_args/allow_soft_placement': {'max': True, 'min': True, 'avg': True, 'last': True, 'last-5-avg': True, 'last-10-avg': True}, 'config/local_tf_session_args/intra_op_parallelism_threads': {'max': 8, 'min': 8, 'avg': 8, 'last': 8, 'last-5-avg': 8, 'last-10-avg': 8}, 'config/local_tf_session_args/inter_op_parallelism_threads': {'max': 8, 'min': 8, 'avg': 8, 'last': 8, 'last-5-avg': 8, 'last-10-avg': 8}, 'config/multiagent/policy_map_capacity': {'max': 100, 'min': 100, 'avg': 100, 'last': 100, 'last-5-avg': 100, 'last-10-avg': 100}, 'config/tf_session_args/gpu_options/allow_growth': {'max': True, 'min': True, 'avg': True, 'last': True, 'last-5-avg': True, 'last-10-avg': True}, 'config/tf_session_args/device_count/CPU': {'max': 1, 'min': 1, 'avg': 1, 'last': 1, 'last-5-avg': 1, 'last-10-avg': 1}, 'info/learner/default_policy/learner_stats/allreduce_latency': {'max': 0.0, 'min': 0.0, 'avg': 0.0, 'last': 0.0, 'last-5-avg': 0.0, 'last-10-avg': 0.0}, 'info/learner/default_policy/learner_stats/cur_kl_coeff': {'max': 0.19999999999999993, 'min': 0.19999999999999993, 'avg': 0.19999999999999993, 'last': 0.19999999999999993, 'last-5-avg': 0.19999999999999993, 'last-10-avg': 0.19999999999999993}, 'info/learner/default_policy/learner_stats/cur_lr': {'max': 5.0000000000000016e-05, 'min': 5.0000000000000016e-05, 'avg': 5.0000000000000016e-05, 'last': 5.0000000000000016e-05, 'last-5-avg': 5.0000000000000016e-05, 'last-10-avg': 5.0000000000000016e-05}, 'info/learner/default_policy/learner_stats/total_loss': {'max': 289.1911126413653, 'min': 289.1911126413653, 'avg': 289.1911126413653, 'last': 289.1911126413653, 'last-5-avg': 289.1911126413653, 'last-10-avg': 289.1911126413653}, 
'info/learner/default_policy/learner_stats/policy_loss': {'max': -0.014920547394262205, 'min': -0.014920547394262205, 'avg': -0.014920547394262205, 'last': -0.014920547394262205, 'last-5-avg': -0.014920547394262205, 'last-10-avg': -0.014920547394262205}, 'info/learner/default_policy/learner_stats/vf_loss': {'max': 289.2044420549947, 'min': 289.2044420549947, 'avg': 289.2044420549947, 'last': 289.2044420549947, 'last-5-avg': 289.2044420549947, 'last-10-avg': 289.2044420549947}, 'info/learner/default_policy/learner_stats/vf_explained_var': {'max': -3.1133813242758476e-05, 'min': -3.1133813242758476e-05, 'avg': -3.1133813242758476e-05, 'last': -3.1133813242758476e-05, 'last-5-avg': -3.1133813242758476e-05, 'last-10-avg': -3.1133813242758476e-05}, 'info/learner/default_policy/learner_stats/kl': {'max': 0.007963308195468869, 'min': 0.007963308195468869, 'avg': 0.007963308195468869, 'last': 0.007963308195468869, 'last-5-avg': 0.007963308195468869, 'last-10-avg': 0.007963308195468869}, 'info/learner/default_policy/learner_stats/entropy': {'max': 0.6850801167949554, 'min': 0.6850801167949554, 'avg': 0.6850801167949554, 'last': 0.6850801167949554, 'last-5-avg': 0.6850801167949554, 'last-10-avg': 0.6850801167949554}, 'info/learner/default_policy/learner_stats/entropy_coeff': {'max': 0.0, 'min': 0.0, 'avg': 0.0, 'last': 0.0, 'last-5-avg': 0.0, 'last-10-avg': 0.0}}

Python doesn't have a good way to recursively add up the sizes of all the nested objects. I did some breakdown and added things up. Overall, my calculation is that for each trial we are looking at around 200 KB from just these two dictionaries for this PPO training. This explains the memory accumulation, at least the part from trial.py.
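
For illustration, one way to approximate the recursive size of such a nested dict (a rough helper that assumes dicts, lists, tuples, sets, and deques are the only containers that matter here; not the exact breakdown script used above):

import sys
from collections import deque

def deep_sizeof(obj, seen=None):
    # Rough recursive sys.getsizeof: follows dicts and common containers,
    # skips objects already counted, and ignores C-level buffers.
    seen = seen if seen is not None else set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset, deque)):
        size += sum(deep_sizeof(item, seen) for item in obj)
    return size

# e.g. deep_sizeof(trial.metric_n_steps) + deep_sizeof(trial.metric_analysis)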

@gjoliver @krfricke Are these metrics all necessary? To me, things like config are pretty static; do we still want to keep them in metrics and compute moving averages on them?

Still no idea on why registry.py is accumulating memory at line 168.

gjoliver commented 3 years ago

In my opinion this is a problem with the test, not Tune. Tracking 1 MB of data for each Trial sounds perfectly reasonable to me. I wonder if we really need Tune here. Would running RLlib in a loop 10K times also serve the purpose?
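
A sketch of that alternative, assuming the Ray 1.x ray.rllib.agents.ppo.PPOTrainer API (the real test would reuse many_ppo's actual config and environment):

import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()
for i in range(10_000):
    trainer = PPOTrainer(env="CartPole-v0",
                         config={"num_workers": 1, "framework": "torch"})
    trainer.train()   # one iteration per "trial", mirroring many_ppo
    trainer.stop()    # release the RolloutWorker actors before the next loop
ray.shutdown()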

xwjiang2010 commented 3 years ago

@gjoliver do you know who owns this test? Tune side or RLlib? What is the goal of it?

gjoliver commented 3 years ago

@gjoliver do you know who owns this test? Tune side or RLlib? What is the goal of it?

I actually think this is probably owned by the core team for detecting leaks. Sang would know better.

rkooo567 commented 3 years ago

I think this is owned by the ML team (all long-running tests are owned by the ML team, but I will bring ownership of some tests over to core; this test wasn't one of the candidates). Note that this test has existed for a long time (it was there even before I joined), so you can probably ping Eric or Richard about the quality of the test.

I think what we want at the end of the day is to

  1. Understand whether there's a leak or expected memory accumulation
  2. Decide whether that is something we should fix or not

gjoliver commented 3 years ago

The test was originally introduced to detect leaked dead actors. 10K trials seemed like an arbitrary choice at the time.

Spending effort to cut per-trial memory usage from 1MB to 500KB feels a bit weird. In reality nobody would run nearly as many trials for a tuning session.

How about we run this with 2X the memory, or simply cut num_samples to 5000? This shouldn't cause any loss of utility for this test, and the memory consumption would work out better with the limits of our test instance.

krfricke commented 3 years ago

That's what we did - we increased memory and the test passed, so we consider it not release blocking.

We still opened a simple PR in #19583 to cut memory usage for this test. We've seen users starting up to 100k trials in one tuning session, and 10k doesn't seem to be that rare either.

xwjiang2010 commented 3 years ago

memory breakdown at the time of crash:

ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ip-172-31-35-26 is used (29.37 / 30.91 GB). The top 10 memory consumers are:

PID     MEM     COMMAND
160     6.83GiB /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/dash
1177    6.12GiB python workloads/many_ppo.py
144     5.95GiB /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server --redis_address=172.
139     3.04GiB /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *
134     0.57GiB /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *
765631  0.52GiB ray::PPO
765721  0.5GiB  ray::RolloutWorker
765726  0.5GiB  ray::RolloutWorker
765722  0.5GiB  ray::RolloutWorker
765734  0.5GiB  ray::RolloutWorker

@richardliaw fyi

rkooo567 commented 3 years ago

@xwjiang2010 what's the commit of this wheel, btw? I merged the dashboard memory optimization PR recently, and I wonder if the dashboard memory usage is still high after that PR. https://github.com/ray-project/ray/pull/19385

xwjiang2010 commented 3 years ago

@rkooo567 https://s3-us-west-2.amazonaws.com/ray-wheels/releases/1.8.0/afcb3e0a41268e08c14a435c239c52ae3ea0e681/ray-1.8.0-cp37-cp37m-manylinux2014_x86_64.whl I don't see your commit there. Do you want to cp yours to 1.8.0?

rkooo567 commented 3 years ago

Oh, that's in 1.8. Since we decided to mark it as a non-release blocker, I think it is fine not to cherry-pick this. Thanks for the clarification!

xwjiang2010 commented 3 years ago

Re @richardliaw: around 8000 out of 10K trials are finished.

richardliaw commented 3 years ago

OK, this is not a Tune issue; I think the main leakage is in the dashboard or GCS, and the app doesn't even take up half of the node's memory.

richardliaw commented 3 years ago

@rkooo567 so the dashboard leakage will not be release blocking?

richardliaw commented 3 years ago

The main question for me is why GCS/Redis are leaking so much.

scv119 commented 3 years ago

I think the dashboard memory leak should be fixed by https://github.com/ray-project/ray/pull/19385. As for GCS, I believe @rkooo567 is looking into it right now.

richardliaw commented 3 years ago

Nice! Hope that got captured in the 1.8 cut.

matthewdeng commented 3 years ago

This was not included in the 1.8.0 cut; do you want to cherry-pick it? @richardliaw @scv119

richardliaw commented 3 years ago

I actually think that's a good idea; @scv119 ?

rkooo567 commented 3 years ago

@richardliaw The dashboard leakage has been there since the beginning of Ray's history, so I am not sure if we should do that. I think the risks are not that high, though. cc @edoakes

rkooo567 commented 3 years ago

The main question for me is why GCS/Redis are leaking so much.

We will start investigating it soon.

My guess is that we don't remove actor entries from GCS at all, but we want to run memory profiling on GCS first.

rkooo567 commented 3 years ago

After this PR, the Redis memory usage has drastically decreased, and the test seems to pass. https://github.com/ray-project/ray/pull/19699

But it is pretty clear there are GCS server & dashboard memory leaks that are not fixed. They should be related to cleaning up actor states from memory, but we need more investigation. I think this is not a high-priority item to fix for now.