cc @matthewdeng @krfricke
This must be the root cause:
(pid=772620) 2021-10-10 17:05:27,811 ERROR worker.py:425 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=772620, ip=172.31.87.239)
(pid=772620) ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ip-172-31-87-239 is used (29.42 / 30.58 GB). The top 10 memory consumers are:
(pid=772620)
(pid=772620) PID MEM COMMAND
(pid=772620) 144 6.54GiB /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server --redis_address=172.
(pid=772620) 1105 6.41GiB python workloads/many_ppo.py
(pid=772620) 160 6.08GiB /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/dash
(pid=772620) 139 2.74GiB /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *
(pid=772620) 134 0.58GiB /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *
(pid=772620) 772556 0.52GiB ray::PPO.__init__()
(pid=772620) 772611 0.5GiB ray::IDLE
(pid=772620) 772614 0.5GiB ray::IDLE
(pid=772620) 772609 0.5GiB ray::IDLE
(pid=772620) 772610 0.5GiB ray::IDLE
Related: https://github.com/ray-project/ray/issues/18541#issuecomment-931470575
My guess is that the high memory usage causes some components to fail, which causes the autoscaler to crash (deadline exceeded can happen when the component is terminated).
cc @gjoliver I remember you guys mentioned there was a memory leak issue in RLlib. Do you think it is the same issue here?
We found a memory leak, which made things a lot better. But it's inside the TF2 eager compute graph, and was leaking a small amount every iteration (many_ppo only runs 1 iter / stack). The fix is already merged, so if this is still happening, it's probably not related to the memory leak from RLlib.
@xwjiang2010 looks like an RLlib memory leak issue? ^ If that's the case, feel free to assign to the relevant owner.
Hmmm, as @gjoliver mentioned, I believe the RLlib memory leak issue was already fixed by this PR. This weekly release test runs at master commit 635010d460ba266b56fc857e56af8272ae08df8c, which is after Sven's fix for the memory leak landed. Maybe there are some other memory leaks?
Hmm, in this case it is possible the issue is from Ray core. I can take a look next sprint if that's the case.
Reading the issue in #18541
1083 8.64GiB python workloads/many_ppo.py
155 5.71GiB /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/dash
141 4.46GiB /home/ray/anaconda3/lib/python3.8/site-packages/ray/core/src/ray/gcs/gcs_server --redis_address=172.
136 3.29GiB /home/ray/anaconda3/lib/python3.8/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *
131 0.77GiB /home/ray/anaconda3/lib/python3.8/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *
540673 0.51GiB ray::PPO
540789 0.48GiB ray::RolloutWorker
540784 0.48GiB ray::RolloutWorker
540792 0.48GiB ray::RolloutWorker
540793 0.48GiB ray::RolloutWorker
Looks like what's similar is that many_ppo.py and the dashboard had a lot of memory usage. @gjoliver did you have a chance to figure out the memory allocation of many_ppo.py? dashboard/dashboard.py looks pretty bad too. gcs_server also used a lot of memory; this looks abnormal to me @rkooo567. Also, to @xwjiang2010: is this a regression or a new test? Our best bet is to reproduce this with a memory profiler.
@krfricke Do you know if we have other long-running weekly tests at a similar scale, maybe to compare the memory consumption with? @sven1977 Do you have suggestions, as you have recently debugged (and found :) ) a tricky memory leak?
I took a look, and it didn't seem like a leak in RLlib code. What many_ppo does is basically run the same PyTorch stack 10K times, training only 1 step each time. @xwjiang2010, do you know if it's possible for Tune to hold onto all 10K trials and not release them?
That being said, @sven1977 has been using a secret tool really effectively to find the leak in our TF2 eager graph. I am sure he will be able to help take a look.
So, for the GCS server: given that it holds the data it stores to Redis in memory, the memory size probably makes sense (+1 GB compared to Redis). I'm not sure if the Redis memory usage makes sense, and I will do profiling for this.
Dashboard memory is a well-known issue.
many_ppo could be the core worker OR the application. Since Jun solved the RLlib issue again, I will try profiling as well, and I think it will be more obvious with the profiling result.
@scv119 will run the memory profiling, and he will let us know more details about the memory usage!
@gjoliver I wouldn't be surprised if that's the case. @scv119 Can you grab me when you do it? Would love to learn the flow.
I also captured a memory profile at some point last night: profile.pdf
How do you read it? 98% in jemalloc/internal/tsd.h?
Yeah, it's not very obvious what's leaking. What I can tell is that most of the allocated memory is from Python, not C++. Let me report once I have a large heap (like 5-6 GiB).
If the leak is from Python, we can just wrap the many_ppo script with a Python memory profiler. This might provide better insight?
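For example, a minimal sketch (assuming the third-party memory_profiler package is installed; run_many_ppo is a hypothetical stand-in for the workload's entry point):

```python
# Hypothetical wrapper: memory_profiler prints line-by-line driver memory
# for the decorated function when the script runs.
from memory_profiler import profile

@profile
def run_many_ppo():
    # ... body of workloads/many_ppo.py would go here ...
    pass

if __name__ == "__main__":
    run_many_ppo()
```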
@krfricke
```python
def update_last_result(self, result, terminate=False):
    if self.experiment_tag:
        result.update(experiment_tag=self.experiment_tag)
    self.set_location(Location(result.get(NODE_IP), result.get(PID)))
    self.last_result = result
    self.last_update_time = time.time()
    for metric, value in flatten_dict(result).items():
        if isinstance(value, Number):
            if metric not in self.metric_analysis:
                self.metric_analysis[metric] = {
                    "max": value,
                    "min": value,
                    "avg": value,
                    "last": value
                }
                self.metric_n_steps[metric] = {}
                for n in self.n_steps:
                    key = "last-{:d}-avg".format(n)
                    self.metric_analysis[metric][key] = value
                    # Store n as string for correct restore.
                    self.metric_n_steps[metric][str(n)] = deque(
                        [value], maxlen=n)
```
Is this line, `self.metric_n_steps[metric][str(n)] = deque([value], maxlen=n)`, accumulating over time?
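To make the accumulation mechanics concrete, here is a toy sketch (not Tune code; the metric count is an assumption based on the flattened PPO result dict) showing why bounded deques can still add up across trials:

```python
# Each deque is bounded (maxlen=n), so a single trial's metric_n_steps cannot
# grow without bound. But Tune keeps every Trial object alive for the whole
# session, so the per-trial dictionaries accumulate.
import tracemalloc
from collections import deque

N_STEPS = (5, 10)   # mirrors Trial.n_steps
N_TRIALS = 1_000    # scale by 10 for many_ppo's 10K trials
N_METRICS = 150     # roughly the size of the flattened PPO result dict

tracemalloc.start()
trials = []
for _ in range(N_TRIALS):
    metric_n_steps = {
        f"metric_{i}": {str(n): deque([0.0], maxlen=n) for n in N_STEPS}
        for i in range(N_METRICS)
    }
    trials.append(metric_n_steps)  # the trial list keeps these alive

current, _peak = tracemalloc.get_traced_memory()
print(f"~{current / 1e6:.0f} MB retained for {N_TRIALS} trials")
```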
@scv119 It also seems like GCS server memory usage is at first around ~100 MB.
I believe this workload doesn't "scale up"; it just repeats the same workload (cc @gjoliver can you confirm?).
That suggests the GCS server also has some sort of leak.
Another profile I got after running for 10+ hours. Looks like mostly just Python objects? profile.pdf
+1 on @scv119's take; I think it is an application leak. I am trying to run the test with https://docs.python.org/3/library/tracemalloc.html#compute-differences to see if it catches some memory issue (not sure if this is the best way to do Python memory profiling).
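The recipe boils down to something like this sketch:

```python
# Take a tracemalloc snapshot before and after a batch of trials and print
# the biggest per-line allocation deltas.
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# ... run a batch of trials here ...

after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)
```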
This is the first run, with a mistake (I took the snapshot at the end of the test), after a 1h 30m run (1000 trials): https://beta.anyscale.com/o/anyscale-internal/projects/prj_3qqS82y6R2UUTWG4oHeMpF/clusters/ses_1qyvtEPcT2MBeghFwRMptAz2?command-history-section=command_history
/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial.py:679: size=202 MiB, count=1013467, average=209 B
/home/ray/anaconda3/lib/python3.7/json/encoder.py:202: size=123 MiB, count=1000, average=126 KiB
/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/registry.py:168: size=90.6 MiB, count=919105, average=103 B
<frozen importlib._bootstrap_external>:525: size=90.1 MiB, count=733708, average=129 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial.py:676: size=39.5 MiB, count=140000, average=296 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/serialization.py:179: size=30.5 MiB, count=432186, average=74 B
/home/ray/anaconda3/lib/python3.7/abc.py:126: size=20.8 MiB, count=97324, average=224 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/serialization.py:41: size=20.2 MiB, count=206284, average=103 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial.py:675: size=15.9 MiB, count=280000, average=60 B
/home/ray/anaconda3/lib/python3.7/json/decoder.py:353: size=14.6 MiB, count=165678, average=92 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial.py:673: size=12.4 MiB, count=105636, average=123 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py:523: size=11.2 MiB, count=109515, average=107 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/ml_utils/dict.py:105: size=9260 KiB, count=121000, average=78 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/serialization.py:159: size=8778 KiB, count=91699, average=98 B
/home/ray/anaconda3/lib/python3.7/copy.py:275: size=6775 KiB, count=83188, average=83 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py:444: size=6708 KiB, count=47699, average=144 B
<frozen importlib._bootstrap>:219: size=5278 KiB, count=48427, average=112 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/serialization.py:19: size=4690 KiB, count=46017, average=104 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial.py:671: size=4524 KiB, count=1001, average=4627 B
/home/ray/anaconda3/lib/python3.7/copy.py:238: size=4221 KiB, count=27173, average=159 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/signature.py:71: size=3746 KiB, count=55364, average=69 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py:439: size=3590 KiB, count=46817, average=79 B
/home/ray/anaconda3/lib/python3.7/types.py:70: size=2794 KiB, count=15097, average=190 B
/home/ray/anaconda3/lib/python3.7/functools.py:60: size=2639 KiB, count=25716, average=105 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py:574: size=2129 KiB, count=30730, average=71 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/debug.py:24: size=2048 KiB, count=1, average=2048 KiB
/home/ray/anaconda3/lib/python3.7/inspect.py:2511: size=1962 KiB, count=16480, average=122 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/actor.py:217: size=1937 KiB, count=2569, average=772 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py:525: size=1837 KiB, count=857, average=2195 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/actor.py:244: size=1836 KiB, count=855, average=2199 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/actor.py:238: size=1836 KiB, count=855, average=2199 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle.py:707: size=1785 KiB, count=28559, average=64 B
<frozen importlib._bootstrap_external>:59: size=1605 KiB, count=10991, average=150 B
/home/ray/anaconda3/lib/python3.7/site-packages/yaml/nodes.py:29: size=1527 KiB, count=19545, average=80 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py:472: size=1519 KiB, count=2700, average=576 B
/home/ray/anaconda3/lib/python3.7/inspect.py:2154: size=1380 KiB, count=17509, average=81 B
/home/ray/anaconda3/lib/python3.7/inspect.py:2859: size=1379 KiB, count=16101, average=88 B
/home/ray/anaconda3/lib/python3.7/types.py:107: size=1290 KiB, count=16291, average=81 B
/home/ray/anaconda3/lib/python3.7/site-packages/torch/__init__.py:537: size=1282 KiB, count=6681, average=197 B
/home/ray/anaconda3/lib/python3.7/inspect.py:2797: size=1271 KiB, count=30819, average=42 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py:450: size=1239 KiB, count=3341, average=380 B
/home/ray/anaconda3/lib/python3.7/weakref.py:409: size=1204 KiB, count=10605, average=116 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/actor.py:378: size=1173 KiB, count=4812, average=250 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/actor.py:1041: size=1156 KiB, count=4240, average=279 B
/home/ray/anaconda3/lib/python3.7/inspect.py:2760: size=1113 KiB, count=7907, average=144 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/signature.py:77: size=1045 KiB, count=47699, average=22 B
/home/ray/anaconda3/lib/python3.7/collections/__init__.py:397: size=1021 KiB, count=9762, average=107 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py:666: size=1015 KiB, count=16240, average=64 B
/home/ray/anaconda3/lib/python3.7/linecache.py:137: size=994 KiB, count=9761, average=104 B
/home/ray/anaconda3/lib/python3.7/weakref.py:337: size=986 KiB, count=10522, average=96 B
/home/ray/anaconda3/lib/python3.7/collections/__init__.py:466: size=928 KiB, count=3781, average=251 B
/home/ray/anaconda3/lib/python3.7/abc.py:127: size=927 KiB, count=13275, average=71 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py:559: size=883 KiB, count=6273, average=144 B
/home/ray/anaconda3/lib/python3.7/site-packages/yaml/constructor.py:411: size=812 KiB, count=9861, average=84 B
/home/ray/anaconda3/lib/python3.7/types.py:74: size=769 KiB, count=10933, average=72 B
<string>:1: size=755 KiB, count=6891, average=112 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py:580: size=736 KiB, count=7840, average=96 B
/home/ray/anaconda3/lib/python3.7/site-packages/yaml/error.py:7: size=714 KiB, count=9136, average=80 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial.py:254: size=663 KiB, count=999, average=680 B
/home/ray/anaconda3/lib/python3.7/site-packages/yaml/events.py:67: size=625 KiB, count=8001, average=80 B
<frozen importlib._bootstrap_external>:916: size=611 KiB, count=5721, average=109 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py:517: size=607 KiB, count=7773, average=80 B
/home/ray/anaconda3/lib/python3.7/site-packages/tensorflow/python/util/tf_export.py:346: size=607 KiB, count=3962, average=157 B
/home/ray/anaconda3/lib/python3.7/inspect.py:2161: size=596 KiB, count=7578, average=81 B
/home/ray/anaconda3/lib/python3.7/inspect.py:2800: size=561 KiB, count=7468, average=77 B
/home/ray/anaconda3/lib/python3.7/site-packages/scipy/_lib/doccer.py:66: size=559 KiB, count=187, average=3063 B
/home/ray/anaconda3/lib/python3.7/site-packages/jax/_src/numpy/util.py:150: size=552 KiB, count=504, average=1121 B
/home/ray/anaconda3/lib/python3.7/site-packages/yaml/tokens.py:4: size=546 KiB, count=6995, average=80 B
<frozen importlib._bootstrap_external>:606: size=541 KiB, count=7833, average=71 B
<frozen importlib._bootstrap>:371: size=537 KiB, count=6873, average=80 B
/home/ray/anaconda3/lib/python3.7/types.py:120: size=514 KiB, count=3258, average=162 B
/home/ray/anaconda3/lib/python3.7/typing.py:708: size=504 KiB, count=2526, average=204 B
/home/ray/anaconda3/lib/python3.7/site-packages/ray/state.py:217: size=497 KiB, count=21174, average=24 B
/home/ray/anaconda3/lib/python3.7/inspect.py:2856: size=492 KiB, count=7873, average=64 B
/home/ray/anaconda3/lib/python3.7/inspect.py:2802: size=486 KiB, count=8891, average=56 B
<frozen importlib._bootstrap_external>:1416: size=485 KiB, count=907, average=547 B
<frozen importlib._bootstrap_external>:1408: size=451 KiB, count=7133, average=65 B
/home/ray/anaconda3/lib/python3.7/multiprocessing/queues.py:89: size=433 KiB, count=839, average=528 B
<frozen importlib._bootstrap>:36: size=424 KiB, count=4955, average=88 B
/home/ray/anaconda3/lib/python3.7/site-packages/botocore/client.py:369: size=420 KiB, count=1454, average=296 B
/home/ray/anaconda3/lib/python3.7/site-packages/google/protobuf/message.py:385: size=409 KiB, count=5232, average=80 B
/home/ray/anaconda3/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py:701: size=406 KiB, count=4211, average=99 B
/home/ray/anaconda3/lib/python3.7/threading.py:348: size=400 KiB, count=776, average=528 B
/home/ray/anaconda3/lib/python3.7/site-packages/tensorboardX/record_writer.py:61: size=361 KiB, count=3919, average=94 B
/home/ray/anaconda3/lib/python3.7/site-packages/tensorflow/python/util/decorator_utils.py:114: size=359 KiB, count=330, average=1114 B
/home/ray/anaconda3/lib/python3.7/site-packages/cryptography/hazmat/bindings/openssl/binding.py:102: size=341 KiB, count=2524, average=138 B
/home/ray/anaconda3/lib/python3.7/site-packages/yaml/events.py:24: size=332 KiB, count=4252, average=80 B
/home/ray/anaconda3/lib/python3.7/posixpath.py:92: size=326 KiB, count=3056, average=109 B
/home/ray/anaconda3/lib/python3.7/weakref.py:433: size=315 KiB, count=4003, average=81 B
<frozen importlib._bootstrap_external>:1352: size=314 KiB, count=5018, average=64 B
<frozen importlib._bootstrap>:971: size=308 KiB, count=1606, average=196 B
/home/ray/anaconda3/lib/python3.7/site-packages/botocore/docs/docstring.py:40: size=306 KiB, count=4861, average=64 B
/home/ray/anaconda3/lib/python3.7/site-packages/anyscale/sdk/anyscale_client/models/cluster_computes_query.py:21: size=290 KiB, count=13, average=22.3 KiB
<frozen importlib._bootstrap>:316: size=288 KiB, count=1, average=288 KiB
/home/ray/anaconda3/lib/python3.7/weakref.py:285: size=288 KiB, count=1, average=288 KiB
/home/ray/anaconda3/lib/python3.7/dataclasses.py:387: size=288 KiB, count=3194, average=92 B
<frozen importlib._bootstrap>:420: size=284 KiB, count=4124, average=71 B
/home/ray/anaconda3/lib/python3.7/site-packages/botocore/hooks.py:581: size=283 KiB, count=2396, average=121 B
<frozen importlib._bootstrap_external>:800: size=277 KiB, count=3544, average=80 B
<frozen importlib._bootstrap_external>:887: size=275 KiB, count=5940, average=47 B
I am running another round that compares snapshots from before and after the run (1000 trials, which take about 1+ hour): https://beta.anyscale.com/o/anyscale-internal/projects/prj_3qqS82y6R2UUTWG4oHeMpF/clusters/ses_dsZtr3J6LxNNjEvGGeLHSChg
(Actually, the best way might be to take the snapshot in the middle of the run, not at the end, e.g., doing the comparison within the callback?) Maybe @krfricke or @gjoliver can help here?
I am running the test with tracemalloc snapshot comparison within the Tune callback (in the command output, you can search for "Top 30"): https://beta.anyscale.com/o/anyscale-internal/projects/prj_3qqS82y6R2UUTWG4oHeMpF/clusters/ses_xkSXQ7QaTGdStETQLjZ4Ge7b?command-history-section=command_history
(But I am not sure if this captures the memory usage correctly, because the driver memory usage seems to be 3+ GB while the top memory usage reported from Tune is only 200 MB.)
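For reference, a sketch of what such a callback could look like (assuming the Ray 1.x Tune Callback API; the class name and reporting interval are made up):

```python
import tracemalloc
from ray.tune.callback import Callback

class TracemallocCallback(Callback):
    """Print the top-N allocation deltas every `every_n` completed trials."""

    def __init__(self, every_n=100, top_n=30):
        tracemalloc.start()
        self._prev = tracemalloc.take_snapshot()
        self._every_n = every_n
        self._top_n = top_n
        self._completed = 0

    def on_trial_complete(self, iteration, trials, trial, **info):
        self._completed += 1
        if self._completed % self._every_n != 0:
            return
        snap = tracemalloc.take_snapshot()
        print(f"Top {self._top_n} diffs after {self._completed} trials:")
        for stat in snap.compare_to(self._prev, "lineno")[:self._top_n]:
            print(stat)
        self._prev = snap
```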
Some updates:
After running ~1300 or so trials, the top ~17 offenders accumulated ~720 MB, which translates to ~5.5 GB if we scale to 10000 trials (720 MB × 10000/1300 ≈ 5.5 GB). This is roughly consistent with @rkooo567's result (~6 GB consumed by the driver process at the time of crash).
trial.py: the trial.py offenders are purely in update_last_result. @krfricke, if I understand it correctly, this is the data structure that maintains per-trial results. And we keep a list of all trials in memory, so some accumulation of memory is expected here. But a few hundred KB per trial sounds a little high to me. wdyt?
registry.py: I am not expecting registry.py to accumulate memory in proportion to the number of trials. The data structure it maintains should be pretty static, in my understanding. The line the tracer points at is pickle.loads(value). Could pickle leak anything?
The many_ppo test passed on a 2X machine last night: https://buildkite.com/ray-project/periodic-ci/builds/1265
For every trial, we maintain, most notably, two dictionaries: trial.metric_n_steps and trial.metric_analysis.
These are the contents from the PPO run:
{'episode_reward_max': {'5': deque([87.0], maxlen=5), '10': deque([87.0], maxlen=10)}, 'episode_reward_min': {'5': deque([8.0], maxlen=5), '10': deque([8.0], maxlen=10)}, 'episode_reward_mean': {'5': deque([23.108187134502923], maxlen=5), '10': deque([23.108187134502923], maxlen=10)}, 'episode_len_mean': {'5': deque([23.108187134502923], maxlen=5), '10': deque([23.108187134502923], maxlen=10)}, 'episodes_this_iter': {'5': deque([342], maxlen=5), '10': deque([342], maxlen=10)}, 'num_healthy_workers': {'5': deque([7], maxlen=5), '10': deque([7], maxlen=10)}, 'timesteps_total': {'5': deque([7994], maxlen=5), '10': deque([7994], maxlen=10)}, 'timesteps_this_iter': {'5': deque([0], maxlen=5), '10': deque([0], maxlen=10)}, 'agent_timesteps_total': {'5': deque([7994], maxlen=5), '10': deque([7994], maxlen=10)}, 'done': {'5': deque([True], maxlen=5), '10': deque([True], maxlen=10)}, 'episodes_total': {'5': deque([342], maxlen=5), '10': deque([342], maxlen=10)}, 'training_iteration': {'5': deque([1], maxlen=5), '10': deque([1], maxlen=10)}, 'timestamp': {'5': deque([1634746872], maxlen=5), '10': deque([1634746872], maxlen=10)}, 'time_this_iter_s': {'5': deque([2.7724995613098145], maxlen=5), '10': deque([2.7724995613098145], maxlen=10)}, 'time_total_s': {'5': deque([2.7724995613098145], maxlen=5), '10': deque([2.7724995613098145], maxlen=10)}, 'pid': {'5': deque([399], maxlen=5), '10': deque([399], maxlen=10)}, 'time_since_restore': {'5': deque([2.7724995613098145], maxlen=5), '10': deque([2.7724995613098145], maxlen=10)}, 'timesteps_since_restore': {'5': deque([0], maxlen=5), '10': deque([0], maxlen=10)}, 'iterations_since_restore': {'5': deque([1], maxlen=5), '10': deque([1], maxlen=10)}, 'sampler_perf/mean_raw_obs_processing_ms': {'5': deque([0.14265826358197845], maxlen=5), '10': deque([0.14265826358197845], maxlen=10)}, 'sampler_perf/mean_inference_ms': {'5': deque([1.5106409359515482], maxlen=5), '10': deque([1.5106409359515482], maxlen=10)}, 'sampler_perf/mean_action_processing_ms': {'5': deque([0.06872722420235718], maxlen=5), '10': deque([0.06872722420235718], maxlen=10)}, 'sampler_perf/mean_env_wait_ms': {'5': deque([0.08027102820580613], maxlen=5), '10': deque([0.08027102820580613], maxlen=10)}, 'sampler_perf/mean_env_render_ms': {'5': deque([0.0], maxlen=5), '10': deque([0.0], maxlen=10)}, 'timers/sample_time_ms': {'5': deque([2407.739], maxlen=5), '10': deque([2407.739], maxlen=10)}, 'timers/sample_throughput': {'5': deque([3320.128], maxlen=5), '10': deque([3320.128], maxlen=10)}, 'timers/load_time_ms': {'5': deque([0.453], maxlen=5), '10': deque([0.453], maxlen=10)}, 'timers/load_throughput': {'5': deque([17637699.198], maxlen=5), '10': deque([17637699.198], maxlen=10)}, 'timers/learn_time_ms': {'5': deque([362.371], maxlen=5), '10': deque([362.371], maxlen=10)}, 'timers/learn_throughput': {'5': deque([22060.237], maxlen=5), '10': deque([22060.237], maxlen=10)}, 'timers/update_time_ms': {'5': deque([1.931], maxlen=5), '10': deque([1.931], maxlen=10)}, 'info/num_steps_sampled': {'5': deque([7994], maxlen=5), '10': deque([7994], maxlen=10)}, 'info/num_agent_steps_sampled': {'5': deque([7994], maxlen=5), '10': deque([7994], maxlen=10)}, 'info/num_steps_trained': {'5': deque([7994], maxlen=5), '10': deque([7994], maxlen=10)}, 'info/num_agent_steps_trained': {'5': deque([7994], maxlen=5), '10': deque([7994], maxlen=10)}, 'config/num_workers': {'5': deque([7], maxlen=5), '10': deque([7], maxlen=10)}, 'config/num_envs_per_worker': {'5': deque([1], maxlen=5), '10': deque([1], maxlen=10)}, 
'config/create_env_on_driver': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/rollout_fragment_length': {'5': deque([571], maxlen=5), '10': deque([571], maxlen=10)}, 'config/gamma': {'5': deque([0.99], maxlen=5), '10': deque([0.99], maxlen=10)}, 'config/lr': {'5': deque([5e-05], maxlen=5), '10': deque([5e-05], maxlen=10)}, 'config/train_batch_size': {'5': deque([4000], maxlen=5), '10': deque([4000], maxlen=10)}, 'config/soft_horizon': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/no_done_at_end': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/remote_worker_envs': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/remote_env_batch_wait_ms': {'5': deque([0], maxlen=5), '10': deque([0], maxlen=10)}, 'config/render_env': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/record_env': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/normalize_actions': {'5': deque([True], maxlen=5), '10': deque([True], maxlen=10)}, 'config/clip_actions': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/ignore_worker_failures': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/log_sys_usage': {'5': deque([True], maxlen=5), '10': deque([True], maxlen=10)}, 'config/fake_sampler': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/eager_tracing': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/explore': {'5': deque([True], maxlen=5), '10': deque([True], maxlen=10)}, 'config/evaluation_num_episodes': {'5': deque([10], maxlen=5), '10': deque([10], maxlen=10)}, 'config/evaluation_parallel_to_training': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/in_evaluation': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/evaluation_num_workers': {'5': deque([0], maxlen=5), '10': deque([0], maxlen=10)}, 'config/sample_async': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/synchronize_filters': {'5': deque([True], maxlen=5), '10': deque([True], maxlen=10)}, 'config/compress_observations': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/collect_metrics_timeout': {'5': deque([180], maxlen=5), '10': deque([180], maxlen=10)}, 'config/metrics_smoothing_episodes': {'5': deque([100], maxlen=5), '10': deque([100], maxlen=10)}, 'config/min_iter_time_s': {'5': deque([0], maxlen=5), '10': deque([0], maxlen=10)}, 'config/timesteps_per_iteration': {'5': deque([0], maxlen=5), '10': deque([0], maxlen=10)}, 'config/num_gpus': {'5': deque([0], maxlen=5), '10': deque([0], maxlen=10)}, 'config/_fake_gpus': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/num_cpus_per_worker': {'5': deque([1], maxlen=5), '10': deque([1], maxlen=10)}, 'config/num_gpus_per_worker': {'5': deque([0], maxlen=5), '10': deque([0], maxlen=10)}, 'config/num_cpus_for_driver': {'5': deque([1], maxlen=5), '10': deque([1], maxlen=10)}, 'config/actions_in_input_normalized': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/postprocess_inputs': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/shuffle_buffer_size': {'5': deque([0], maxlen=5), '10': deque([0], maxlen=10)}, 'config/output_max_file_size': {'5': deque([67108864], maxlen=5), '10': deque([67108864], maxlen=10)}, 'config/_tf_policy_handles_more_than_one_loss': {'5': deque([False], maxlen=5), '10': 
deque([False], maxlen=10)}, 'config/_disable_preprocessor_api': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/simple_optimizer': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/monitor': {'5': deque([-1], maxlen=5), '10': deque([-1], maxlen=10)}, 'config/use_critic': {'5': deque([True], maxlen=5), '10': deque([True], maxlen=10)}, 'config/use_gae': {'5': deque([True], maxlen=5), '10': deque([True], maxlen=10)}, 'config/lambda': {'5': deque([1.0], maxlen=5), '10': deque([1.0], maxlen=10)}, 'config/kl_coeff': {'5': deque([0.2], maxlen=5), '10': deque([0.2], maxlen=10)}, 'config/sgd_minibatch_size': {'5': deque([128], maxlen=5), '10': deque([128], maxlen=10)}, 'config/shuffle_sequences': {'5': deque([True], maxlen=5), '10': deque([True], maxlen=10)}, 'config/num_sgd_iter': {'5': deque([1], maxlen=5), '10': deque([1], maxlen=10)}, 'config/vf_loss_coeff': {'5': deque([1.0], maxlen=5), '10': deque([1.0], maxlen=10)}, 'config/entropy_coeff': {'5': deque([0.0], maxlen=5), '10': deque([0.0], maxlen=10)}, 'config/clip_param': {'5': deque([0.3], maxlen=5), '10': deque([0.3], maxlen=10)}, 'config/vf_clip_param': {'5': deque([10.0], maxlen=5), '10': deque([10.0], maxlen=10)}, 'config/kl_target': {'5': deque([0.01], maxlen=5), '10': deque([0.01], maxlen=10)}, 'config/vf_share_layers': {'5': deque([-1], maxlen=5), '10': deque([-1], maxlen=10)}, 'perf/cpu_util_percent': {'5': deque([74.125], maxlen=5), '10': deque([74.125], maxlen=10)}, 'perf/ram_util_percent': {'5': deque([17.4], maxlen=5), '10': deque([17.4], maxlen=10)}, 'config/model/_use_default_native_models': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/model/_disable_preprocessor_api': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/model/free_log_std': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/model/no_final_linear': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/model/vf_share_layers': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/model/use_lstm': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/model/max_seq_len': {'5': deque([20], maxlen=5), '10': deque([20], maxlen=10)}, 'config/model/lstm_cell_size': {'5': deque([256], maxlen=5), '10': deque([256], maxlen=10)}, 'config/model/lstm_use_prev_action': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/model/lstm_use_prev_reward': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/model/_time_major': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/model/use_attention': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/model/attention_num_transformer_units': {'5': deque([1], maxlen=5), '10': deque([1], maxlen=10)}, 'config/model/attention_dim': {'5': deque([64], maxlen=5), '10': deque([64], maxlen=10)}, 'config/model/attention_num_heads': {'5': deque([1], maxlen=5), '10': deque([1], maxlen=10)}, 'config/model/attention_head_dim': {'5': deque([32], maxlen=5), '10': deque([32], maxlen=10)}, 'config/model/attention_memory_inference': {'5': deque([50], maxlen=5), '10': deque([50], maxlen=10)}, 'config/model/attention_memory_training': {'5': deque([50], maxlen=5), '10': deque([50], maxlen=10)}, 'config/model/attention_position_wise_mlp_dim': {'5': deque([32], maxlen=5), '10': deque([32], maxlen=10)}, 'config/model/attention_init_gru_gate_bias': {'5': deque([2.0], maxlen=5), '10': deque([2.0], 
maxlen=10)}, 'config/model/attention_use_n_prev_actions': {'5': deque([0], maxlen=5), '10': deque([0], maxlen=10)}, 'config/model/attention_use_n_prev_rewards': {'5': deque([0], maxlen=5), '10': deque([0], maxlen=10)}, 'config/model/framestack': {'5': deque([True], maxlen=5), '10': deque([True], maxlen=10)}, 'config/model/dim': {'5': deque([84], maxlen=5), '10': deque([84], maxlen=10)}, 'config/model/grayscale': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/model/zero_mean': {'5': deque([True], maxlen=5), '10': deque([True], maxlen=10)}, 'config/model/lstm_use_prev_action_reward': {'5': deque([-1], maxlen=5), '10': deque([-1], maxlen=10)}, 'config/tf_session_args/intra_op_parallelism_threads': {'5': deque([2], maxlen=5), '10': deque([2], maxlen=10)}, 'config/tf_session_args/inter_op_parallelism_threads': {'5': deque([2], maxlen=5), '10': deque([2], maxlen=10)}, 'config/tf_session_args/log_device_placement': {'5': deque([False], maxlen=5), '10': deque([False], maxlen=10)}, 'config/tf_session_args/allow_soft_placement': {'5': deque([True], maxlen=5), '10': deque([True], maxlen=10)}, 'config/local_tf_session_args/intra_op_parallelism_threads': {'5': deque([8], maxlen=5), '10': deque([8], maxlen=10)}, 'config/local_tf_session_args/inter_op_parallelism_threads': {'5': deque([8], maxlen=5), '10': deque([8], maxlen=10)}, 'config/multiagent/policy_map_capacity': {'5': deque([100], maxlen=5), '10': deque([100], maxlen=10)}, 'config/tf_session_args/gpu_options/allow_growth': {'5': deque([True], maxlen=5), '10': deque([True], maxlen=10)}, 'config/tf_session_args/device_count/CPU': {'5': deque([1], maxlen=5), '10': deque([1], maxlen=10)}, 'info/learner/default_policy/learner_stats/allreduce_latency': {'5': deque([0.0], maxlen=5), '10': deque([0.0], maxlen=10)}, 'info/learner/default_policy/learner_stats/cur_kl_coeff': {'5': deque([0.19999999999999993], maxlen=5), '10': deque([0.19999999999999993], maxlen=10)}, 'info/learner/default_policy/learner_stats/cur_lr': {'5': deque([5.0000000000000016e-05], maxlen=5), '10': deque([5.0000000000000016e-05], maxlen=10)}, 'info/learner/default_policy/learner_stats/total_loss': {'5': deque([289.1911126413653], maxlen=5), '10': deque([289.1911126413653], maxlen=10)}, 'info/learner/default_policy/learner_stats/policy_loss': {'5': deque([-0.014920547394262205], maxlen=5), '10': deque([-0.014920547394262205], maxlen=10)}, 'info/learner/default_policy/learner_stats/vf_loss': {'5': deque([289.2044420549947], maxlen=5), '10': deque([289.2044420549947], maxlen=10)}, 'info/learner/default_policy/learner_stats/vf_explained_var': {'5': deque([-3.1133813242758476e-05], maxlen=5), '10': deque([-3.1133813242758476e-05], maxlen=10)}, 'info/learner/default_policy/learner_stats/kl': {'5': deque([0.007963308195468869], maxlen=5), '10': deque([0.007963308195468869], maxlen=10)}, 'info/learner/default_policy/learner_stats/entropy': {'5': deque([0.6850801167949554], maxlen=5), '10': deque([0.6850801167949554], maxlen=10)}, 'info/learner/default_policy/learner_stats/entropy_coeff': {'5': deque([0.0], maxlen=5), '10': deque([0.0], maxlen=10)}}
and
{'episode_reward_max': {'max': 87.0, 'min': 87.0, 'avg': 87.0, 'last': 87.0, 'last-5-avg': 87.0, 'last-10-avg': 87.0}, 'episode_reward_min': {'max': 8.0, 'min': 8.0, 'avg': 8.0, 'last': 8.0, 'last-5-avg': 8.0, 'last-10-avg': 8.0}, 'episode_reward_mean': {'max': 23.108187134502923, 'min': 23.108187134502923, 'avg': 23.108187134502923, 'last': 23.108187134502923, 'last-5-avg': 23.108187134502923, 'last-10-avg': 23.108187134502923}, 'episode_len_mean': {'max': 23.108187134502923, 'min': 23.108187134502923, 'avg': 23.108187134502923, 'last': 23.108187134502923, 'last-5-avg': 23.108187134502923, 'last-10-avg': 23.108187134502923}, 'episodes_this_iter': {'max': 342, 'min': 342, 'avg': 342, 'last': 342, 'last-5-avg': 342, 'last-10-avg': 342}, 'num_healthy_workers': {'max': 7, 'min': 7, 'avg': 7, 'last': 7, 'last-5-avg': 7, 'last-10-avg': 7}, 'timesteps_total': {'max': 7994, 'min': 7994, 'avg': 7994, 'last': 7994, 'last-5-avg': 7994, 'last-10-avg': 7994}, 'timesteps_this_iter': {'max': 0, 'min': 0, 'avg': 0, 'last': 0, 'last-5-avg': 0, 'last-10-avg': 0}, 'agent_timesteps_total': {'max': 7994, 'min': 7994, 'avg': 7994, 'last': 7994, 'last-5-avg': 7994, 'last-10-avg': 7994}, 'done': {'max': True, 'min': True, 'avg': True, 'last': True, 'last-5-avg': True, 'last-10-avg': True}, 'episodes_total': {'max': 342, 'min': 342, 'avg': 342, 'last': 342, 'last-5-avg': 342, 'last-10-avg': 342}, 'training_iteration': {'max': 1, 'min': 1, 'avg': 1, 'last': 1, 'last-5-avg': 1, 'last-10-avg': 1}, 'timestamp': {'max': 1634746872, 'min': 1634746872, 'avg': 1634746872, 'last': 1634746872, 'last-5-avg': 1634746872, 'last-10-avg': 1634746872}, 'time_this_iter_s': {'max': 2.7724995613098145, 'min': 2.7724995613098145, 'avg': 2.7724995613098145, 'last': 2.7724995613098145, 'last-5-avg': 2.7724995613098145, 'last-10-avg': 2.7724995613098145}, 'time_total_s': {'max': 2.7724995613098145, 'min': 2.7724995613098145, 'avg': 2.7724995613098145, 'last': 2.7724995613098145, 'last-5-avg': 2.7724995613098145, 'last-10-avg': 2.7724995613098145}, 'pid': {'max': 399, 'min': 399, 'avg': 399, 'last': 399, 'last-5-avg': 399, 'last-10-avg': 399}, 'time_since_restore': {'max': 2.7724995613098145, 'min': 2.7724995613098145, 'avg': 2.7724995613098145, 'last': 2.7724995613098145, 'last-5-avg': 2.7724995613098145, 'last-10-avg': 2.7724995613098145}, 'timesteps_since_restore': {'max': 0, 'min': 0, 'avg': 0, 'last': 0, 'last-5-avg': 0, 'last-10-avg': 0}, 'iterations_since_restore': {'max': 1, 'min': 1, 'avg': 1, 'last': 1, 'last-5-avg': 1, 'last-10-avg': 1}, 'sampler_perf/mean_raw_obs_processing_ms': {'max': 0.14265826358197845, 'min': 0.14265826358197845, 'avg': 0.14265826358197845, 'last': 0.14265826358197845, 'last-5-avg': 0.14265826358197845, 'last-10-avg': 0.14265826358197845}, 'sampler_perf/mean_inference_ms': {'max': 1.5106409359515482, 'min': 1.5106409359515482, 'avg': 1.5106409359515482, 'last': 1.5106409359515482, 'last-5-avg': 1.5106409359515482, 'last-10-avg': 1.5106409359515482}, 'sampler_perf/mean_action_processing_ms': {'max': 0.06872722420235718, 'min': 0.06872722420235718, 'avg': 0.06872722420235718, 'last': 0.06872722420235718, 'last-5-avg': 0.06872722420235718, 'last-10-avg': 0.06872722420235718}, 'sampler_perf/mean_env_wait_ms': {'max': 0.08027102820580613, 'min': 0.08027102820580613, 'avg': 0.08027102820580613, 'last': 0.08027102820580613, 'last-5-avg': 0.08027102820580613, 'last-10-avg': 0.08027102820580613}, 'sampler_perf/mean_env_render_ms': {'max': 0.0, 'min': 0.0, 'avg': 0.0, 'last': 0.0, 'last-5-avg': 0.0, 
'last-10-avg': 0.0}, 'timers/sample_time_ms': {'max': 2407.739, 'min': 2407.739, 'avg': 2407.739, 'last': 2407.739, 'last-5-avg': 2407.739, 'last-10-avg': 2407.739}, 'timers/sample_throughput': {'max': 3320.128, 'min': 3320.128, 'avg': 3320.128, 'last': 3320.128, 'last-5-avg': 3320.128, 'last-10-avg': 3320.128}, 'timers/load_time_ms': {'max': 0.453, 'min': 0.453, 'avg': 0.453, 'last': 0.453, 'last-5-avg': 0.453, 'last-10-avg': 0.453}, 'timers/load_throughput': {'max': 17637699.198, 'min': 17637699.198, 'avg': 17637699.198, 'last': 17637699.198, 'last-5-avg': 17637699.198, 'last-10-avg': 17637699.198}, 'timers/learn_time_ms': {'max': 362.371, 'min': 362.371, 'avg': 362.371, 'last': 362.371, 'last-5-avg': 362.371, 'last-10-avg': 362.371}, 'timers/learn_throughput': {'max': 22060.237, 'min': 22060.237, 'avg': 22060.237, 'last': 22060.237, 'last-5-avg': 22060.237, 'last-10-avg': 22060.237}, 'timers/update_time_ms': {'max': 1.931, 'min': 1.931, 'avg': 1.931, 'last': 1.931, 'last-5-avg': 1.931, 'last-10-avg': 1.931}, 'info/num_steps_sampled': {'max': 7994, 'min': 7994, 'avg': 7994, 'last': 7994, 'last-5-avg': 7994, 'last-10-avg': 7994}, 'info/num_agent_steps_sampled': {'max': 7994, 'min': 7994, 'avg': 7994, 'last': 7994, 'last-5-avg': 7994, 'last-10-avg': 7994}, 'info/num_steps_trained': {'max': 7994, 'min': 7994, 'avg': 7994, 'last': 7994, 'last-5-avg': 7994, 'last-10-avg': 7994}, 'info/num_agent_steps_trained': {'max': 7994, 'min': 7994, 'avg': 7994, 'last': 7994, 'last-5-avg': 7994, 'last-10-avg': 7994}, 'config/num_workers': {'max': 7, 'min': 7, 'avg': 7, 'last': 7, 'last-5-avg': 7, 'last-10-avg': 7}, 'config/num_envs_per_worker': {'max': 1, 'min': 1, 'avg': 1, 'last': 1, 'last-5-avg': 1, 'last-10-avg': 1}, 'config/create_env_on_driver': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/rollout_fragment_length': {'max': 571, 'min': 571, 'avg': 571, 'last': 571, 'last-5-avg': 571, 'last-10-avg': 571}, 'config/gamma': {'max': 0.99, 'min': 0.99, 'avg': 0.99, 'last': 0.99, 'last-5-avg': 0.99, 'last-10-avg': 0.99}, 'config/lr': {'max': 5e-05, 'min': 5e-05, 'avg': 5e-05, 'last': 5e-05, 'last-5-avg': 5e-05, 'last-10-avg': 5e-05}, 'config/train_batch_size': {'max': 4000, 'min': 4000, 'avg': 4000, 'last': 4000, 'last-5-avg': 4000, 'last-10-avg': 4000}, 'config/soft_horizon': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/no_done_at_end': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/remote_worker_envs': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/remote_env_batch_wait_ms': {'max': 0, 'min': 0, 'avg': 0, 'last': 0, 'last-5-avg': 0, 'last-10-avg': 0}, 'config/render_env': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/record_env': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/normalize_actions': {'max': True, 'min': True, 'avg': True, 'last': True, 'last-5-avg': True, 'last-10-avg': True}, 'config/clip_actions': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/ignore_worker_failures': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/log_sys_usage': {'max': True, 'min': True, 'avg': True, 'last': True, 
'last-5-avg': True, 'last-10-avg': True}, 'config/fake_sampler': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/eager_tracing': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/explore': {'max': True, 'min': True, 'avg': True, 'last': True, 'last-5-avg': True, 'last-10-avg': True}, 'config/evaluation_num_episodes': {'max': 10, 'min': 10, 'avg': 10, 'last': 10, 'last-5-avg': 10, 'last-10-avg': 10}, 'config/evaluation_parallel_to_training': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/in_evaluation': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/evaluation_num_workers': {'max': 0, 'min': 0, 'avg': 0, 'last': 0, 'last-5-avg': 0, 'last-10-avg': 0}, 'config/sample_async': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/synchronize_filters': {'max': True, 'min': True, 'avg': True, 'last': True, 'last-5-avg': True, 'last-10-avg': True}, 'config/compress_observations': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/collect_metrics_timeout': {'max': 180, 'min': 180, 'avg': 180, 'last': 180, 'last-5-avg': 180, 'last-10-avg': 180}, 'config/metrics_smoothing_episodes': {'max': 100, 'min': 100, 'avg': 100, 'last': 100, 'last-5-avg': 100, 'last-10-avg': 100}, 'config/min_iter_time_s': {'max': 0, 'min': 0, 'avg': 0, 'last': 0, 'last-5-avg': 0, 'last-10-avg': 0}, 'config/timesteps_per_iteration': {'max': 0, 'min': 0, 'avg': 0, 'last': 0, 'last-5-avg': 0, 'last-10-avg': 0}, 'config/num_gpus': {'max': 0, 'min': 0, 'avg': 0, 'last': 0, 'last-5-avg': 0, 'last-10-avg': 0}, 'config/_fake_gpus': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/num_cpus_per_worker': {'max': 1, 'min': 1, 'avg': 1, 'last': 1, 'last-5-avg': 1, 'last-10-avg': 1}, 'config/num_gpus_per_worker': {'max': 0, 'min': 0, 'avg': 0, 'last': 0, 'last-5-avg': 0, 'last-10-avg': 0}, 'config/num_cpus_for_driver': {'max': 1, 'min': 1, 'avg': 1, 'last': 1, 'last-5-avg': 1, 'last-10-avg': 1}, 'config/actions_in_input_normalized': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/postprocess_inputs': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/shuffle_buffer_size': {'max': 0, 'min': 0, 'avg': 0, 'last': 0, 'last-5-avg': 0, 'last-10-avg': 0}, 'config/output_max_file_size': {'max': 67108864, 'min': 67108864, 'avg': 67108864, 'last': 67108864, 'last-5-avg': 67108864, 'last-10-avg': 67108864}, 'config/_tf_policy_handles_more_than_one_loss': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/_disable_preprocessor_api': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/simple_optimizer': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/monitor': {'max': -1, 'min': -1, 'avg': -1, 'last': -1, 'last-5-avg': -1, 'last-10-avg': -1}, 'config/use_critic': {'max': True, 'min': True, 'avg': True, 'last': True, 'last-5-avg': True, 'last-10-avg': True}, 'config/use_gae': {'max': True, 'min': True, 'avg': True, 'last': True, 'last-5-avg': 
True, 'last-10-avg': True}, 'config/lambda': {'max': 1.0, 'min': 1.0, 'avg': 1.0, 'last': 1.0, 'last-5-avg': 1.0, 'last-10-avg': 1.0}, 'config/kl_coeff': {'max': 0.2, 'min': 0.2, 'avg': 0.2, 'last': 0.2, 'last-5-avg': 0.2, 'last-10-avg': 0.2}, 'config/sgd_minibatch_size': {'max': 128, 'min': 128, 'avg': 128, 'last': 128, 'last-5-avg': 128, 'last-10-avg': 128}, 'config/shuffle_sequences': {'max': True, 'min': True, 'avg': True, 'last': True, 'last-5-avg': True, 'last-10-avg': True}, 'config/num_sgd_iter': {'max': 1, 'min': 1, 'avg': 1, 'last': 1, 'last-5-avg': 1, 'last-10-avg': 1}, 'config/vf_loss_coeff': {'max': 1.0, 'min': 1.0, 'avg': 1.0, 'last': 1.0, 'last-5-avg': 1.0, 'last-10-avg': 1.0}, 'config/entropy_coeff': {'max': 0.0, 'min': 0.0, 'avg': 0.0, 'last': 0.0, 'last-5-avg': 0.0, 'last-10-avg': 0.0}, 'config/clip_param': {'max': 0.3, 'min': 0.3, 'avg': 0.3, 'last': 0.3, 'last-5-avg': 0.3, 'last-10-avg': 0.3}, 'config/vf_clip_param': {'max': 10.0, 'min': 10.0, 'avg': 10.0, 'last': 10.0, 'last-5-avg': 10.0, 'last-10-avg': 10.0}, 'config/kl_target': {'max': 0.01, 'min': 0.01, 'avg': 0.01, 'last': 0.01, 'last-5-avg': 0.01, 'last-10-avg': 0.01}, 'config/vf_share_layers': {'max': -1, 'min': -1, 'avg': -1, 'last': -1, 'last-5-avg': -1, 'last-10-avg': -1}, 'perf/cpu_util_percent': {'max': 74.125, 'min': 74.125, 'avg': 74.125, 'last': 74.125, 'last-5-avg': 74.125, 'last-10-avg': 74.125}, 'perf/ram_util_percent': {'max': 17.4, 'min': 17.4, 'avg': 17.4, 'last': 17.4, 'last-5-avg': 17.4, 'last-10-avg': 17.4}, 'config/model/_use_default_native_models': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/model/_disable_preprocessor_api': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/model/free_log_std': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/model/no_final_linear': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/model/vf_share_layers': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/model/use_lstm': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/model/max_seq_len': {'max': 20, 'min': 20, 'avg': 20, 'last': 20, 'last-5-avg': 20, 'last-10-avg': 20}, 'config/model/lstm_cell_size': {'max': 256, 'min': 256, 'avg': 256, 'last': 256, 'last-5-avg': 256, 'last-10-avg': 256}, 'config/model/lstm_use_prev_action': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/model/lstm_use_prev_reward': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/model/_time_major': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/model/use_attention': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/model/attention_num_transformer_units': {'max': 1, 'min': 1, 'avg': 1, 'last': 1, 'last-5-avg': 1, 'last-10-avg': 1}, 'config/model/attention_dim': {'max': 64, 'min': 64, 'avg': 64, 'last': 64, 'last-5-avg': 64, 'last-10-avg': 64}, 'config/model/attention_num_heads': {'max': 1, 'min': 1, 'avg': 1, 'last': 1, 'last-5-avg': 1, 'last-10-avg': 1}, 'config/model/attention_head_dim': {'max': 32, 'min': 32, 'avg': 32, 'last': 32, 
'last-5-avg': 32, 'last-10-avg': 32}, 'config/model/attention_memory_inference': {'max': 50, 'min': 50, 'avg': 50, 'last': 50, 'last-5-avg': 50, 'last-10-avg': 50}, 'config/model/attention_memory_training': {'max': 50, 'min': 50, 'avg': 50, 'last': 50, 'last-5-avg': 50, 'last-10-avg': 50}, 'config/model/attention_position_wise_mlp_dim': {'max': 32, 'min': 32, 'avg': 32, 'last': 32, 'last-5-avg': 32, 'last-10-avg': 32}, 'config/model/attention_init_gru_gate_bias': {'max': 2.0, 'min': 2.0, 'avg': 2.0, 'last': 2.0, 'last-5-avg': 2.0, 'last-10-avg': 2.0}, 'config/model/attention_use_n_prev_actions': {'max': 0, 'min': 0, 'avg': 0, 'last': 0, 'last-5-avg': 0, 'last-10-avg': 0}, 'config/model/attention_use_n_prev_rewards': {'max': 0, 'min': 0, 'avg': 0, 'last': 0, 'last-5-avg': 0, 'last-10-avg': 0}, 'config/model/framestack': {'max': True, 'min': True, 'avg': True, 'last': True, 'last-5-avg': True, 'last-10-avg': True}, 'config/model/dim': {'max': 84, 'min': 84, 'avg': 84, 'last': 84, 'last-5-avg': 84, 'last-10-avg': 84}, 'config/model/grayscale': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/model/zero_mean': {'max': True, 'min': True, 'avg': True, 'last': True, 'last-5-avg': True, 'last-10-avg': True}, 'config/model/lstm_use_prev_action_reward': {'max': -1, 'min': -1, 'avg': -1, 'last': -1, 'last-5-avg': -1, 'last-10-avg': -1}, 'config/tf_session_args/intra_op_parallelism_threads': {'max': 2, 'min': 2, 'avg': 2, 'last': 2, 'last-5-avg': 2, 'last-10-avg': 2}, 'config/tf_session_args/inter_op_parallelism_threads': {'max': 2, 'min': 2, 'avg': 2, 'last': 2, 'last-5-avg': 2, 'last-10-avg': 2}, 'config/tf_session_args/log_device_placement': {'max': False, 'min': False, 'avg': False, 'last': False, 'last-5-avg': False, 'last-10-avg': False}, 'config/tf_session_args/allow_soft_placement': {'max': True, 'min': True, 'avg': True, 'last': True, 'last-5-avg': True, 'last-10-avg': True}, 'config/local_tf_session_args/intra_op_parallelism_threads': {'max': 8, 'min': 8, 'avg': 8, 'last': 8, 'last-5-avg': 8, 'last-10-avg': 8}, 'config/local_tf_session_args/inter_op_parallelism_threads': {'max': 8, 'min': 8, 'avg': 8, 'last': 8, 'last-5-avg': 8, 'last-10-avg': 8}, 'config/multiagent/policy_map_capacity': {'max': 100, 'min': 100, 'avg': 100, 'last': 100, 'last-5-avg': 100, 'last-10-avg': 100}, 'config/tf_session_args/gpu_options/allow_growth': {'max': True, 'min': True, 'avg': True, 'last': True, 'last-5-avg': True, 'last-10-avg': True}, 'config/tf_session_args/device_count/CPU': {'max': 1, 'min': 1, 'avg': 1, 'last': 1, 'last-5-avg': 1, 'last-10-avg': 1}, 'info/learner/default_policy/learner_stats/allreduce_latency': {'max': 0.0, 'min': 0.0, 'avg': 0.0, 'last': 0.0, 'last-5-avg': 0.0, 'last-10-avg': 0.0}, 'info/learner/default_policy/learner_stats/cur_kl_coeff': {'max': 0.19999999999999993, 'min': 0.19999999999999993, 'avg': 0.19999999999999993, 'last': 0.19999999999999993, 'last-5-avg': 0.19999999999999993, 'last-10-avg': 0.19999999999999993}, 'info/learner/default_policy/learner_stats/cur_lr': {'max': 5.0000000000000016e-05, 'min': 5.0000000000000016e-05, 'avg': 5.0000000000000016e-05, 'last': 5.0000000000000016e-05, 'last-5-avg': 5.0000000000000016e-05, 'last-10-avg': 5.0000000000000016e-05}, 'info/learner/default_policy/learner_stats/total_loss': {'max': 289.1911126413653, 'min': 289.1911126413653, 'avg': 289.1911126413653, 'last': 289.1911126413653, 'last-5-avg': 289.1911126413653, 'last-10-avg': 289.1911126413653}, 
'info/learner/default_policy/learner_stats/policy_loss': {'max': -0.014920547394262205, 'min': -0.014920547394262205, 'avg': -0.014920547394262205, 'last': -0.014920547394262205, 'last-5-avg': -0.014920547394262205, 'last-10-avg': -0.014920547394262205}, 'info/learner/default_policy/learner_stats/vf_loss': {'max': 289.2044420549947, 'min': 289.2044420549947, 'avg': 289.2044420549947, 'last': 289.2044420549947, 'last-5-avg': 289.2044420549947, 'last-10-avg': 289.2044420549947}, 'info/learner/default_policy/learner_stats/vf_explained_var': {'max': -3.1133813242758476e-05, 'min': -3.1133813242758476e-05, 'avg': -3.1133813242758476e-05, 'last': -3.1133813242758476e-05, 'last-5-avg': -3.1133813242758476e-05, 'last-10-avg': -3.1133813242758476e-05}, 'info/learner/default_policy/learner_stats/kl': {'max': 0.007963308195468869, 'min': 0.007963308195468869, 'avg': 0.007963308195468869, 'last': 0.007963308195468869, 'last-5-avg': 0.007963308195468869, 'last-10-avg': 0.007963308195468869}, 'info/learner/default_policy/learner_stats/entropy': {'max': 0.6850801167949554, 'min': 0.6850801167949554, 'avg': 0.6850801167949554, 'last': 0.6850801167949554, 'last-5-avg': 0.6850801167949554, 'last-10-avg': 0.6850801167949554}, 'info/learner/default_policy/learner_stats/entropy_coeff': {'max': 0.0, 'min': 0.0, 'avg': 0.0, 'last': 0.0, 'last-5-avg': 0.0, 'last-10-avg': 0.0}}
Python doesn't have a good built-in way to recursively add up the sizes of nested objects. I did some breakdown and added things up; overall, my calculation is that for each trial we are looking at around 200 KB from just these two dictionaries for this PPO training. This explains the memory accumulation, at least the part from trial.py.
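For reference, the rough estimate can be done with a small helper like this (a sketch; deep_sizeof is not a Ray or stdlib function):

```python
# Walk dicts and sequence containers with sys.getsizeof, deduplicating shared
# objects by id(). Not exact, but fine for a per-trial order-of-magnitude.
import sys
from collections import deque

def deep_sizeof(obj, seen=None):
    seen = set() if seen is None else seen
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, deque)):
        size += sum(deep_sizeof(x, seen) for x in obj)
    return size

# e.g. deep_sizeof(trial.metric_n_steps) + deep_sizeof(trial.metric_analysis)
```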
@gjoliver @krfricke Are these metrics all necessary? To me, things like config are pretty static; do we still want to keep them in metrics and compute moving averages on them?
Still no idea why registry.py is accumulating memory at line 168.
In my opinion this is a problem with the test, not Tune. Tracking 1 MB of data for each Trial sounds perfectly reasonable to me. I wonder if we really need Tune here; would running RLlib in a loop 10K times also serve the purpose?
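Something like this minimal sketch would exercise the same actor churn without Tune's bookkeeping (assuming the Ray 1.x RLlib Trainer API; config values mirror the test's PPO setup):

```python
import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()
for i in range(10_000):
    trainer = PPOTrainer(config={"env": "CartPole-v0", "num_workers": 7})
    trainer.train()  # many_ppo also runs a single training iteration
    trainer.stop()   # tear down the RolloutWorker actors each loop
```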
@gjoliver do you know who owns this test? Tune side or RLlib? What is the goal of it?
I actually think this is probably owned by the core team for detecting leaks. Sang would know better.
I think this is owned by the ML team (all long-running tests are owned by the ML team, but I will bring ownership of some tests to the core team; this test wasn't one of the candidates). Note that this test has existed for a long time (it was there even before I joined), so you can probably ping Eric or Richard about the quality of the test.
I think what we want at the end of the day is to
The test was originally introduced to detect leaked dead actors. 10K trials seemed like an arbitrary choice at the time.
Spending effort to cut per-trial memory usage from 1 MB to 500 KB feels a bit weird. In reality, nobody would run nearly as many trials in a single tuning session.
How about we run this with 2X the memory, or simply cut num_samples to 5000? This shouldn't cause any loss of utility for this test, and the arbitrary memory consumption would fit better within the limits of our test instance.
That's what we did - we increased memory and the test passed, so we consider it not release blocking.
We still opened a simple PR in #19583 to cut memory usage for this test. We've seen users start up to 100K trials in one tuning session, and 10K doesn't seem to be that rare either.
memory breakdown at the time of crash:
ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ip-172-31-35-26 is used (29.37 / 30.91 GB). The top 10 memory consumers are:
PID MEM COMMAND
160 6.83GiB /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/dash
1177 6.12GiB python workloads/many_ppo.py
144 5.95GiB /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server --redis_address=172.
139 3.04GiB /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *
134 0.57GiB /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *
765631 0.52GiB ray::PPO
765721 0.5GiB ray::RolloutWorker
765726 0.5GiB ray::RolloutWorker
765722 0.5GiB ray::RolloutWorker
765734 0.5GiB ray::RolloutWorker
@richardliaw fyi
@xwjiang2010 what's the commit of this wheel, btw? I recently merged the dashboard memory optimization PR, and I wonder if the dashboard memory usage is still high after that PR: https://github.com/ray-project/ray/pull/19385
@rkooo567 https://s3-us-west-2.amazonaws.com/ray-wheels/releases/1.8.0/afcb3e0a41268e08c14a435c239c52ae3ea0e681/ray-1.8.0-cp37-cp37m-manylinux2014_x86_64.whl I don't see your commit there. Do you want to cherry-pick yours to 1.8.0?
Oh, that's 1.8. Since we decided to mark it as a non-release-blocker, I think it is fine not to cherry-pick this. Thanks for the clarification!
Re @richardliaw: around 8000 out of 10K trials have finished.
OK, this is not a Tune issue; I think the main leakage is in the dashboard or GCS, and the app doesn't even take up half of the node memory.
@rkooo567 so the dashboard leakage will not be release blocking?
The main question for me is why GCS/Redis are leaking so much.
I think the dashboard memory leak should be fixed by this: https://github.com/ray-project/ray/pull/19385 For GCS, I believe @rkooo567 is looking into it right now.
Nice! Hope that got captured in the 1.8 cut.
This was not included in the 1.8.0 cut, do you want to cherry-pick it? @richardliaw @scv119
I actually think that's a good idea; @scv119 ?
@richardliaw the dashboard leakage has been there since the beginning of Ray history, so I am not sure if we should do that. I think the risks are not so high, though. cc @edoakes
The main question for me is why GCS/Redis are leaking so much.
We will start investigating it soon. My guess is that we don't remove actor entries from the GCS at all, but we want to run memory profiling on the GCS first.
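One hedged way to check that guess (assuming the Ray 1.x ray.state.actors() API, which reads the actor table from the GCS; the state-field encoding varies across versions):

```python
# If the GCS never prunes dead actors, this table grows with every trial.
from collections import Counter
import ray

ray.init(address="auto")
actor_table = ray.state.actors()  # actor entries as stored in the GCS
print(f"total actor entries: {len(actor_table)}")
print(Counter(str(entry.get("State")) for entry in actor_table.values()))
```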
After this PR, the Redis memory usage has drastically decreased, and the test seems to pass: https://github.com/ray-project/ray/pull/19699
But it is pretty clear there are GCS server and dashboard memory leaks that are not fixed. They should be related to cleaning up actor states from memory, but we need more investigation. I think this is not a high-priority item to fix for now.
Ray Component
Ray Core
What happened + What you expected to happen
The weekly many_ppo test is failing due to a grpc DEADLINE_EXCEEDED error: https://buildkite.com/ray-project/periodic-ci/builds/1207#69e7d975-4806-497c-bd14-0eaeba2fde30 Session log: https://beta.anyscale.com/o/anyscale-internal/projects/prj_2xR6uT6t7jJuu1aCwWMsle/clusters/ses_SkryRYXzKpiLTR6BeGs9is7B?command-history-section=command_history
Is my understanding correct that Ray Core uses gRPC for inter-node communication? Is this failure more a problem of Ray Core not recovering from inter-node failures, or is it something that applications should handle?
This is causing our weekly release test to be red, potentially delaying 1.8. Please help!
Reproduction script
NA
Anything else
No response
Are you willing to submit a PR?