gjoliver opened this issue 1 year ago
Thanks for raising this @gjoliver. On 1. and 2.: only policies that actually changed (got updated) should be synced (and possibly also have their global vars updated, though I'm not sure about that).
One thought on this: usually, when this massive caching is used, most of the (cached) policies are not trained.
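A sketch of that idea (the names here are purely illustrative, not RLlib's actual API): after a training step, push weights only for the policy IDs that appear in the training results, so untouched (and possibly stashed) policies are never faulted back into memory.

```python
# Illustrative sketch: sync only the policies that were actually updated.
# `train_results` maps policy_id -> training stats (trained policies only),
# `local_weights` maps policy_id -> weights on the local worker, and
# `workers` is a list of objects with a set_weights(dict) method.
# None of these names are RLlib's real API; they model the idea only.

def sync_updated_policies(train_results, local_weights, workers):
    updated = {
        pid: local_weights[pid]
        for pid in train_results
        if pid in local_weights
    }
    for w in workers:
        # Policies absent from `updated` keep their old weights and are
        # never touched, so stashed policies stay stashed.
        w.set_weights(updated)
    return sorted(updated)
```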
@sven1977 Totally agree, what you described is the ideal state. Maybe I didn't make it clear: what I am seeing is that we access all policies even when there is only 1 rollout worker and 1 env, and we are only training 1 policy at a time. For 3., what I meant was that we seem to update target networks for all policies, not just the ones that were trained.
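Point 3 could look something like this sketch (names are illustrative, not RLlib's actual API): call update_target() only on the policies that appear in this iteration's training results, leaving every other (possibly stashed) policy alone.

```python
# Illustrative sketch: restrict target-network updates to the policies that
# were actually trained this iteration.
# `policies` maps policy_id -> object with an update_target() method;
# `train_results` maps policy_id -> stats and contains only trained policies.
# These signatures are hypothetical, not RLlib's real ones.

def update_targets_for_trained(policies, train_results):
    updated = []
    for pid in train_results:
        policy = policies.get(pid)
        if policy is not None:
            policy.update_target()  # untrained/stashed policies untouched
            updated.append(pid)
    return sorted(updated)
```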
Also, I sometimes see a weird PyTorch error when policies are being stashed. Here is a recent example:
(APPO pid=12428) 2023-01-05 18:23:19,998 INFO algorithm.py:994 -- Ran round 5 of parallel evaluation (10/10 episodes done)
2023-01-05 18:23:27,906 ERROR trial_runner.py:1093 -- Trial APPO_multi_cartpole_faa18_00000: Error processing event.
ray.exceptions.RayTaskError(ValueError): ray::APPO.train() (pid=12428, ip=172.18.0.3, repr=APPO)
File "/ray/python/ray/tune/trainable/trainable.py", line 367, in train
raise skipped from exception_cause(skipped)
File "/ray/python/ray/tune/trainable/trainable.py", line 364, in train
result = self.step()
File "/ray/python/ray/rllib/algorithms/algorithm.py", line 732, in step
) = self._run_one_training_iteration_and_evaluation_in_parallel()
File "/ray/python/ray/rllib/algorithms/algorithm.py", line 2745, in _run_one_training_iteration_and_evaluation_in_parallel
train_results, train_iter_ctx = train_future.result()
File "/opt/miniconda/lib/python3.7/concurrent/futures/_base.py", line 435, in result
return self.__get_result()
File "/opt/miniconda/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/opt/miniconda/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/ray/python/ray/rllib/algorithms/algorithm.py", line 2739, in <lambda>
train_future = executor.submit(lambda: self._run_one_training_iteration())
File "/ray/python/ray/rllib/algorithms/algorithm.py", line 2651, in _run_one_training_iteration
results = self.training_step()
File "/ray/python/ray/rllib/algorithms/appo/appo.py", line 271, in training_step
self.after_train_step(train_results)
File "/ray/python/ray/rllib/algorithms/appo/appo.py", line 241, in after_train_step
lambda p, _: p.update_target()
File "/ray/python/ray/rllib/evaluation/rollout_worker.py", line 1566, in foreach_policy_to_train
for pid in self.policy_map.keys()
File "/ray/python/ray/rllib/evaluation/rollout_worker.py", line 1567, in <listcomp>
if self.is_policy_to_train is None or self.is_policy_to_train(pid, None)
File "/ray/python/ray/rllib/utils/threading.py", line 24, in wrapper
return func(self, *a, **k)
File "/ray/python/ray/rllib/policy/policy_map.py", line 128, in __getitem__
policy = self._stash_least_used_policy()
File "/ray/python/ray/rllib/policy/policy_map.py", line 270, in _stash_least_used_policy
policy_state = policy.get_state()
File "/ray/python/ray/rllib/policy/torch_mixins.py", line 98, in get_state
state = super().get_state()
File "/ray/python/ray/rllib/policy/torch_policy_v2.py", line 940, in get_state
optim_state_dict = convert_to_numpy(o.state_dict())
File "/ray/python/ray/rllib/utils/numpy.py", line 158, in convert_to_numpy
return tree.map_structure(mapping, x)
File "/opt/miniconda/lib/python3.7/site-packages/tree/__init__.py", line 435, in map_structure
[func(*args) for args in zip(*map(flatten, structures))])
File "/opt/miniconda/lib/python3.7/site-packages/tree/__init__.py", line 378, in unflatten_as
% (len(flat_structure), len(flat_sequence), structure, flat_sequence))
ValueError: Could not pack sequence. Structure had 33 elements, but flat_sequence had 31 elements. Structure: {'state': {0: {'step': tensor(1.), 'exp_avg': tensor(...), 'exp_avg_sq': tensor(...)}, ..., 5: {'step': tensor(1.), 'exp_avg': tensor([-0.0013]), 'exp_avg_sq': tensor([1.6574e-07])}}, 'param_groups': [{'lr': tensor(0.0005), 'betas': (tensor(0.9000), tensor(0.9990)), 'eps': tensor(1.0000e-08), 'weight_decay': tensor(0), 'amsgrad': tensor(False), 'maximize': tensor(False), 'foreach': None, 'capturable': tensor(False), 'params': [0, 1, 2, 3, 4, 5]}]}, flat_sequence: [False, 0.8999999761581421, 0.9990000128746033, False, 9.99999993922529e-09, None, 0.0005000000237487257, False, 0, 1, 2, 3, 4, 5, 0, ...] (full optimizer-state tensor dump trimmed for readability; the mismatch is the 33-vs-31 element count).
Actually, here is another error that I sometimes see:
(APPO pid=15631) Exception in thread Thread-2:
(APPO pid=15631) Traceback (most recent call last):
(APPO pid=15631) File "/ray/python/ray/rllib/policy/torch_policy_v2.py", line 1197, in _worker
(APPO pid=15631) loss_out[opt_idx].backward(retain_graph=True)
(APPO pid=15631) File "/opt/miniconda/lib/python3.7/site-packages/torch/_tensor.py", line 396, in backward
(APPO pid=15631) torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
(APPO pid=15631) File "/opt/miniconda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 175, in backward
(APPO pid=15631) allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
(APPO pid=15631) RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [32, 1]], which is output 0 of AsStridedBackward0, is at version 1879; expected version 1878 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
(APPO pid=15631)
(APPO pid=15631) The above exception was the direct cause of the following exception:
(APPO pid=15631)
(APPO pid=15631) Traceback (most recent call last):
(APPO pid=15631) File "/opt/miniconda/lib/python3.7/threading.py", line 926, in _bootstrap_inner
(APPO pid=15631) self.run()
(APPO pid=15631) File "/ray/python/ray/rllib/execution/learner_thread.py", line 74, in run
(APPO pid=15631) self.step()
(APPO pid=15631) File "/ray/python/ray/rllib/execution/learner_thread.py", line 91, in step
(APPO pid=15631) multi_agent_results = self.local_worker.learn_on_batch(batch)
(APPO pid=15631) File "/ray/python/ray/rllib/evaluation/rollout_worker.py", line 1025, in learn_on_batch
(APPO pid=15631) info_out[pid] = policy.learn_on_batch(batch)
(APPO pid=15631) File "/ray/python/ray/rllib/utils/threading.py", line 24, in wrapper
(APPO pid=15631) return func(self, *a, **k)
(APPO pid=15631) File "/ray/python/ray/rllib/policy/torch_policy_v2.py", line 629, in learn_on_batch
(APPO pid=15631) grads, fetches = self.compute_gradients(postprocessed_batch)
(APPO pid=15631) File "/ray/python/ray/rllib/utils/threading.py", line 24, in wrapper
(APPO pid=15631) return func(self, *a, **k)
(APPO pid=15631) File "/ray/python/ray/rllib/policy/torch_policy_v2.py", line 833, in compute_gradients
(APPO pid=15631) tower_outputs = self._multi_gpu_parallel_grad_calc([postprocessed_batch])
(APPO pid=15631) File "/ray/python/ray/rllib/policy/torch_policy_v2.py", line 1259, in _multi_gpu_parallel_grad_calc
(APPO pid=15631) raise last_result[0] from last_result[1]
(APPO pid=15631) ValueError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [32, 1]], which is output 0 of AsStridedBackward0, is at version 1879; expected version 1878 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
(APPO pid=15631) tracebackTraceback (most recent call last):
(APPO pid=15631) File "/ray/python/ray/rllib/policy/torch_policy_v2.py", line 1197, in _worker
(APPO pid=15631) loss_out[opt_idx].backward(retain_graph=True)
(APPO pid=15631) File "/opt/miniconda/lib/python3.7/site-packages/torch/_tensor.py", line 396, in backward
(APPO pid=15631) torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
(APPO pid=15631) File "/opt/miniconda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 175, in backward
(APPO pid=15631) allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
(APPO pid=15631) RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [32, 1]], which is output 0 of AsStridedBackward0, is at version 1879; expected version 1878 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
(APPO pid=15631)
(APPO pid=15631) In tower 0 on device cpu
(APPO pid=15631)
2023-01-05 18:32:28,414 ERROR trial_runner.py:1093 -- Trial APPO_multi_cartpole_fd325_00000: Error processing event.
ray.exceptions.RayTaskError(RuntimeError): ray::APPO.train() (pid=15631, ip=172.18.0.3, repr=APPO)
File "/ray/python/ray/tune/trainable/trainable.py", line 367, in train
raise skipped from exception_cause(skipped)
File "/ray/python/ray/tune/trainable/trainable.py", line 364, in train
result = self.step()
File "/ray/python/ray/rllib/algorithms/algorithm.py", line 732, in step
) = self._run_one_training_iteration_and_evaluation_in_parallel()
File "/ray/python/ray/rllib/algorithms/algorithm.py", line 2745, in _run_one_training_iteration_and_evaluation_in_parallel
train_results, train_iter_ctx = train_future.result()
File "/opt/miniconda/lib/python3.7/concurrent/futures/_base.py", line 435, in result
return self.__get_result()
File "/opt/miniconda/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/opt/miniconda/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/ray/python/ray/rllib/algorithms/algorithm.py", line 2739, in <lambda>
train_future = executor.submit(lambda: self._run_one_training_iteration())
File "/ray/python/ray/rllib/algorithms/algorithm.py", line 2651, in _run_one_training_iteration
results = self.training_step()
File "/ray/python/ray/rllib/algorithms/appo/appo.py", line 268, in training_step
train_results = super().training_step()
File "/ray/python/ray/rllib/algorithms/impala/impala.py", line 536, in training_step
raise RuntimeError("The learner thread died while training!")
RuntimeError: The learner thread died while training!
What happened + What you expected to happen
The idea of a policy cache is to stash unused policies on disk or in the object store to alleviate memory pressure. That requires our code to avoid touching stashed policies unnecessarily, and to restore state properly whenever a policy is un-stashed. If we blindly access already-stashed policies, useful policies get stashed and then immediately un-stashed, which slows things down significantly and unnecessarily. While debugging, I noticed at least the following places where we access ALL policies, regardless of whether they are in the cache:
1. Syncing weights from all policies to eval workers. This causes the local and eval workers to stash, then un-stash, every policy. https://github.com/ray-project/ray/blob/50e1fda022a81e5015978cf723f7b5fd9cc06b2c/rllib/algorithms/algorithm.py#L816-L826
2. RolloutWorker sets global vars on all policies, regardless of whether they are stashed. https://github.com/ray-project/ray/blob/50e1fda022a81e5015978cf723f7b5fd9cc06b2c/rllib/evaluation/rollout_worker.py#L1781-L1782
3. The APPO target network update runs on all trainable policies. This causes excessive policy restoring on the training workers. https://github.com/ray-project/ray/blob/50e1fda022a81e5015978cf723f7b5fd9cc06b2c/rllib/algorithms/appo/appo.py#L239-L242
These are the places I have discovered so far. Things seem a lot quieter if I comment out all of this logic.
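To see why blind iteration is so costly, here is a minimal LRU-cache sketch (illustrative only; RLlib's real PolicyMap in rllib/policy/policy_map.py is more involved): touching every policy ID in turn forces each stashed policy to be un-stashed, which in turn evicts a resident one, exactly the thrashing described above.

```python
from collections import OrderedDict

# Toy model of a policy map with an LRU in-memory cache. Any access to a
# stashed policy faults it back in and evicts the least-recently-used
# resident policy. Names and structure are illustrative, not RLlib's.

class TinyPolicyMap:
    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()  # in-memory policies, LRU order
        self.stashed = {}              # "on-disk" policy states
        self.unstash_count = 0

    def add(self, pid, policy):
        self.resident[pid] = policy
        self._evict()

    def __getitem__(self, pid):
        if pid in self.stashed:
            # Faulting a stashed policy back in counts as an un-stash
            # and may push another resident policy out.
            self.resident[pid] = self.stashed.pop(pid)
            self.unstash_count += 1
            self._evict()
        self.resident.move_to_end(pid)
        return self.resident[pid]

    def _evict(self):
        while len(self.resident) > self.capacity:
            old_pid, old_policy = self.resident.popitem(last=False)
            self.stashed[old_pid] = old_policy

    def all_ids(self):
        return list(self.resident) + list(self.stashed)
```

With a capacity of 2 and 5 policies, a single "touch everything" pass already triggers 3 un-stash round trips; scale that to 100 policies per training iteration and the cost dominates.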
Versions / Dependencies
Master
Reproduction script
Add logging lines where we restore and stash policies, then run:
bazel run rllib/learning_tests_multi_agent_cartpole_w_100_policies_appo
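One simple way to add those logging lines without editing every call site is a small decorator (illustrative only; the real stash/restore methods live in rllib/policy/policy_map.py and their signatures may differ from the hypothetical one assumed here):

```python
import functools
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("policy_cache")

def log_policy_cache_event(label):
    """Log every call to a stash/restore method, tagged with `label`.

    Assumes the wrapped function is a method taking the policy id as its
    first argument after self -- a hypothetical signature for illustration,
    not necessarily RLlib's actual one.
    """
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(self, policy_id, *args, **kwargs):
            log.info("%s policy %r", label, policy_id)
            return fn(self, policy_id, *args, **kwargs)
        return wrapper
    return deco

class DemoMap:
    # Example hook point; in practice the decorator would be applied to the
    # real stash/restore methods instead.
    @log_policy_cache_event("stashing")
    def stash(self, policy_id):
        return f"stashed {policy_id}"
```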
Issue Severity
Medium: It is a significant difficulty but I can work around it.