ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[RLlib] Stashed policies are being accessed excessively, defeating the purpose of a policy cache #31432

Open gjoliver opened 1 year ago

gjoliver commented 1 year ago

What happened + What you expected to happen

The idea of a policy cache is to stash unused policies on disk or in the object store to alleviate memory pressure. That requires our code to only touch the policies it actually needs, and to restore state properly whenever a policy is un-stashed. If we blindly access policies that are already stashed, useful policies end up getting stashed and then immediately un-stashed, which slows things down significantly and unnecessarily (a toy sketch of the intended cache behavior follows the list below). While debugging, I noticed at least the following places where we access ALL policies, regardless of whether they are in the cache or not:

  1. Syncing weights for all policies to the eval workers. This causes the local and eval workers to stash and then un-stash all policies (a possible fix is sketched after this list). https://github.com/ray-project/ray/blob/50e1fda022a81e5015978cf723f7b5fd9cc06b2c/rllib/algorithms/algorithm.py#L816-L826

  2. RolloutWorker seems to set global vars on all policies regardless of whether they are stashed. https://github.com/ray-project/ray/blob/50e1fda022a81e5015978cf723f7b5fd9cc06b2c/rllib/evaluation/rollout_worker.py#L1781-L1782

  3. APPO target network update works on all trainable policies. This will cause excessive policy restoring on the training workers. https://github.com/ray-project/ray/blob/50e1fda022a81e5015978cf723f7b5fd9cc06b2c/rllib/algorithms/appo/appo.py#L239-L242

These are the places I have discovered so far. Things do seem a lot quieter if I comment out all of this logic.
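
To make the intended behavior concrete, here is a toy, LRU-style sketch of the caching idea. This is NOT RLlib's actual PolicyMap implementation; it only illustrates why blindly indexing into the map for every policy ID defeats the purpose of stashing:

from collections import OrderedDict


class TinyPolicyCache:
    """Toy LRU-style cache, not RLlib's PolicyMap.

    Each access to a stashed policy forces an expensive restore and
    evicts (stashes) something else to stay under capacity.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._in_memory = OrderedDict()  # policy_id -> live policy object
        self._stashed = {}               # policy_id -> serialized state

    def __getitem__(self, policy_id):
        if policy_id in self._in_memory:
            self._in_memory.move_to_end(policy_id)  # mark as recently used
            return self._in_memory[policy_id]
        # Cache miss: restore the stashed policy (expensive) ...
        policy = self._restore(self._stashed.pop(policy_id))
        # ... and stash the least recently used one to make room.
        if len(self._in_memory) >= self.capacity:
            lru_id, lru_policy = self._in_memory.popitem(last=False)
            self._stashed[lru_id] = self._serialize(lru_policy)
        self._in_memory[policy_id] = policy
        return policy

    def _serialize(self, policy):
        # Placeholder for writing policy state to disk / the object store.
        return policy

    def _restore(self, state):
        # Placeholder for rebuilding a policy from its stashed state.
        return state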
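
For item 1 above, a hedged sketch of the kind of fix implied: only sync weights for the policies that produced training results in the current iteration. It assumes train_results is keyed by policy ID and that WorkerSet.sync_weights() accepts a policies argument (it does in recent RLlib versions, though the exact call site in algorithm.py may look different):

def sync_only_updated_policies(workers, train_results, global_vars=None):
    """Hypothetical helper: push weights only for policies trained this step.

    Avoids touching (and therefore un-stashing) the many policies that
    did not change, which is what the blanket sync currently does.
    """
    updated = list(train_results.keys())  # assumes results keyed by policy ID
    if updated:
        workers.sync_weights(policies=updated, global_vars=global_vars)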

Versions / Dependencies

Master

Reproduction script

Add logging lines where we restore and stash policies, then run: bazel run rllib/learning_tests_multi_agent_cartpole_w_100_policies_appo
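
The logging can be as simple as the following helpers, called from the places where PolicyMap stashes and restores policies (the method name _stash_least_used_policy is taken from the traceback further down in this issue; everything else here is illustrative only):

import logging

logger = logging.getLogger(__name__)


def log_stash(policy_id):
    # Call this inside PolicyMap._stash_least_used_policy().
    logger.info("Stashing least-recently-used policy %s", policy_id)


def log_restore(policy_id):
    # Call this inside PolicyMap.__getitem__(), on the path that pulls a
    # stashed policy back into memory.
    logger.info("Restoring stashed policy %s", policy_id)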

Issue Severity

Medium: It is a significant difficulty but I can work around it.

sven1977 commented 1 year ago

Thanks for raising this @gjoliver. On 1. and 2.: only policies that actually changed (got updated) should be synced (and possibly also have their global vars updated, but I'm not sure about that).

  3. True, but there is no real solution for this: if a policy is trainable AND has actually been trained in that iteration, it must be updated.

Some thoughts on this: usually, when this massive caching is used, most of the (cached) policies are not being trained.

gjoliver commented 1 year ago

@sven1977 Totally agree, what you described is the ideal state. Maybe I didn't make it clear: what I am seeing is that we access all policies even when there is only 1 rollout worker and 1 env, and we are only training 1 policy at a time. For 3, what I meant was that we seem to be updating target networks for all policies, not just the ones that were trained.
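
For item 3, a hedged sketch of the behavior being asked for: update target networks only for policies that actually produced training results, instead of looping over every trainable policy (which, per the traceback below, goes through policy_map.__getitem__ and restores stashed policies just to call update_target() on them). Assumes train_results is keyed by policy ID:

def update_targets_for_trained_policies(local_worker, train_results):
    """Hypothetical sketch: only touch the policies trained this iteration.

    Looping over every trainable policy (as after_train_step currently does
    via foreach_policy_to_train) pulls stashed policies back into memory
    just to call update_target() on them.
    """
    for policy_id in train_results.keys():
        local_worker.policy_map[policy_id].update_target()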

gjoliver commented 1 year ago

Also, I sometimes see a weird PyTorch error when policies are being stashed. Here is a recent example:

(APPO pid=12428) 2023-01-05 18:23:19,998    INFO algorithm.py:994 -- Ran round 5 of parallel evaluation (10/10 episodes done)
2023-01-05 18:23:27,906 ERROR trial_runner.py:1093 -- Trial APPO_multi_cartpole_faa18_00000: Error processing event.
ray.exceptions.RayTaskError(ValueError): ray::APPO.train() (pid=12428, ip=172.18.0.3, repr=APPO)
  File "/ray/python/ray/tune/trainable/trainable.py", line 367, in train
    raise skipped from exception_cause(skipped)
  File "/ray/python/ray/tune/trainable/trainable.py", line 364, in train
    result = self.step()
  File "/ray/python/ray/rllib/algorithms/algorithm.py", line 732, in step
    ) = self._run_one_training_iteration_and_evaluation_in_parallel()
  File "/ray/python/ray/rllib/algorithms/algorithm.py", line 2745, in _run_one_training_iteration_and_evaluation_in_parallel
    train_results, train_iter_ctx = train_future.result()
  File "/opt/miniconda/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/opt/miniconda/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/opt/miniconda/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/ray/python/ray/rllib/algorithms/algorithm.py", line 2739, in <lambda>
    train_future = executor.submit(lambda: self._run_one_training_iteration())
  File "/ray/python/ray/rllib/algorithms/algorithm.py", line 2651, in _run_one_training_iteration
    results = self.training_step()
  File "/ray/python/ray/rllib/algorithms/appo/appo.py", line 271, in training_step
    self.after_train_step(train_results)
  File "/ray/python/ray/rllib/algorithms/appo/appo.py", line 241, in after_train_step
    lambda p, _: p.update_target()
  File "/ray/python/ray/rllib/evaluation/rollout_worker.py", line 1566, in foreach_policy_to_train
    for pid in self.policy_map.keys()
  File "/ray/python/ray/rllib/evaluation/rollout_worker.py", line 1567, in <listcomp>
    if self.is_policy_to_train is None or self.is_policy_to_train(pid, None)
  File "/ray/python/ray/rllib/utils/threading.py", line 24, in wrapper
    return func(self, *a, **k)
  File "/ray/python/ray/rllib/policy/policy_map.py", line 128, in __getitem__
    policy = self._stash_least_used_policy()
  File "/ray/python/ray/rllib/policy/policy_map.py", line 270, in _stash_least_used_policy
    policy_state = policy.get_state()
  File "/ray/python/ray/rllib/policy/torch_mixins.py", line 98, in get_state
    state = super().get_state()
  File "/ray/python/ray/rllib/policy/torch_policy_v2.py", line 940, in get_state
    optim_state_dict = convert_to_numpy(o.state_dict())
  File "/ray/python/ray/rllib/utils/numpy.py", line 158, in convert_to_numpy
    return tree.map_structure(mapping, x)
  File "/opt/miniconda/lib/python3.7/site-packages/tree/__init__.py", line 435, in map_structure
    [func(*args) for args in zip(*map(flatten, structures))])
  File "/opt/miniconda/lib/python3.7/site-packages/tree/__init__.py", line 378, in unflatten_as
    % (len(flat_structure), len(flat_sequence), structure, flat_sequence))
ValueError: Could not pack sequence. Structure had 33 elements, but flat_sequence had 31 elements.  Structure: {'state': {0: {'step': tensor(1.), 'exp_avg': tensor([[ 0.0122,  0.0217, -0.0025,  0.0171, -0.0243,  0.0072,  0.0181,  0.0174,
         -0.0307, -0.0154, -0.0078, -0.0103, -0.0234, -0.0006,  0.0010, -0.0028,
         -0.0038,  0.0018,  0.0098,  0.0222, -0.0049, -0.0123, -0.0107, -0.0080,
         -0.0121,  0.0246,  0.0128, -0.0106, -0.0119, -0.0138,  0.0152,  0.0156],
        [-0.0122, -0.0217,  0.0025, -0.0171,  0.0243, -0.0072, -0.0181, -0.0174,
          0.0307,  0.0154,  0.0078,  0.0103,  0.0234,  0.0006, -0.0010,  0.0028,
          0.0038, -0.0018, -0.0098, -0.0222,  0.0049,  0.0123,  0.0107,  0.0080,
          0.0121, -0.0246, -0.0128,  0.0106,  0.0119,  0.0138, -0.0152, -0.0156]]), 'exp_avg_sq': tensor([[1.4831e-05, 4.7282e-05, 6.0866e-07, 2.9107e-05, 5.8998e-05, 5.1565e-06,
         3.2875e-05, 3.0250e-05, 9.4264e-05, 2.3790e-05, 6.1299e-06, 1.0552e-05,
         5.4822e-05, 3.0379e-08, 1.0276e-07, 8.0538e-07, 1.4643e-06, 3.2547e-07,
         9.6227e-06, 4.9154e-05, 2.3947e-06, 1.5020e-05, 1.1354e-05, 6.4198e-06,
         1.4738e-05, 6.0275e-05, 1.6289e-05, 1.1181e-05, 1.4119e-05, 1.9008e-05,
         2.3229e-05, 2.4313e-05],
        [1.4831e-05, 4.7282e-05, 6.0866e-07, 2.9107e-05, 5.8998e-05, 5.1565e-06,
         3.2875e-05, 3.0250e-05, 9.4264e-05, 2.3790e-05, 6.1299e-06, 1.0552e-05,
         5.4822e-05, 3.0380e-08, 1.0276e-07, 8.0538e-07, 1.4643e-06, 3.2547e-07,
         9.6227e-06, 4.9154e-05, 2.3947e-06, 1.5020e-05, 1.1354e-05, 6.4198e-06,
         1.4738e-05, 6.0275e-05, 1.6289e-05, 1.1181e-05, 1.4119e-05, 1.9008e-05,
         2.3229e-05, 2.4313e-05]])}, 1: {'step': tensor(1.), 'exp_avg': tensor([ 0.0074, -0.0074]), 'exp_avg_sq': tensor([5.5057e-06, 5.5057e-06])}, 2: {'step': tensor(1.), 'exp_avg': tensor([[ 1.9525e-06, -1.6538e-07, -5.8738e-06, -1.1485e-06],
        [-1.2899e-05,  8.5857e-05,  4.5800e-05, -6.6113e-05],
        [ 3.0009e-05, -1.6508e-04, -1.0369e-04,  1.2367e-04],
        [-1.5651e-06,  2.4955e-05,  6.7569e-06, -2.0662e-05],
        [-3.8714e-06,  9.7371e-06,  1.2423e-05, -5.9038e-06],
        [-1.2008e-05,  6.1319e-05,  4.1101e-05, -4.5368e-05],
        [-2.1951e-05,  1.4789e-04,  7.8088e-05, -1.1406e-04],
        [-1.3584e-05,  6.5115e-05,  4.6146e-05, -4.7625e-05],
        [-9.3913e-06,  3.6435e-05,  3.1194e-05, -2.5463e-05],
        [ 4.5329e-06, -3.1756e-05, -1.6226e-05,  2.4611e-05],
        [-1.3164e-05,  7.6140e-05,  4.5795e-05, -5.7488e-05],
        [-8.5123e-06,  4.2404e-05,  2.9048e-05, -3.1235e-05],
        [ 1.7833e-05, -1.0684e-04, -6.2341e-05,  8.1092e-05],
        [-2.2749e-05,  1.4406e-04,  8.0168e-05, -1.1020e-04],
        [ 8.6114e-06, -5.1413e-05, -3.0089e-05,  3.9003e-05],
        [-9.0856e-06,  6.3439e-05,  3.2505e-05, -4.9145e-05],
        [ 3.4558e-06, -3.1527e-05, -1.2974e-05,  2.5125e-05],
        [ 1.5993e-06,  2.4236e-06, -4.6002e-06, -3.1658e-06],
        [ 1.4900e-05, -1.1024e-04, -5.3817e-05,  8.5990e-05],
        [-9.1127e-06,  6.4515e-05,  3.2675e-05, -5.0062e-05],
        [-1.5572e-06, -2.7984e-06,  4.4428e-06,  3.4638e-06],
        [-1.8521e-06, -7.7659e-07,  5.4949e-06,  1.9011e-06],
        [-6.4692e-06,  2.7594e-05,  2.1694e-05, -1.9711e-05],
        [-9.3241e-06,  4.2915e-05,  3.1527e-05, -3.1142e-05],
        [ 2.7013e-05, -1.5607e-04, -9.3958e-05,  1.1782e-04],
        [-7.7229e-08,  1.0381e-05,  1.0885e-06, -8.9752e-06],
        [ 8.1916e-06, -4.7950e-05, -2.8543e-05,  3.6270e-05],
        [-2.9626e-06,  2.8798e-06,  9.1296e-06, -5.4307e-07],
        [-1.8932e-05,  1.1417e-04,  6.6245e-05, -8.6738e-05],
        [ 6.5439e-06, -3.4794e-05, -2.2512e-05,  2.5921e-05],
        [-1.1140e-06,  1.3402e-05,  4.4495e-06, -1.0915e-05],
        [ 7.7308e-06, -4.2456e-05, -2.6707e-05,  3.1797e-05]]), 'exp_avg_sq': tensor([[3.8121e-13, 2.7349e-15, 3.4501e-12, 1.3190e-13],
        [1.6638e-11, 7.3714e-10, 2.0976e-10, 4.3709e-10],
        [9.0055e-11, 2.7253e-09, 1.0752e-09, 1.5295e-09],
        [2.4495e-13, 6.2273e-11, 4.5655e-12, 4.2689e-11],
        [1.4988e-12, 9.4810e-12, 1.5433e-11, 3.4854e-12],
        [1.4419e-11, 3.7600e-10, 1.6893e-10, 2.0582e-10],
        [4.8184e-11, 2.1873e-09, 6.0977e-10, 1.3010e-09],
        [1.8453e-11, 4.2400e-10, 2.1294e-10, 2.2681e-10],
        [8.8196e-12, 1.3275e-10, 9.7304e-11, 6.4837e-11],
        [2.0547e-12, 1.0084e-10, 2.6327e-11, 6.0568e-11],
        [1.7330e-11, 5.7972e-10, 2.0972e-10, 3.3048e-10],
        [7.2459e-12, 1.7981e-10, 8.4379e-11, 9.7561e-11],
        [3.1801e-11, 1.1415e-09, 3.8863e-10, 6.5758e-10],
        [5.1753e-11, 2.0753e-09, 6.4269e-10, 1.2144e-09],
        [7.4156e-12, 2.6433e-10, 9.0535e-11, 1.5212e-10],
        [8.2548e-12, 4.0245e-10, 1.0566e-10, 2.4152e-10],
        [1.1943e-12, 9.9394e-11, 1.6832e-11, 6.3124e-11],
        [2.5578e-13, 5.8735e-13, 2.1162e-12, 1.0022e-12],
        [2.2200e-11, 1.2153e-09, 2.8963e-10, 7.3941e-10],
        [8.3041e-12, 4.1621e-10, 1.0676e-10, 2.5062e-10],
        [2.4248e-13, 7.8310e-13, 1.9738e-12, 1.1998e-12],
        [3.4303e-13, 6.0308e-14, 3.0193e-12, 3.6141e-13],
        [4.1849e-12, 7.6144e-11, 4.7061e-11, 3.8851e-11],
        [8.6938e-12, 1.8416e-10, 9.9393e-11, 9.6980e-11],
        [7.2972e-11, 2.4358e-09, 8.8279e-10, 1.3882e-09],
        [5.9642e-16, 1.0777e-11, 1.1848e-13, 8.0553e-12],
        [6.7102e-12, 2.2992e-10, 8.1471e-11, 1.3155e-10],
        [8.7769e-13, 8.2932e-13, 8.3348e-12, 2.9492e-14],
        [3.5843e-11, 1.3035e-09, 4.3884e-10, 7.5234e-10],
        [4.2823e-12, 1.2106e-10, 5.0680e-11, 6.7191e-11],
        [1.2410e-13, 1.7960e-11, 1.9798e-12, 1.1914e-11],
        [5.9765e-12, 1.8025e-10, 7.1325e-11, 1.0111e-10]])}, 3: {'step': tensor(1.), 'exp_avg': tensor([ 3.7066e-06, -2.5770e-05,  5.9431e-05, -3.3470e-06, -7.4919e-06,
        -2.3709e-05, -4.3883e-05, -2.6757e-05, -1.8368e-05,  9.0802e-06,
        -2.6127e-05, -1.6791e-05,  3.5449e-05, -4.5339e-05,  1.7115e-05,
        -1.8197e-05,  7.0334e-06,  2.9974e-06,  2.9935e-05, -1.8265e-05,
        -2.9118e-06, -3.5019e-06, -1.2691e-05, -1.8339e-05,  5.3611e-05,
        -3.0369e-07,  1.6266e-05, -5.6640e-06, -3.7645e-05,  1.2941e-05,
        -2.3163e-06,  1.5309e-05]), 'exp_avg_sq': tensor([1.3739e-12, 6.6410e-11, 3.5320e-10, 1.1202e-12, 5.6128e-12, 5.6211e-11,
        1.9257e-10, 7.1594e-11, 3.3738e-11, 8.2449e-12, 6.8262e-11, 2.8193e-11,
        1.2566e-10, 2.0556e-10, 2.9293e-11, 3.3113e-11, 4.9468e-12, 8.9844e-13,
        8.9612e-11, 3.3359e-11, 8.4783e-13, 1.2263e-12, 1.6105e-11, 3.3630e-11,
        2.8741e-10, 9.2224e-15, 2.6459e-11, 3.2080e-12, 1.4172e-10, 1.6748e-11,
        5.3651e-13, 2.3437e-11])}, 4: {'step': tensor(1.), 'exp_avg': tensor([[ 0.0011,  0.0001, -0.0023,  0.0026,  0.0007,  0.0018,  0.0029,  0.0030,
         -0.0004, -0.0019, -0.0021,  0.0002, -0.0022,  0.0025, -0.0021,  0.0018,
         -0.0025,  0.0027,  0.0007,  0.0002,  0.0013, -0.0002, -0.0022,  0.0012,
         -0.0008,  0.0003, -0.0020, -0.0010,  0.0010,  0.0018,  0.0007, -0.0014]]), 'exp_avg_sq': tensor([[1.2475e-07, 1.4849e-09, 5.3999e-07, 6.6298e-07, 4.4654e-08, 3.1181e-07,
         8.1904e-07, 8.7797e-07, 1.9297e-08, 3.4853e-07, 4.6026e-07, 3.2833e-09,
         4.9501e-07, 6.4667e-07, 4.2792e-07, 3.2192e-07, 6.4512e-07, 7.0653e-07,
         4.4208e-08, 3.1667e-09, 1.6240e-07, 3.7648e-09, 4.8225e-07, 1.5568e-07,
         6.1841e-08, 1.0797e-08, 3.8260e-07, 1.0418e-07, 9.8469e-08, 3.3583e-07,
         5.3131e-08, 1.9677e-07]])}, 5: {'step': tensor(1.), 'exp_avg': tensor([-0.0013]), 'exp_avg_sq': tensor([1.6574e-07])}}, 'param_groups': [{'lr': tensor(0.0005), 'betas': (tensor(0.9000), tensor(0.9990)), 'eps': tensor(1.0000e-08), 'weight_decay': tensor(0), 'amsgrad': tensor(False), 'maximize': tensor(False), 'foreach': None, 'capturable': tensor(False), 'params': [0, 1, 2, 3, 4, 5]}]}, flat_sequence: [False, 0.8999999761581421, 0.9990000128746033, False, 9.99999993922529e-09, None, 0.0005000000237487257, False, 0, 1, 2, 3, 4, 5, 0, array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
      dtype=float32), array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
      dtype=float32), 1.0, array([0., 0.], dtype=float32), array([0., 0.], dtype=float32), 0.0, array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]], dtype=float32), array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]], dtype=float32), 0.0, array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
      dtype=float32), array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
      dtype=float32), 0.0, array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
      dtype=float32), array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
      dtype=float32), 0.0, 0.0].

gjoliver commented 1 year ago

Actually, here is another error that I sometimes see:

(APPO pid=15631) Exception in thread Thread-2:
(APPO pid=15631) Traceback (most recent call last):
(APPO pid=15631)   File "/ray/python/ray/rllib/policy/torch_policy_v2.py", line 1197, in _worker
(APPO pid=15631)     loss_out[opt_idx].backward(retain_graph=True)
(APPO pid=15631)   File "/opt/miniconda/lib/python3.7/site-packages/torch/_tensor.py", line 396, in backward
(APPO pid=15631)     torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
(APPO pid=15631)   File "/opt/miniconda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 175, in backward
(APPO pid=15631)     allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
(APPO pid=15631) RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [32, 1]], which is output 0 of AsStridedBackward0, is at version 1879; expected version 1878 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
(APPO pid=15631) 
(APPO pid=15631) The above exception was the direct cause of the following exception:
(APPO pid=15631) 
(APPO pid=15631) Traceback (most recent call last):
(APPO pid=15631)   File "/opt/miniconda/lib/python3.7/threading.py", line 926, in _bootstrap_inner
(APPO pid=15631)     self.run()
(APPO pid=15631)   File "/ray/python/ray/rllib/execution/learner_thread.py", line 74, in run
(APPO pid=15631)     self.step()
(APPO pid=15631)   File "/ray/python/ray/rllib/execution/learner_thread.py", line 91, in step
(APPO pid=15631)     multi_agent_results = self.local_worker.learn_on_batch(batch)
(APPO pid=15631)   File "/ray/python/ray/rllib/evaluation/rollout_worker.py", line 1025, in learn_on_batch
(APPO pid=15631)     info_out[pid] = policy.learn_on_batch(batch)
(APPO pid=15631)   File "/ray/python/ray/rllib/utils/threading.py", line 24, in wrapper
(APPO pid=15631)     return func(self, *a, **k)
(APPO pid=15631)   File "/ray/python/ray/rllib/policy/torch_policy_v2.py", line 629, in learn_on_batch
(APPO pid=15631)     grads, fetches = self.compute_gradients(postprocessed_batch)
(APPO pid=15631)   File "/ray/python/ray/rllib/utils/threading.py", line 24, in wrapper
(APPO pid=15631)     return func(self, *a, **k)
(APPO pid=15631)   File "/ray/python/ray/rllib/policy/torch_policy_v2.py", line 833, in compute_gradients
(APPO pid=15631)     tower_outputs = self._multi_gpu_parallel_grad_calc([postprocessed_batch])
(APPO pid=15631)   File "/ray/python/ray/rllib/policy/torch_policy_v2.py", line 1259, in _multi_gpu_parallel_grad_calc
(APPO pid=15631)     raise last_result[0] from last_result[1]
(APPO pid=15631) ValueError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [32, 1]], which is output 0 of AsStridedBackward0, is at version 1879; expected version 1878 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
(APPO pid=15631)  tracebackTraceback (most recent call last):
(APPO pid=15631)   File "/ray/python/ray/rllib/policy/torch_policy_v2.py", line 1197, in _worker
(APPO pid=15631)     loss_out[opt_idx].backward(retain_graph=True)
(APPO pid=15631)   File "/opt/miniconda/lib/python3.7/site-packages/torch/_tensor.py", line 396, in backward
(APPO pid=15631)     torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
(APPO pid=15631)   File "/opt/miniconda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 175, in backward
(APPO pid=15631)     allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
(APPO pid=15631) RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [32, 1]], which is output 0 of AsStridedBackward0, is at version 1879; expected version 1878 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
(APPO pid=15631) 
(APPO pid=15631) In tower 0 on device cpu
(APPO pid=15631) 
2023-01-05 18:32:28,414 ERROR trial_runner.py:1093 -- Trial APPO_multi_cartpole_fd325_00000: Error processing event.
ray.exceptions.RayTaskError(RuntimeError): ray::APPO.train() (pid=15631, ip=172.18.0.3, repr=APPO)
  File "/ray/python/ray/tune/trainable/trainable.py", line 367, in train
    raise skipped from exception_cause(skipped)
  File "/ray/python/ray/tune/trainable/trainable.py", line 364, in train
    result = self.step()
  File "/ray/python/ray/rllib/algorithms/algorithm.py", line 732, in step
    ) = self._run_one_training_iteration_and_evaluation_in_parallel()
  File "/ray/python/ray/rllib/algorithms/algorithm.py", line 2745, in _run_one_training_iteration_and_evaluation_in_parallel
    train_results, train_iter_ctx = train_future.result()
  File "/opt/miniconda/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/opt/miniconda/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/opt/miniconda/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/ray/python/ray/rllib/algorithms/algorithm.py", line 2739, in <lambda>
    train_future = executor.submit(lambda: self._run_one_training_iteration())
  File "/ray/python/ray/rllib/algorithms/algorithm.py", line 2651, in _run_one_training_iteration
    results = self.training_step()
  File "/ray/python/ray/rllib/algorithms/appo/appo.py", line 268, in training_step
    train_results = super().training_step()
  File "/ray/python/ray/rllib/algorithms/impala/impala.py", line 536, in training_step
    raise RuntimeError("The learner thread died while training!")
RuntimeError: The learner thread died while training!