Closed vwxyzjn closed 1 year ago
The latest updates on your projects. Learn more about Vercel for Git ↗️

Name | Status | Preview | Comments | Updated
---|---|---|---|---
cleanrl | ✅ Ready (Inspect) | Visit Preview | 💬 Add your feedback | Feb 19, 2023 at 2:03AM (UTC)
Hey @51616, happy new year. This is the implementation that works with EnvPool's async API, which requires quite a bit of refactoring. It should scale to cases where the environments are slow and/or the models are large. I'm running some benchmark experiments, but figured you might be interested in taking a look :)
@vwxyzjn Any specific part you want me to take a look at? Btw, I'm not quite familiar with async environments but I can help you review/test the code if needed.
Added `cleanrl/sebulba_ppo_envpool.py` for the podracer architecture, which could potentially help with #350. It does not work at all yet.
Todo items:

- `SPS_update` calculation for the actor ✅ helps!
- `jnp.array_split` within JIT 🤔 same speed!
- hypothesis: if there is an `update` function that runs on GPU1, then the execution of `update` will block the `jax.device_put_sharded` call in a separate thread that tries to put data from GPU0 to GPU1. Not sure if this is the case for TPU as well.

Seems to work ok now... This opens up new possibilities because we can use SPMD for learner updates.
CC @kinalmehta @51616 @shermansiu. Btw @shermansiu, don't worry about using this for muesli yet; a lot of work needs to be done on this PR before it is stable and usable.
Thanks, good to know!
Some SPS improvement... Interestingly, using 1 GPU for both the actor and the learner performs only ever so slightly slower than using 1 GPU for the actor and 1 GPU for the learner.
Furthermore, c2b18b5 experimented with pmap (SPMD) and worked really well with 2 GPUs (GPU A used for inference, GPUs A and B used for SPMD)! Almost twice as fast as the baseline.
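An SPMD learner update of this shape might look like the sketch below. This is a toy illustration under assumed names (`update`, `multi_device_update`, a quadratic loss); the real PR replicates a full PPO train state across the learner GPUs.

```python
import jax
import jax.numpy as jnp

devices = jax.local_devices()  # the learner devices (2 GPUs in the experiment above)


def update(params, batch):
    # toy quadratic loss standing in for the PPO loss
    grads = jax.grad(lambda p: jnp.mean((batch @ p) ** 2))(params)
    # average gradients across learner devices: this is the SPMD step
    grads = jax.lax.pmean(grads, axis_name="devices")
    return params - 0.01 * grads


# pmap compiles one program and runs it on every learner device
multi_device_update = jax.pmap(update, axis_name="devices", devices=devices)

params = jax.device_put_replicated(jnp.ones(4), devices)  # same params everywhere
batch = jnp.ones((len(devices), 8, 4))                    # one data shard per device
params = multi_device_update(params, batch)
```

Each device computes gradients on its own shard and `pmean` keeps the replicated parameters in sync, which is why adding a second learner GPU nearly doubles throughput here.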
`threading` is fine (`multiprocessing` not necessary). `sebulba_ppo_envpool_new.py` in ab732a6 basically runs the learners non-stop while trying to step the actors as fast as possible. There's no communication between the actor and learners, so I was just testing to see if the learners would slow down the actor.
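The no-communication stress test can be sketched with plain `threading`, which is the point of the note above: separate Python threads for the actor and learner loops, no queue between them. All names here are hypothetical stand-ins for the real rollout and update loops.

```python
import threading
import time

steps = {"actor": 0, "learner": 0}


def actor(stop):
    # step environments as fast as possible; nothing is handed to the learner
    while not stop.is_set():
        steps["actor"] += 1


def learner(stop):
    # run updates non-stop on fake data; nothing is read from the actor
    while not stop.is_set():
        steps["learner"] += 1


stop = threading.Event()
threads = [
    threading.Thread(target=actor, args=(stop,)),
    threading.Thread(target=learner, args=(stop,)),
]
for t in threads:
    t.start()
time.sleep(0.2)
stop.set()
for t in threads:
    t.join()
```

If the learner loop were to slow the actor loop down even without any shared data, the cause would have to be device or GIL contention rather than the pipeline itself, which is what the experiment isolates.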
There were two settings:

- `--actor-device-ids 0 --learner-device-ids 1 2`: actor and learners have separate devices
- `--actor-device-ids 0 --learner-device-ids 0 1`: actor shares one of the learners' devices

The following figures suggest `--actor-device-ids 0 --learner-device-ids 1 2` has much higher SPS, meaning that having separate devices is key to not slowing down the actor, especially when the learning time is long (e.g., when learning time is short, `--actor-device-ids 0 --learner-device-ids 0 1` might perform just as fast).
I think under the hood this means that calling `multi_device_update` will utilize the learners' devices and block access to those devices from other Python threads, effectively slowing down the actor. However, if the actor has its own device, the actor's speed is unaffected.
Supported multiple threads on an actor GPU, per the podracer paper. Learning properly with multiple threads on an actor GPU is not tested for 1d85943.

> To generate experience, we use (at least) one separate Python thread for each actor core [...] To make efficient use of the actor cores, it is essential that while a Python thread is stepping a batch of environments, the corresponding TPU core is not idle. This is achieved by creating multiple Python threads per actor core, each with its own batched environment. The threads alternate in using the same actor core, without manual synchronization.
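The alternation the paper describes can be sketched with a lock standing in for the single actor device; the lock and names below are illustrative, not the PR's implementation. Each thread steps its own batched environment outside the lock, so while one thread is stepping, the other can run inference on the shared core.

```python
import threading

actor_core = threading.Lock()  # stand-in for the single shared actor device
log = []


def rollout(thread_id, env_steps=3):
    for step in range(env_steps):
        # environment stepping happens outside the lock, so the other
        # thread can use the actor core in the meantime
        with actor_core:
            # inference on the shared actor core
            log.append((thread_id, step))


threads = [threading.Thread(target=rollout, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

No manual scheduling is needed: the lock serializes inference while env stepping overlaps naturally, which is exactly the "alternate without manual synchronization" behavior.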
a93a1f5 calls `jax.device_put_sharded` in the learner, which makes the actor thread run a lot faster. The learning time also suffers very little, which is great. My hypothesis is that an `update` function running on GPU1 will block a `jax.device_put_sharded` call in a separate thread that tries to put data from GPU0 to GPU1 (not sure if this is the case for TPU as well); a93a1f5 moves the `device_put_sharded` call out of the actor threads, therefore unblocking the actor. Not sure what the implication is for multi-GPU or TPU.

> **Note** This is a major difference from the podracer paper, in which I presume the `jax.device_put_sharded` happens in the actor threads.
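The learner-side transfer might look like the sketch below. This is a toy illustration (`learn` and the batch shapes are hypothetical): the host batch is split and placed onto the learner devices inside the learner thread, so the actor thread never has to wait on the transfer.

```python
import jax
import jax.numpy as jnp

learner_devices = jax.local_devices()


def learn(host_batch):
    # shard the rollout batch across learner devices *in the learner thread*,
    # so the actor thread is never blocked on the host-to-device transfer
    shards = jnp.array_split(host_batch, len(learner_devices))
    sharded = jax.device_put_sharded(shards, learner_devices)
    return sharded


# batch size chosen as a multiple of the device count so shards are equal
out = learn(jnp.ones((8 * len(learner_devices), 4)))
```

`jax.device_put_sharded` takes one array per device and returns a stacked array with a leading device axis, ready to feed into a pmapped update.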
0eefb50 uses `jax.device_put_replicated` for the `agent_state`, which slightly improves SPS.
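A minimal sketch of what that replication might look like, with a hypothetical `agent_state` pytree standing in for the real PPO train state: the state is copied to every learner device once, instead of being re-transferred on each update.

```python
import jax
import jax.numpy as jnp

learner_devices = jax.local_devices()

# hypothetical agent state; the real one holds the network params and optimizer state
agent_state = {"params": jnp.ones(4), "step": jnp.zeros(())}

# device_put_replicated copies the whole pytree to each device and adds a
# leading device axis, matching what a pmapped update function expects
replicated = jax.device_put_replicated(agent_state, learner_devices)
```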
0dd591c blocks the actor for 10 seconds during the first, second, and third rollouts. Experiments found that it could improve SPS a bit and actually improves performance as well; without this commit, the actor is always generating experience for outdated parameters (1 update behind; see `stats/param_queue_size`).
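The parameter handoff behind `stats/param_queue_size` can be sketched with a bounded `queue.Queue`; the names and loop counts below are hypothetical. A blocking `get()` makes the actor wait for fresh parameters instead of rolling out on stale ones, which is the effect the 10-second block approximates.

```python
import queue
import threading

param_queue = queue.Queue(maxsize=1)  # queue size is what stats/param_queue_size tracks
seen = []


def learner(num_updates=3):
    for update in range(1, num_updates + 1):
        params = update             # pretend each update produces new params
        param_queue.put(params)     # publish them to the actor


def actor(num_rollouts=3):
    for _ in range(num_rollouts):
        params = param_queue.get()  # block until fresh params arrive
        seen.append(params)         # roll out with up-to-date params


threads = [threading.Thread(target=learner), threading.Thread(target=actor)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With `maxsize=1` the actor can be at most one publish behind the learner; without any blocking, the lag the commit message describes (1 update behind) creeps in.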
Putting these items at the top of the PR.
Was able to match the performance of DeepMind's sebulba architecture (https://arxiv.org/pdf/2104.06272.pdf) to some capacity. My prototype (CleanBa PPO, which stands for CleanRL's sebulba PPO) can outperform the original IMPALA (deep net setting; Espeholt et al., 2018) with 5 A100 GPUs (1 actor GPU, 4 learner GPUs) and 16 CPU cores for envpool. Still WIP; need to fix some bugs.
Current results:
Got an error
```
2023-02-17 06:08:24.084633: E external/org_tensorflow/tensorflow/tsl/distributed_runtime/coordination/coordination_service_agent.cc:481] Failed to disconnect from coordination service with status: DEADLINE_EXCEEDED: Deadline Exceeded
Additional GRPC error information from remote target unknown_target_for_coordination_leader:
:{"created":"@1676614104.084285325","description":"Error received from peer ipv4:26.0.134.228:64719","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Deadline Exceeded","grpc_status":4}. Proceeding with agent shutdown anyway.
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/admin/home-costa/.cache/pypoetry/virtualenvs/cleanrl-BE0ShDkT-py3.8/lib/python3.8/site-packages/jax/_src/distributed.py", line 168, in shutdown
    global_state.shutdown()
  File "/admin/home-costa/.cache/pypoetry/virtualenvs/cleanrl-BE0ShDkT-py3.8/lib/python3.8/site-packages/jax/_src/distributed.py", line 87, in shutdown
    self.client.shutdown()
jaxlib.xla_extension.XlaRuntimeError: DEADLINE_EXCEEDED: Deadline Exceeded
Additional GRPC error information from remote target unknown_target_for_coordination_leader:
:{"created":"@1676614104.084285325","description":"Error received from peer ipv4:26.0.134.228:64719","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Deadline Exceeded","grpc_status":4}
2023-02-17 06:08:24.405270: E external/org_tensorflow/tensorflow/tsl/distributed_runtime/coordination/coordination_service.cc:1129] Shutdown barrier in coordination service has failed: DEADLINE_EXCEEDED: Barrier timed out. Barrier_id: Shutdown::15630121101087999007 [type.googleapis.com/tensorflow.CoordinationServiceError='']. This suggests that at least one worker did not complete its job, or was too slow/hanging in its execution.
2023-02-17 06:08:24.405311: E external/org_tensorflow/tensorflow/tsl/distributed_runtime/coordination/coordination_service.cc:731] INTERNAL: Shutdown barrier has been passed with status: 'DEADLINE_EXCEEDED: Barrier timed out. Barrier_id: Shutdown::15630121101087999007 [type.googleapis.com/tensorflow.CoordinationServiceError='']', but this task is not at the barrier yet. [type.googleapis.com/tensorflow.CoordinationServiceError='']
2023-02-17 06:08:24.405379: E external/org_tensorflow/tensorflow/tsl/distributed_runtime/coordination/coordination_service.cc:449] Stopping coordination service as shutdown barrier timed out and there is no service-to-client connection.
2023-02-17 06:08:53.225236: E external/org_tensorflow/tensorflow/tsl/distributed_runtime/coordination/coordination_service_agent.cc:711] Coordination agent is in ERROR: INVALID_ARGUMENT: Unexpected task request with task_name=/job:jax_worker/replica:0/task:0
Additional GRPC error information from remote target unknown_target_for_coordination_leader:
:{"created":"@1676614133.225164768","description":"Error received from peer ipv4:26.0.134.228:64719","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unexpected task request with task_name=/job:jax_worker/replica:0/task:0","grpc_status":3} [type.googleapis.com/tensorflow.CoordinationServiceError='']
2023-02-17 06:08:53.225277: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/distributed/client.cc:452] Coordination service agent in error status: INVALID_ARGUMENT: Unexpected task request with task_name=/job:jax_worker/replica:0/task:0
Additional GRPC error information from remote target unknown_target_for_coordination_leader:
:{"created":"@1676614133.225164768","description":"Error received from peer ipv4:26.0.134.228:64719","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unexpected task request with task_name=/job:jax_worker/replica:0/task:0","grpc_status":3} [type.googleapis.com/tensorflow.CoordinationServiceError='']
2023-02-17 06:08:53.226009: F external/org_tensorflow/tensorflow/compiler/xla/pjrt/distributed/client.h:75] Terminating process because the coordinator detected missing heartbeats. This most likely indicates that another task died; see the other task logs for more details. Status: INVALID_ARGUMENT: Unexpected task request with task_name=/job:jax_worker/replica:0/task:0
Additional GRPC error information from remote target unknown_target_for_coordination_leader:
:{"created":"@1676614133.225164768","description":"Error received from peer ipv4:26.0.134.228:64719","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unexpected task request with task_name=/job:jax_worker/replica:0/task:0","grpc_status":3} [type.googleapis.com/tensorflow.CoordinationServiceError='']
srun: error: ip-26-0-134-228: task 0: Aborted
```
Closed in favor of https://github.com/vwxyzjn/cleanba
Description
Todo items:

- `SPS_update` calculation for the actor ✅ helps!
- `jnp.array_split` within JIT 🤔 same speed!
- `async_update` is off (https://wandb.ai/costa-huang/cleanRL/reports/4-vs-3-learner-devices--VmlldzozNDg4MDE5)

More experiments

- hypothesis: if there is an `update` function that runs on GPU1, then the execution of `update` will block the `jax.device_put_sharded` call in a separate thread that tries to put data from GPU0 to GPU1. Not sure if this is the case for TPU as well.

Types of changes
Checklist:

- [ ] I have ensured `pre-commit run --all-files` passes (required).
- [ ] I have updated the documentation and previewed the changes via `mkdocs serve`.

If you are adding new algorithm variants or your change could result in a performance difference, you may need to (re-)run tracked experiments. See https://github.com/vwxyzjn/cleanrl/pull/137 as an example PR.

- [ ] I have run benchmark experiments with the `--capture-video` flag toggled on (required).
- [ ] I have added documentation and previewed the changes via `mkdocs serve`.