Maybe this helps. From the docs (for context):
# === Settings for Rollout Worker processes ===
# Number of rollout worker actors to create for parallel sampling. Setting
# this to 0 will force rollouts to be done in the trainer actor.
"num_workers": 2,
# Number of environments to evaluate vectorwise per worker. This enables
# model inference batching, which can improve performance for inference
# bottlenecked workloads.
"num_envs_per_worker": 1,
And more specifically to your question (Q1):
# Divide episodes into fragments of this many steps each during rollouts.
# Sample batches of this size are collected from rollout workers and
# combined into a larger batch of `train_batch_size` for learning.
#
# For example, given rollout_fragment_length=100 and train_batch_size=1000:
# 1. RLlib collects 10 fragments of 100 steps each from rollout workers.
# 2. These fragments are concatenated and we perform an epoch of SGD.
#
# When using multiple envs per worker, the fragment size is multiplied by
# `num_envs_per_worker`. This is since we are collecting steps from
# multiple envs in parallel. For example, if num_envs_per_worker=5, then
# rollout workers will return experiences in chunks of 5*100 = 500 steps.
#
# The dataflow here can vary per algorithm. For example, PPO further
# divides the train batch into minibatches for multi-epoch SGD.
"rollout_fragment_length": 200,
https://docs.ray.io/en/latest/rllib-training.html#common-parameters
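For reference, here is a minimal sketch of how those keys are typically passed in with the older config-dict API (the env name, stop condition, and numeric values below are just placeholders, not recommendations, and the exact API may differ by RLlib version):

import ray
from ray import tune

ray.init()

# Minimal sketch: the rollout-related keys quoted above, passed through a
# standard RLlib config dict and launched via Tune.
tune.run(
    "PPO",
    config={
        "env": "CartPole-v0",             # placeholder environment
        "num_workers": 2,                 # rollout worker actors for parallel sampling
        "num_envs_per_worker": 1,         # vectorized envs per worker (inference batching)
        "rollout_fragment_length": 200,   # steps per sampled fragment
        "train_batch_size": 4000,         # fragments are concatenated up to this size
    },
    stop={"training_iteration": 1},       # placeholder stop condition
)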
@LecJackS Thanks for pointing me in the right direction!
A few follow-up questions, since I don't fully understand the comments:
If my training is not bottlenecked by inference, is it best to leave num_envs_per_worker at 1 and just increase num_workers to scale out?
Here's what the various times look like for my run. Does it look like we should increase num_envs_per_worker to maybe 2, or is it still not really bottlenecked by inference?
sampler_perf:
mean_env_wait_ms: 10.104230452352352
mean_inference_ms: 17.94720548711288
mean_processing_ms: 0.15161290206922245
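As a rough sanity check on those numbers (just a back-of-the-envelope heuristic, not official RLlib guidance), the share of per-step time spent in policy inference can be computed directly from sampler_perf:

# Rough heuristic using the sampler_perf numbers above: what fraction of the
# per-step time is spent in policy inference? If inference dominates,
# increasing num_envs_per_worker (inference batching) is more likely to help.
env_wait_ms = 10.104230452352352
inference_ms = 17.94720548711288
processing_ms = 0.15161290206922245

total_ms = env_wait_ms + inference_ms + processing_ms
print(f"inference share: {inference_ms / total_ms:.1%}")  # ~63.6% of per-step time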
Is it accurate to state that num_workers determines the number of rollout workers, each of which interacts with its own instance of the environment by running inference on a local copy of the policy, gathering batches of experiences in parallel? These experiences are then sent back to the main Python process every N steps, and that process performs the policy update based on the latest batch of experiences collected from all the workers?
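For what it's worth, here is a toy, self-contained sketch of that reading of the dataflow. The ToyWorker/ToyTrainer names are made up for illustration; this is not RLlib's actual implementation, just the loop described above written out:

import numpy as np

class ToyWorker:
    """Stands in for one rollout worker: its own env instance plus a local policy copy."""
    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)
        self.weights = None

    def set_weights(self, weights):
        self.weights = weights            # sync the local policy copy from the trainer

    def sample(self, fragment_length):
        # Pretend each env step produces one experience row of 4 features.
        return self.rng.normal(size=(fragment_length, 4))

class ToyTrainer:
    """Stands in for the trainer process that owns the learned policy."""
    def __init__(self, num_workers):
        self.weights = np.zeros(4)
        self.workers = [ToyWorker(seed=i) for i in range(num_workers)]

    def training_iteration(self, rollout_fragment_length):
        fragments = []
        for w in self.workers:            # RLlib runs these as parallel Ray actors
            w.set_weights(self.weights)
            fragments.append(w.sample(rollout_fragment_length))
        train_batch = np.concatenate(fragments)           # combined batch for the update
        self.weights += 0.01 * train_batch.mean(axis=0)   # stand-in "policy update"
        return train_batch.shape[0]

trainer = ToyTrainer(num_workers=2)
print(trainer.training_iteration(rollout_fragment_length=100))  # 200 steps collected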
@LecJackS @richardliaw Wanted to follow up on this. Say rollout_fragment_length=100 and train_batch_size=1000, and consider two scenarios: first, num_workers=5 with num_envs_per_worker=1; second, num_workers=1 with num_envs_per_worker=1. How does RLlib behave in each case? In the first scenario, does RLlib collect 2 fragments of 100 steps from each of the 5 workers and concatenate them together? Or does each worker collect 10 fragments of 100 steps, giving a train batch of 1000 per worker, i.e. 1000 × 5 in total? In the second scenario, does RLlib collect 10 fragments of 100 steps from the single worker and concatenate them into a batch of 1000?
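To spell out the arithmetic behind those two scenarios, here is a small sketch based only on the docs comment quoted earlier. It assumes, as one possible reading, that fragments from all workers are pooled into a single train batch of train_batch_size steps; whether that reading is correct is exactly what this question is asking, so treat the per-worker numbers as an illustration of the quoted formula, not a definitive answer:

# Arithmetic sketch based on the quoted docs comment:
# steps per returned fragment = rollout_fragment_length * num_envs_per_worker,
# and fragments are concatenated until train_batch_size is reached.
def fragments_needed(rollout_fragment_length, num_envs_per_worker,
                     num_workers, train_batch_size):
    steps_per_fragment = rollout_fragment_length * num_envs_per_worker
    total_fragments = train_batch_size // steps_per_fragment
    fragments_per_worker = total_fragments / num_workers
    return steps_per_fragment, total_fragments, fragments_per_worker

# Scenario 1: 5 workers -> 10 fragments of 100 steps in total, i.e. 2 per worker.
print(fragments_needed(100, 1, 5, 1000))   # (100, 10, 2.0)
# Scenario 2: 1 worker -> all 10 fragments of 100 steps come from that worker.
print(fragments_needed(100, 1, 1, 1000))   # (100, 10, 10.0)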
Hi, I'm a bot from the Ray team :)
To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.
If there is no further activity in the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public slack channel.
Hi again! This issue is being closed because there has been no further activity in the 14 days since the last message.
Please feel free to reopen or open a new issue if you'd still like it to be addressed.
Again, you can always ask for help on our discussion forum or Ray's public slack channel.
Thanks again for opening the issue!
What is your question?
I am trying to speed up the training of 1 agent by using parallelism.
If you train a DQN agent using multiple workers with the same agent params (epsilon, gamma, etc.) and without setting any seed value, each worker will take different actions during exploration given the same environment state and will also recall different experiences from the replay memory.
Q1: How then does the agent policy get updated when there are multiple workers training at the same time?
Q2: Additionally, if a seed value is set, will training with multiple workers (same agent config, same environment) learn at the same rate as training with just 1 worker? Since each worker has the same agent params (epsilon, gamma, etc.), they will perform the same exploration actions and recall the same experiences from the replay memory.
Thank you!