ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[rllib] Inconsistent batch size and training slowdown #8618

Closed roireshef closed 4 years ago

roireshef commented 4 years ago

What is the problem?

There are two issues I'm seeing while trying to migrate from Ray 0.7.3 to Ray 0.9.0dev:

  1. Using the following tune config:
    config.update({
        "use_pytorch": True,
        "num_workers": 10,
        "num_envs_per_worker": 5,
        "batch_mode": "complete_episodes",
        "rollout_fragment_length": 100,
        "train_batch_size": 5000,
    })

    I'm sometimes getting training batches of size 500, sometimes 5000, and sometimes far bigger. See these console result snapshots, for example:

== Status == (consecutive console snapshots; memory usage on this node ~8.8-9.0/62.6 GiB; FIFO scheduling; resources requested 11/12 CPUs, 0/1 GPUs; result logdir /home/kz430x/ray_results/841d9a36-9f61-11ea-b4a3-0242ac110002; 10 trials: 9 PENDING, 1 RUNNING)

Running trial A3C_FourLaneLCFTREnv_84276_00000 (172.17.0.2:6603, gamma=0.975, lr=2e-05, model/custom_options/architecture=VOLVONET_V3) across the snapshots:

| iter | total time (s) | ts    | reward   |
|------|----------------|-------|----------|
| 1    | 109.081        | 519   | 0.181818 |
| 2    | 144.096        | 1052  | 0.173333 |
| 3    | 161.579        | 1581  | 0.16     |
| 5    | 245.442        | 10712 | 0.15     |
| 7    | 301.507        | 16688 | 0.14     |
| 8    | 332.717        | 17237 | 0.15     |

(The remaining 9 trials, PENDING with VOLVONET_V3-V7 architectures, are identical in every snapshot and omitted here.)

  2. Training with the same environment, same RLlib trainer, and same model takes roughly 50% more wall-clock time (i.e., a ~33% drop in throughput) compared to Ray 0.7.3.

Ray is on 0.9.0dev

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

If we cannot run your script, we cannot fix your issue.

roireshef commented 4 years ago

Same slowdown encountered when running Ray v0.8.5 (Pytorch, A3C) - again, compared to v0.7.3.

roireshef commented 4 years ago

@ericl / @richardliaw - Do you guys have any idea about the root cause? Anything I can do to help? This is currently a blocker for me for migrating from v0.7.3, and the newer version(s) have so much I want to try already :)

suquark commented 4 years ago

I think the performance issue could also relate to https://github.com/ray-project/ray/issues/6551 and https://github.com/ray-project/ray/issues/5856

ericl commented 4 years ago

@roireshef does this issue smooth out if you increase min_iter_time_s to some higher value such as 30s? I think it is just a reporting artifact due to the way A3C runs asynchronously (which has slightly changed in the newest version).

Btw, is there a reason to use A3C? Afaik it's strictly worse than other algorithms such as PPO or even A2C.
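
For concreteness, that suggestion amounts to something like this, on top of the config from the original report:

    config.update({
        # Aggregate at least ~30s of async A3C work into each reported
        # "iteration" so the per-iteration step counts smooth out.
        "min_iter_time_s": 30,
    })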

roireshef commented 4 years ago

@ericl so far I've been setting "min_iter_time_s" to a very low value, basically to disable this mechanism, because under different setups (say, different simulation step durations) it produces totally different batch sizes. I've come to realize batch sizes really make a difference, so I believe it is very important to be able to control them explicitly and lock them at a static value.

Do you think this is the only way to guarantee stable training (in terms of batch sizes)? I mean, wouldn't you expect batch sizes to be consistent throughout training, given all the other configuration parameters like sample_batch_size and grads_per_step? I'd really rather not use it unless I must.

Why did I choose A3C over other methods? For several reasons:

  1. I'm using PyTorch and would like to keep using it rather than TensorFlow. When I first began using RLlib, A3C/A2C and PPO were the only PyTorch implementations that fit my use case.
  2. Both vanilla PPO and A2C showed inferior performance compared to the async A3C version, at least in wall-clock time, on several of my environments. My use case is a physical simulator (FLOW+SUMO) wrapped with substantial legacy code, so the difference between running sync and async was huge and basically took A2C off the table. Using a PPO-style loss for the A3C actor might be a better solution, but I believe having a critic is essential.
  3. For future tasks, I was planning to use the value estimate in a novel way (at inference), so I restricted myself to solutions with a critic.
  4. I planned on migrating to IMPALA to scale training out to a full cluster (one training session) for speed. I was actually waiting for IMPALA to get a PyTorch version, which now seems to be the case in 0.9, so I'm really waiting for this issue to be resolved before trying it.

So what do you suggest doing, given all the above?

roireshef commented 4 years ago

BTW, @ericl the issue is not just inconsistent batch sizes but also a dramatic slowdown: it takes roughly 50% more time for the same number of steps trained, using the same environment implementation. As described above, this is when comparing 0.7.3 (fast) to 0.9dev (50% slower). Version 0.8.5 shows the same slowdown. I haven't tried other versions.

ericl commented 4 years ago

Maybe set min_iter_time_s to something like 15-30 seconds, or high enough that things smooth out. To be clear, the batch size is always a fixed size; it's just the reporting you're seeing that changes. Steps per iter isn't really an important metric anyway; instead you can look at num_timesteps_total. An iter is just the metric reporting period.
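
If it helps, a small sketch of what to watch instead of steps/iter (assumes the 0.8.x-era API: A3CTrainer and the "use_pytorch" flag; the cumulative counter appears as "timesteps_total" in each result dict):

    import ray
    from ray.rllib.agents.a3c import A3CTrainer

    ray.init()
    trainer = A3CTrainer(
        env="CartPole-v0",
        config={"use_pytorch": True, "num_workers": 2},
    )
    prev = 0
    for _ in range(5):
        result = trainer.train()
        delta = result["timesteps_total"] - prev
        prev = result["timesteps_total"]
        # The per-iteration delta tracks the reporting period (min_iter_time_s),
        # not the size of each gradient update.
        print(result["training_iteration"], result["timesteps_total"], delta)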


roireshef commented 4 years ago

So just to make sure I'm getting this right: min_iter_time_s doesn't affect batch sizes at all?

I thought reporting back to the user is handled at the master process (specifically in the optimizer class) whenever a batch finishes processing. Are you saying min_iter_time_s is only consumed in the optimizer, purely for reporting?

ericl commented 4 years ago

That's right. It's completely reporting only. Batch size is determined by rollout frag length / train batch size.


roireshef commented 4 years ago

> BTW, @ericl the issue is not just inconsistent batch sizes but also a dramatic slowdown: it takes roughly 50% more time for the same number of steps trained, using the same environment implementation. As described above, this is when comparing 0.7.3 (fast) to 0.9dev (50% slower). Version 0.8.5 shows the same slowdown. I haven't tried other versions.

Ok thanks! What about this one?

roireshef commented 4 years ago

> I think the performance issue could also relate to #6551 and #5856

@suquark Those issues still seem to be open. Did you guys figure out a solution for the performance regression? I'm seeing slowdowns deeper than 20%, more like 40-50% (with the configuration above), so for now I'm staying on 0.7.3.

ericl commented 4 years ago

Is the slowdown reproducible with a benchmark env? Would be great if you could attach reproduction commands with say Pong and reported speeds.

roireshef commented 4 years ago

@ericl unfortunately I can't share the environment implementation I'm working on, as it is confidential. I haven't tried to reproduce with Pong because its infrastructure is very different from FLOW+SUMO. That said, there are trivial example environments already baked into the FLOW package (https://github.com/flow-project/flow/tree/master/flow/envs) that you can use to reproduce the same issue. I'm fairly sure that if you run a FLOW-based environment (with SUMO as the underlying simulator) using A3C as the RL algorithm, the same issue will reproduce. I've attached a shortened version of the configuration I've been using for my A3C experiments above. If you need anything else, please let me know.
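
If a mock-environment stand-in helps, this is roughly the launch I use, with CartPole swapped in for the confidential environment (a sketch of the setup, not a verified reproduction of the slowdown):

    import ray
    from ray import tune

    ray.init()
    tune.run(
        "A3C",
        stop={"timesteps_total": 100000},
        config={
            "env": "CartPole-v0",        # stand-in for the FLOW/SUMO environment
            "use_pytorch": True,
            "num_workers": 10,
            "num_envs_per_worker": 5,
            "batch_mode": "complete_episodes",
            "rollout_fragment_length": 100,
            "train_batch_size": 5000,
            "min_iter_time_s": 0,        # effectively disabled, as in my runs
        },
    )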

roireshef commented 4 years ago

> That's right. It's completely reporting only. Batch size is determined by rollout frag length / train batch size.

@ericl - could you please tell me what I am missing here? I'm fixing all other config parameters and setting min_iter_time_s first to 10 sec, then to 20 sec. The reported batch sizes (timesteps accumulated since the last "iteration") double. I dove into your code, and this seems to be the chain of events:

  1. Trainable.train() calls self.step() and then increments and reports self._iteration.
  2. For A3C, Trainable.step() is implemented in a3c.py as calling AsyncGradients() and then ApplyGradients().
  3. These two ops compute and gather the gradients and then update the weights remotely, which is what A3C is indeed supposed to do each iteration.

Since (3) happens once per reported iteration, and the accumulated timesteps double when min_iter_time_s is doubled, it seems min_iter_time_s directly controls the batch size per iteration.

Are you sure your last answer is correct? If yes, what am I missing?

It is essential for the user to be able to control the batch size explicitly. I just can't understand how to do that, or why a time-based mechanism overrides the step-based specification that was previously available to the user - until now, batch sizes were determined by: grads_per_step * train_batch_size * num_envs_per_worker.

I understand train_batch_size has been replaced by rollout_fragment_length, but the product above (with that replacement) no longer corresponds to the effective batch size. Any idea why?

ericl commented 4 years ago

Min iter time has nothing to do with the batch size under the hood, it's just the metrics reporting period. Many gradient computations and applications are happening each iteration.
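
For reference, roughly how the A3C wiring looks (a paraphrased sketch, not the verbatim RLlib source; module paths follow the 0.8.x-era layout and may differ):

    from ray.rllib.execution.rollout_ops import AsyncGradients
    from ray.rllib.execution.train_ops import ApplyGradients
    from ray.rllib.execution.metric_ops import StandardMetricsReporting

    def a3c_style_execution_plan(workers, config):
        # Async stream of (gradient, sample_count) pairs, one per rollout fragment
        grads = AsyncGradients(workers)
        # Each gradient is applied as soon as it arrives; only the worker that
        # produced it gets the refreshed weights
        train_op = grads.for_each(ApplyGradients(workers, update_all=False))
        # min_iter_time_s / timesteps_per_iteration only shape how this stream
        # is grouped into reported "iterations"
        return StandardMetricsReporting(train_op, workers, config)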

roireshef commented 4 years ago

@ericl thanks Eric, so how can I tell what the effective batch size is for each gradient update in A3C?

ericl commented 4 years ago

It should be just rollout fragment length.
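
As a rough back-of-the-envelope under the config in the original report (assuming fragments scale with num_envs_per_worker; batch_mode="complete_episodes" can push individual fragments above this, since episodes run to completion):

    rollout_fragment_length = 100
    num_envs_per_worker = 5
    approx_steps_per_gradient = rollout_fragment_length * num_envs_per_worker
    print(approx_steps_per_gradient)  # ~500 steps per async gradient update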


roireshef commented 4 years ago

Hi @ericl, following some deeper investigation and an offline discussion with @sven1977, I now have a better understanding of the following:

In 0.7.x it was logically doing:
- Remote: async sampling of steps + batching of "fragments" + computing a single gradient per fragment
- Local: wait for grads_per_step gradients and only then update the weights and publish them to all remote workers

In 0.8.7, A3C is implemented as follows:
- Remote: the same
- Local: every gradient that arrives triggers an update of the local network weights, and only the remote worker that computed and sent back that gradient gets the new weights.

So to summarize my remaining questions:

roireshef commented 4 years ago

OK, new insights from playing around the whole day (for the sake of whoever is going to read this):

  1. Regarding my earlier claim:

     > In 0.7.x it was logically doing:
     > Remote: async sampling of steps + batching of "fragments" + computing a single gradient per fragment
     > Local: wait for grads_per_step gradients and only then update weights and publish to all remote workers

     The above is a complete mistake. AsyncGradientsOptimizer was never implemented to accumulate gradients and only then apply them as a batch to the local worker's model. It was actually doing the same thing I described above for 0.8.x, so the logic hasn't changed! Sorry for all the mess.

  2. I've implemented such a version in 0.8.7, where the local worker accumulates (sums) gradients until a "full batch" has accumulated and only then applies them and publishes the new weights to all workers (see the sketch at the end of this comment). Running it on a single environment implementation, I actually observed a decrease in performance (learning was less robust).

  3. As stated previously, in the current implementation of A3C, gradients are applied one rollout fragment at a time. Given this, an "iteration" is a reporting-only concept. The iteration length can be controlled via min_iter_time_s (a time cap) and via timesteps_per_iteration (a sim-step cap), which is what I was originally looking for before I understood what async gradients mean here...

  4. Regarding my original report that training with the same environment, same RLlib trainer, and same model takes roughly 50% more time (~33% lower throughput) than on Ray 0.7.3:

My slowdowns in 0.8.x turn out to be a consequence of not being able to run with sample_async=True when using PyTorch. This is a result of the migration to ModelV2, which decoupled the value function's output from .forward(). It seems like quite a loss for PyTorch users, since PyTorch is already thread-safe: if forward() had also returned the value function estimates, sample_async could have been set to True for PyTorch as well.

I'm going to close this issue and open a new one to discuss enabling sample_async for PyTorch.
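
For reference, a rough sketch of the "accumulate then apply" variant from point 2 (a paraphrased pseudo-implementation, not the exact code I ran; names like gradient_stream are illustrative):

    # Sum incoming worker gradients until a full batch worth of timesteps has
    # arrived, then apply them once locally and broadcast the new weights.
    def accumulate_and_apply(gradient_stream, local_policy, remote_workers,
                             train_batch_size):
        acc, count = None, 0
        for grad, num_steps in gradient_stream:  # async stream of (grads, step count)
            acc = grad if acc is None else [a + g for a, g in zip(acc, grad)]
            count += num_steps
            if count >= train_batch_size:
                local_policy.apply_gradients(acc)      # one batched update
                weights = local_policy.get_weights()
                for worker in remote_workers:          # publish to all workers
                    worker.set_weights.remote(weights)
                acc, count = None, 0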

sven1977 commented 4 years ago

This is great! Thanks so much for digging in and getting to the bottom of this. Fixing the async race condition for forward + value calls in PyTorch should not be too hard. We could simply have a forward_and_value method (or similar) that returns everything in one call, for use by async algos.
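
A hypothetical illustration of that idea (not an existing RLlib API; the class, layer sizes, and the forward_and_value name are all made up for the sketch):

    import torch
    import torch.nn as nn
    from ray.rllib.models.torch.torch_modelv2 import TorchModelV2

    class MyTorchModel(TorchModelV2, nn.Module):
        def __init__(self, obs_space, action_space, num_outputs, model_config, name):
            TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                                  model_config, name)
            nn.Module.__init__(self)
            self.trunk = nn.Linear(obs_space.shape[0], 64)
            self.logits = nn.Linear(64, num_outputs)
            self.value_head = nn.Linear(64, 1)
            self._value_out = None

        def forward(self, input_dict, state, seq_lens):
            features = torch.relu(self.trunk(input_dict["obs"].float()))
            self._value_out = self.value_head(features).squeeze(1)
            return self.logits(features), state

        def value_function(self):
            return self._value_out

        def forward_and_value(self, input_dict, state, seq_lens):
            # Hypothetical combined call: logits, value estimate, and state in
            # one pass, so async sampler threads don't rely on shared state
            # between forward() and value_function().
            logits, state_out = self.forward(input_dict, state, seq_lens)
            return logits, self.value_function(), state_out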