ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

In A3C, how to understand the relationship among timesteps_this_iter, grads_per_steps, batch_size and num_of_workers? #2301

Closed luochao1024 closed 6 years ago

luochao1024 commented 6 years ago

System information

Describe the problem

I ran the A3C example and set num_workers = 16, batch_size = 40, grads_per_step = 200. I thought that on average timesteps_this_iter should be about 200*40 = 8000. However, based on the TrainingResult printed below, the average timesteps_this_iter is 1941831/82 ≈ 23681. I am quite confused about how timesteps_this_iter is computed, even though I have read the collect_metrics function. Would someone mind providing an intuition about it?
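
For reference, the arithmetic behind that expectation versus the observed average (plain Python, using only the numbers above):

grads_per_step = 200
batch_size = 40
expected = grads_per_step * batch_size    # 8000 timesteps per iteration, if each gradient uses exactly one batch
observed = 1941831 / 82                   # ~23681 timesteps per iteration on average
print(expected, round(observed))          # 8000 23681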

Source code / logs

result: TrainingResult(timesteps_total=1941831, done=None, info=None, episode_reward_mean=-19.42105263157895,
    episode_reward_min=-21.0, episode_reward_max=-15.0, episode_len_mean=1420.8947368421052, episodes_total=19,
    mean_accuracy=None, mean_validation_accuracy=None, mean_loss=None, neg_mean_loss=None,
    experiment_id='0530875ffb3b4a5fbf3668c026892d33', training_iteration=82, timesteps_this_iter=26997,
    time_this_iter_s=16.067451000213623, time_total_s=1361.165685892105, pid=9017, date='2018-06-25_18-48-59',
    timestamp=1529952539, hostname='test-16', node_ip='10.128.0.4',
    config={'summarize': False, 'model': {'grayscale': True, 'dim': 42, 'channel_major': False, 'zero_mean': False},
    'use_pytorch': False, 'env': 'PongDeterministic-v4', 'env_config': {}, 'entropy_coeff': -0.01,
    'vf_loss_coeff': 0.5, 'gamma': 0.99, 'use_lstm': True, 'use_gpu_for_workers': False, 'batch_size': 40,
    'lambda': 1.0, 'reward_filter': 'NoFilter', 'lr': 0.0001, 'num_envs': 1, 'grad_clip': 40.0,
    'observation_filter': 'NoFilter', 'optimizer': {'grads_per_step': 200}, 'num_workers': 16})
ericl commented 6 years ago

I think this is due to the "auto-concat" code here, which dynamically batches up to 5 sample batches together: https://github.com/ray-project/ray/blob/master/python/ray/rllib/utils/sampler.py

If that is disabled, then you should see 8000. IMO we should remove this behavior; I doubt it has advantages over just increasing the batch size. @richardliaw
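
A toy sketch of the auto-concat idea described above (not the actual RLlib sampler code): the consumer takes one sample batch from the queue and then opportunistically concatenates up to four more batches that are already waiting, so a single gradient step can consume anywhere between batch_size and 5 * batch_size timesteps.

import queue

def get_data(sample_queue, max_concat=5):
    # Take one batch, then grab any batches already waiting, up to max_concat total.
    batches = [sample_queue.get()]
    while len(batches) < max_concat and not sample_queue.empty():
        batches.append(sample_queue.get_nowait())
    # Here a "batch" is just a list of timesteps; RLlib concatenates SampleBatch objects.
    return [t for b in batches for t in b]

q = queue.Queue(5)
for _ in range(3):          # suppose three batches of batch_size=40 are already queued
    q.put([0] * 40)
print(len(get_data(q)))     # 120: one call may return anywhere from 40 to 200 timesteps

With 16 workers producing batches faster than the learner consumes them, the queue is usually full, which would explain why the observed average (~23681) lands well above grads_per_step * batch_size = 8000.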

richardliaw commented 6 years ago

I've definitely run experiments in the past where I turned off this auto-concat with varying batch sizes and saw degraded performance across all sizes ...

luochao1024 commented 6 years ago

I see. Thank you

luochao1024 commented 6 years ago

@richardliaw you said that you saw degraded performance after turning off the auto-concat. Do you have any intuition about why it happens? Thanks

luochao1024 commented 6 years ago

@ericl @richardliaw I am trying to turn off the auto-concat you mentioned. Would you mind providing some hints about how to turn it off? I want to have the same number of timesteps each iteration to ensure the same gradient noise, which would make comparisons easier in research. Thanks

richardliaw commented 6 years ago

The fastest way is to change self.queue = queue.Queue(5) to self.queue = queue.Queue(1) in https://github.com/ray-project/ray/blob/8e687cbc9838fce0b1d8190c7851b7d573562487/python/ray/rllib/evaluation/sampler.py#L79


luochao1024 commented 6 years ago

@richardliaw What about the SyncSampler? It seems like the SyncSampler also has the auto-concat, as stated in its comment.

luochao1024 commented 6 years ago

@richardliaw I tried the trick you suggested, changing self.queue = queue.Queue(5) to self.queue = queue.Queue(1). The command I ran is python3 train.py --env=Pong-ram-v4 --run=A3C --config='{"num_workers": 15}'. Before the change, timesteps_this_iter was around 5000-7000. After the change, timesteps_this_iter is around 1000-2000, and sometimes it is 0.0. How should I understand timesteps_this_iter being 0.0? Is it possible to get exactly the same timesteps_this_iter every iteration?

TrainingResult for A3C_Pong-ram-v4_0:
  date: 2018-07-05_03-57-48
  episode_len_mean: null
  episode_reward_max: null
  episode_reward_mean: null
  episode_reward_min: null
  episodes_total: 0
  experiment_id: 42fde5830bb2405598d68d6b2e7203ac
  hostname: t-16
  info:
    apply_time_ms: 0.869
    dispatch_time_ms: 2.401
    num_steps_sampled: 353000
    num_steps_trained: 353000
    wait_time_ms: 0.311
  node_ip: 10.128.0.9
  pid: 24848
  policy_reward_mean: {}
  time_this_iter_s: 0.3902287483215332
  time_total_s: 152.18120670318604
  timestamp: 1530763068
  timesteps_this_iter: 0.0
  timesteps_total: 343488.0
  training_iteration: 353
ericl commented 6 years ago

Hm, I would keep the Queue but just remove the while loop here: https://github.com/ray-project/ray/blob/8e687cbc9838fce0b1d8190c7851b7d573562487/python/ray/rllib/evaluation/sampler.py#L124

You could alternatively set "sample_async": False in the A3C config, which will force synchronous sampling: https://github.com/ray-project/ray/blob/8aa56c12e60ba77325f6b3817ee5f0d8e1ed1a16/python/ray/rllib/agents/a3c/a3c.py#L35

luochao1024 commented 6 years ago

@ericl @richardliaw The comment in SyncSampler here https://github.com/ray-project/ray/blob/8e687cbc9838fce0b1d8190c7851b7d573562487/python/ray/rllib/evaluation/sampler.py#L27 also says that "Batches can accumulate and the gradient can be calculated on up to 5 batches."

Is that comment just there by accident?

richardliaw commented 6 years ago

Ah I'm pretty sure the SyncSampler comment is there by accident..


luochao1024 commented 6 years ago

@ericl @richardliaw I set "sample_async": False, and the script I use is

import ray
import ray.rllib.agents.a3c.a3c as a3c

ray.init()
config = {"num_workers": 16, "sample_async": False}
agent = a3c.A3CAgent(config=config, env="PongDeterministic-v4")

for i in range(100):
    result = agent.train()
    print("training_iteration", result.training_iteration)
    print("timesteps this iter", result.timesteps_this_iter)
    print("timesteps_total", result.timesteps_total)
    print("time_this_iter_s", result.time_this_iter_s)
    print("time_total_s", result.time_total_s)
    print("episode_reward_mean", result.episode_reward_mean)
    print()

I also added two print statements to the _train function in a3c.py, as follows:

def _train(self):
    print("num_workers is", self.config["num_workers"])
    print("async_sample", self.config["sample_async"])
    self.optimizer.step()
    FilterManager.synchronize(
        self.local_evaluator.filters, self.remote_evaluators)
    result = collect_metrics(self.local_evaluator, self.remote_evaluators)
    result = result._replace(
        info=self.optimizer.stats())
    return result

The timesteps_this_iter result is still weird. It is different for each iteration, and sometimes it is 0.0. Would you mind providing a script where timesteps_this_iter is exactly the same for each iteration?

training_iteration 12
timesteps this iter 838
timesteps_total 838.0
time_this_iter_s 2.8449437618255615
time_total_s 43.85523295402527
episode_reward_mean -20.0

num_workers is 16
async_sample False
training_iteration 13
timesteps this iter 1784
timesteps_total 2622.0
time_this_iter_s 2.8498880863189697
time_total_s 46.70512104034424
episode_reward_mean -21.0

num_workers is 16
async_sample False
training_iteration 14
timesteps this iter 4925
timesteps_total 7547.0
time_this_iter_s 2.831050157546997
time_total_s 49.536171197891235
episode_reward_mean -20.833333333333332

num_workers is 16
async_sample False
training_iteration 15
timesteps this iter 5192
timesteps_total 12739.0
time_this_iter_s 2.861128568649292
time_total_s 52.39729976654053
episode_reward_mean -20.333333333333332

num_workers is 16
async_sample False
training_iteration 16
timesteps this iter 0.0
timesteps_total 12739.0
time_this_iter_s 2.8732974529266357
time_total_s 55.27059721946716
episode_reward_mean nan

num_workers is 16
async_sample False
training_iteration 17
timesteps this iter 966
timesteps_total 13705.0
time_this_iter_s 2.8986072540283203
time_total_s 58.16920447349548
episode_reward_mean -20.0
luochao1024 commented 6 years ago

@ericl @richardliaw I really can't understand why timesteps_this_iter is sometimes 0.0. Does it mean that there is no learning at that iteration? Or is it just a bug?

richardliaw commented 6 years ago

Sorry for the slow reply - that seems odd, and perhaps is just a bug in the bookkeeping/tracking.

Maybe consider setting num_workers to 1 to help debug with an interactive debugger? Let me know if you find something, I will have time to take a look at it tomorrow.

luochao1024 commented 6 years ago

@richardliaw I set num_workers to 1 and timesteps_this_iter still changes from iteration to iteration. Sometimes it prints a RuntimeWarning: Mean of empty slice. warning, and then timesteps_this_iter becomes 0.0 and episode_reward_mean is nan. Here is the script to reproduce:

import ray
import ray.rllib.agents.a3c.a3c as a3c

ray.init()
config = {"num_workers": 1, "sample_async": False}
agent = a3c.A3CAgent(config=config, env="PongDeterministic-v4")

for i in range(100):
    result = agent.train()
    print("training_iteration", result.training_iteration)
    print("timesteps this iter", result.timesteps_this_iter)
    print("timesteps_total", result.timesteps_total)
    print("time_this_iter_s", result.time_this_iter_s)
    print("time_total_s", result.time_total_s)
    print("episode_reward_mean", result.episode_reward_mean)
    print()

Here is the result:

training_iteration 29
timesteps this iter 886
timesteps_total 28358
time_this_iter_s 13.195953130722046
time_total_s 385.9821493625641
episode_reward_mean -21.0

training_iteration 30
timesteps this iter 824
timesteps_total 29182
time_this_iter_s 13.376095294952393
time_total_s 399.3582446575165
episode_reward_mean -21.0

training_iteration 31
timesteps this iter 1763
timesteps_total 30945
time_this_iter_s 12.403135538101196
time_total_s 411.7613801956177
episode_reward_mean -20.0
/home/wangluochao93/.local/lib/python3.5/site-packages/numpy/core/fromnumeric.py:2957: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/home/wangluochao93/.local/lib/python3.5/site-packages/numpy/core/_methods.py:80: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)

training_iteration 32
timesteps this iter 0.0
timesteps_total 30945.0
time_this_iter_s 13.460538864135742
time_total_s 425.2219190597534
episode_reward_mean nan

training_iteration 33
timesteps this iter 2026
timesteps_total 32971.0
time_this_iter_s 12.732714176177979
time_total_s 437.9546332359314
episode_reward_mean -21.0

training_iteration 34
timesteps this iter 852
timesteps_total 33823.0
time_this_iter_s 12.382392168045044
time_total_s 450.33702540397644
episode_reward_mean -21.0
ericl commented 6 years ago

I know the problem: it's only reporting timesteps for completed episodes. Let me push a patch.
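
A toy illustration of that explanation (not the actual collect_metrics code): if only episodes that completed during the iteration are counted, then an iteration in which no episode finishes reports 0 timesteps and a nan mean reward, which is also what triggers numpy's "Mean of empty slice" warning in the log above.

import numpy as np

# Suppose no episode completed during this training iteration.
completed_episode_lengths = []
completed_episode_rewards = []

timesteps_this_iter = sum(completed_episode_lengths)        # 0
episode_reward_mean = np.mean(completed_episode_rewards)    # nan, with "RuntimeWarning: Mean of empty slice."

print(timesteps_this_iter, episode_reward_mean)             # 0 nan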

luochao1024 commented 6 years ago

It makes sense. Thanks @ericl

ericl commented 6 years ago

The patch is merged, thanks for reporting this.

luochao1024 commented 6 years ago

@ericl It seems like the problem is not fully solved. I pulled the newest version (7/12/2018) from GitHub and installed Ray from source. I ran the A3C method, and timesteps_this_iter is always 1000 no matter how I set sample_batch_size. Here is the command I ran: python3 train.py --run A3C --env=PongDeterministic-v4 --config='{"num_workers": 15, "sample_batch_size": 10}'

The result is:

TrainingResult for A3C_PongDeterministic-v4_0:
  date: 2018-07-12_23-21-18
  episode_len_mean: 821.25
  episode_reward_max: -21.0
  episode_reward_mean: -21.0
  episode_reward_min: -21.0
  episodes_total: 4
  experiment_id: 4389e6a66bee4e5ca770227864c196db
  hostname: t-16
  info:
    apply_time_ms: 11.814
    dispatch_time_ms: 36.985
    num_steps_sampled: 11000
    num_steps_trained: 11000
    wait_time_ms: 18.523
  node_ip: 10.128.0.9
  pid: 24934
  policy_reward_mean:
    default: -21.0
  time_this_iter_s: 6.12723183631897
  time_total_s: 77.87390542030334
  timestamp: 1531437678
  timesteps_this_iter: 1000
  timesteps_total: 11000
  training_iteration: 11

After I changed sample_batch_size to 50, I ran python3 train.py --run A3C --env=PongDeterministic-v4 --config='{"num_workers": 15, "sample_batch_size": 10}'

The result is:

TrainingResult for A3C_PongDeterministic-v4_0:
  date: 2018-07-12_23-24-33
  episode_len_mean: 948.7777777777778
  episode_reward_max: -21.0
  episode_reward_mean: -21.0
  episode_reward_min: -21.0
  episodes_total: 18
  experiment_id: 5e547a1cc95f4fbd9a03d85474378f28
  hostname: t-16
  info:
    apply_time_ms: 17.398
    dispatch_time_ms: 32.845
    num_steps_sampled: 7000
    num_steps_trained: 7000
    wait_time_ms: 46.135
  node_ip: 10.128.0.9
  pid: 26762
  policy_reward_mean:
    default: -21.0
  time_this_iter_s: 20.60361957550049
  time_total_s: 148.27957606315613
  timestamp: 1531437873
  timesteps_this_iter: 1000
  timesteps_total: 7000
  training_iteration: 7

Obviously, time_this_iter_s changed, but timesteps_this_iter is still 1000.
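
A quick cross-check using only numbers from the two results above: in both runs, num_steps_sampled divided by training_iteration matches the reported timesteps_this_iter, which suggests the metric is being derived from the optimizer's num_steps_sampled counter rather than reflecting the change in sample_batch_size.

# Numbers taken verbatim from the two TrainingResults above.
for num_steps_sampled, training_iteration in ((11000, 11), (7000, 7)):
    print(num_steps_sampled / training_iteration)   # 1000.0 both times, matching timesteps_this_iter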

ericl commented 6 years ago

Fixed in https://github.com/ray-project/ray/pull/2399/files
