I think this is due to the "auto-concat" code here, which dynamically batches up to 5 sample batches together: https://github.com/ray-project/ray/blob/master/python/ray/rllib/utils/sampler.py
If that is disabled, then you should see 8000. IMO we should remove this behavior; I doubt it has advantages over just increasing the batch size. @richardliaw
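For intuition, here is a minimal sketch of the pattern being described (illustrative names such as get_batch, concat_fn, and MAX_CONCAT are not the exact RLlib code): the consumer drains up to five pending sample batches from the sampler's queue and concatenates them before computing a gradient, so the effective batch size varies from step to step.

import queue

MAX_CONCAT = 5  # merge at most 5 pending sample batches per pull

def get_batch(sample_queue, concat_fn):
    # Block until at least one rollout batch is available.
    batches = [sample_queue.get()]
    # Opportunistically drain more batches without blocking.
    while len(batches) < MAX_CONCAT:
        try:
            batches.append(sample_queue.get_nowait())
        except queue.Empty:
            break
    # One merged batch means a variable number of timesteps per step.
    return concat_fn(batches)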
I've definitely run experiments in the past where I turned off this auto-concat with varying batch sizes and saw degraded performance across all sizes ...
I see. Thank you
@richardliaw you said that you saw degraded performance after turning off the auto-concat. Do you have any intuition about why it happens? Thanks
@ericl @richardliaw I am trying to turn off the auto-concat you mentioned. Would you mind providing some hints about how to turn it off? I want the same number of timesteps in each step so that the gradient noise stays the same, which would make comparisons easier for research. Thanks
The fastest way is to change self.queue = queue.Queue(5) to self.queue = queue.Queue(1) in https://github.com/ray-project/ray/blob/8e687cbc9838fce0b1d8190c7851b7d573562487/python/ray/rllib/evaluation/sampler.py#L79
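Spelled out (presumably in the AsyncSampler constructor; the line reference is from the commit linked above):

# python/ray/rllib/evaluation/sampler.py
# Before: up to 5 rollout batches can sit in the queue and get concatenated.
self.queue = queue.Queue(5)
# After: at most one batch is pending, so batches cannot accumulate.
self.queue = queue.Queue(1)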
@richardliaw what about the SyncSampler? It seems like the SyncSampler also has the auto-concat, according to its comment.
@richardliaw I tried the trick you suggested, changing self.queue = queue.Queue(5) to self.queue = queue.Queue(1). The command I run is python3 train.py --env=Pong-ram-v4 --run=A3C --config='{"num_workers": 15}'. Before the change, timesteps_this_iter is around 5000-7000. After the change, it is around 1000-2000, and sometimes it is 0.0. How should I understand timesteps_this_iter being 0.0? Is it possible to get exactly the same timesteps_this_iter every iteration?
TrainingResult for A3C_Pong-ram-v4_0:
date: 2018-07-05_03-57-48
episode_len_mean: null
episode_reward_max: null
episode_reward_mean: null
episode_reward_min: null
episodes_total: 0
experiment_id: 42fde5830bb2405598d68d6b2e7203ac
hostname: t-16
info:
  apply_time_ms: 0.869
  dispatch_time_ms: 2.401
  num_steps_sampled: 353000
  num_steps_trained: 353000
  wait_time_ms: 0.311
node_ip: 10.128.0.9
pid: 24848
policy_reward_mean: {}
time_this_iter_s: 0.3902287483215332
time_total_s: 152.18120670318604
timestamp: 1530763068
timesteps_this_iter: 0.0
timesteps_total: 343488.0
training_iteration: 353
Hm, I would keep the Queue but just remove the while loop here: https://github.com/ray-project/ray/blob/8e687cbc9838fce0b1d8190c7851b7d573562487/python/ray/rllib/evaluation/sampler.py#L124
You could alternatively set "sample_async": False in the A3C config, which will force synchronous sampling: https://github.com/ray-project/ray/blob/8aa56c12e60ba77325f6b3817ee5f0d8e1ed1a16/python/ray/rllib/agents/a3c/a3c.py#L35
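For example, via the same train.py CLI used earlier in this thread (note the lowercase false, since the --config value is JSON):

python3 train.py --env=Pong-ram-v4 --run=A3C --config='{"num_workers": 15, "sample_async": false}'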
@ericl @richardliaw The comment on SyncSampler here https://github.com/ray-project/ray/blob/8e687cbc9838fce0b1d8190c7851b7d573562487/python/ray/rllib/evaluation/sampler.py#L27 also says that "Batches can accumulate and the gradient can be calculated on up to 5 batches." Is that comment there by mistake?
Ah, I'm pretty sure the SyncSampler comment is there by accident.
@ericl @richardliaw I set "sample_async": False, and the script I use is:
import ray
import ray.rllib.agents.a3c.a3c as a3c

ray.init()
config = {"num_workers": 16, "sample_async": False}
agent = a3c.A3CAgent(config=config, env="PongDeterministic-v4")
for i in range(100):
    result = agent.train()
    print("training_iteration", result.training_iteration)
    print("timesteps this iter", result.timesteps_this_iter)
    print("timesteps_total", result.timesteps_total)
    print("time_this_iter_s", result.time_this_iter_s)
    print("time_total_s", result.time_total_s)
    print("episode_reward_mean", result.episode_reward_mean)
    print()
I also added two print statements to the _train function in a3c.py, as follows:
def _train(self):
    print("num_workers is", self.config["num_workers"])
    print("async_sample", self.config["sample_async"])
    self.optimizer.step()
    FilterManager.synchronize(
        self.local_evaluator.filters, self.remote_evaluators)
    result = collect_metrics(self.local_evaluator, self.remote_evaluators)
    result = result._replace(info=self.optimizer.stats())
    return result
The resulting timesteps_this_iter is still weird: it differs from iteration to iteration, and sometimes it is 0.0. Would you mind providing a script for which timesteps_this_iter is exactly the same every iteration?
training_iteration 12
timesteps this iter 838
timesteps_total 838.0
time_this_iter_s 2.8449437618255615
time_total_s 43.85523295402527
episode_reward_mean -20.0
num_workers is 16
async_sample False
training_iteration 13
timesteps this iter 1784
timesteps_total 2622.0
time_this_iter_s 2.8498880863189697
time_total_s 46.70512104034424
episode_reward_mean -21.0
num_workers is 16
async_sample False
training_iteration 14
timesteps this iter 4925
timesteps_total 7547.0
time_this_iter_s 2.831050157546997
time_total_s 49.536171197891235
episode_reward_mean -20.833333333333332
num_workers is 16
async_sample False
training_iteration 15
timesteps this iter 5192
timesteps_total 12739.0
time_this_iter_s 2.861128568649292
time_total_s 52.39729976654053
episode_reward_mean -20.333333333333332
num_workers is 16
async_sample False
training_iteration 16
timesteps this iter 0.0
timesteps_total 12739.0
time_this_iter_s 2.8732974529266357
time_total_s 55.27059721946716
episode_reward_mean nan
num_workers is 16
async_sample False
training_iteration 17
timesteps this iter 966
timesteps_total 13705.0
time_this_iter_s 2.8986072540283203
time_total_s 58.16920447349548
episode_reward_mean -20.0
@ericl @richardliaw I really can't understand why timesteps_this_iter is sometimes 0.0. Does it mean there is no learning during that iteration, or is it just a bug?
Sorry for the slow reply - that seems odd, and perhaps it's just a bug in the bookkeeping/tracking.
Maybe consider setting num_workers to 1 to help debug with an interactive debugger? Let me know if you find something, I will have time to take a look at it tomorrow.
@richardliaw I set num_workers to 1 and timesteps_this_iter still changes from iteration to iteration. Sometimes it prints a warning, RuntimeWarning: Mean of empty slice., and then timesteps_this_iter becomes 0.0 and episode_reward_mean is nan. Here is the script to reproduce:
import ray
import ray.rllib.agents.a3c.a3c as a3c

ray.init()
config = {"num_workers": 1, "sample_async": False}
agent = a3c.A3CAgent(config=config, env="PongDeterministic-v4")
for i in range(100):
    result = agent.train()
    print("training_iteration", result.training_iteration)
    print("timesteps this iter", result.timesteps_this_iter)
    print("timesteps_total", result.timesteps_total)
    print("time_this_iter_s", result.time_this_iter_s)
    print("time_total_s", result.time_total_s)
    print("episode_reward_mean", result.episode_reward_mean)
    print()
Here is the result:
training_iteration 29
timesteps this iter 886
timesteps_total 28358
time_this_iter_s 13.195953130722046
time_total_s 385.9821493625641
episode_reward_mean -21.0
training_iteration 30
timesteps this iter 824
timesteps_total 29182
time_this_iter_s 13.376095294952393
time_total_s 399.3582446575165
episode_reward_mean -21.0
training_iteration 31
timesteps this iter 1763
timesteps_total 30945
time_this_iter_s 12.403135538101196
time_total_s 411.7613801956177
episode_reward_mean -20.0
/home/wangluochao93/.local/lib/python3.5/site-packages/numpy/core/fromnumeric.py:2957: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/home/wangluochao93/.local/lib/python3.5/site-packages/numpy/core/_methods.py:80: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
training_iteration 32
timesteps this iter 0.0
timesteps_total 30945.0
time_this_iter_s 13.460538864135742
time_total_s 425.2219190597534
episode_reward_mean nan
training_iteration 33
timesteps this iter 2026
timesteps_total 32971.0
time_this_iter_s 12.732714176177979
time_total_s 437.9546332359314
episode_reward_mean -21.0
training_iteration 34
timesteps this iter 852
timesteps_total 33823.0
time_this_iter_s 12.382392168045044
time_total_s 450.33702540397644
episode_reward_mean -21.0
I know the problem: it's only reporting timesteps for completed episodes. Let me push a patch.
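A hedged illustration of why that produces the values in the logs above (plain numpy behavior, not the exact collect_metrics code): if no episode completes during an iteration, the per-episode lists are empty.

import numpy as np

episode_rewards = []   # no episode finished during this iteration
episode_lengths = []

# np.mean of an empty array emits "RuntimeWarning: Mean of empty slice."
# and returns nan -- matching episode_reward_mean: nan in the logs.
print(np.mean(episode_rewards))        # nan

# np.sum of an empty array is 0.0 -- matching timesteps_this_iter: 0.0.
print(float(np.sum(episode_lengths)))  # 0.0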
That makes sense. Thanks @ericl
The patch is merged, thanks for reporting this.
@ericl It seems like the problem is not fully solved. I pulled the newest version (7/12/2018) from GitHub and installed Ray from source. I run the A3C method, and timesteps_this_iter is always 1000 no matter how I set sample_batch_size. Here is the command I run:
python3 train.py --run A3C --env=PongDeterministic-v4 --config='{"num_workers": 15, "sample_batch_size": 10}'
The result is:
TrainingResult for A3C_PongDeterministic-v4_0:
date: 2018-07-12_23-21-18
episode_len_mean: 821.25
episode_reward_max: -21.0
episode_reward_mean: -21.0
episode_reward_min: -21.0
episodes_total: 4
experiment_id: 4389e6a66bee4e5ca770227864c196db
hostname: t-16
info:
  apply_time_ms: 11.814
  dispatch_time_ms: 36.985
  num_steps_sampled: 11000
  num_steps_trained: 11000
  wait_time_ms: 18.523
node_ip: 10.128.0.9
pid: 24934
policy_reward_mean:
  default: -21.0
time_this_iter_s: 6.12723183631897
time_total_s: 77.87390542030334
timestamp: 1531437678
timesteps_this_iter: 1000
timesteps_total: 11000
training_iteration: 11
After I change the sample_batch_size to 50:
python3 train.py --run A3C --env=PongDeterministic-v4 --config='{"num_workers": 15, "sample_batch_size": 10}'
The result is:
TrainingResult for A3C_PongDeterministic-v4_0:
date: 2018-07-12_23-24-33
episode_len_mean: 948.7777777777778
episode_reward_max: -21.0
episode_reward_mean: -21.0
episode_reward_min: -21.0
episodes_total: 18
experiment_id: 5e547a1cc95f4fbd9a03d85474378f28
hostname: t-16
info:
  apply_time_ms: 17.398
  dispatch_time_ms: 32.845
  num_steps_sampled: 7000
  num_steps_trained: 7000
  wait_time_ms: 46.135
node_ip: 10.128.0.9
pid: 26762
policy_reward_mean:
  default: -21.0
time_this_iter_s: 20.60361957550049
time_total_s: 148.27957606315613
timestamp: 1531437873
timesteps_this_iter: 1000
timesteps_total: 7000
training_iteration: 7
Obviously, time_this_iter_s changed, but timesteps_this_iter is still 1000.
Fixed in https://github.com/ray-project/ray/pull/2399/files
System information
Describe the problem
I run the A3C example and set num_workers = 16, batch_size = 40, grads_per_step = 200. I thought that on average timesteps_this_iter should be about 200*40 = 8000. However, based on the TrainingResult printed, the average timesteps_this_iter is 1941831/82 ≈ 23600. I am quite confused about how timesteps_this_iter is computed, even though I have read the collect_metrics function. Would someone mind providing some intuition about it?
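For reference, the arithmetic behind this (a rough check using the numbers above):

# Expected: one sample batch per gradient step.
expected = 200 * 40         # grads_per_step * batch_size = 8000

# Observed average over the printed TrainingResults:
observed = 1941831 / 82     # ~= 23680 timesteps per iteration

# The ratio is ~3 batches per gradient step, within the "up to 5"
# range explained by the auto-concat behavior discussed above.
print(observed / expected)  # ~= 2.96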
Source code / logs