ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

In A3C, the speed of processing samples does not necessarily increase when increasing num_workers and available CPUs #2411

Closed: luochao1024 closed this issue 4 years ago

luochao1024 commented 6 years ago

System information

Describe the problem

I am trying to increase num_workers to make the A3C algorithm process samples faster. With 8 CPUs available and num_workers = 7, I saw time_this_iter_s = 8.141056299209595. Here is the command I used:

taskset -c 0-7 python3 train.py --run A3C --env=PongDeterministic-v4 --config='{"num_workers": 7, "sample_batch_size": 20}'

Here is the printed result:

TrainingResult for A3C_PongDeterministic-v4_0:
  date: 2018-07-17_03-17-25
  episode_len_mean: 955.3333333333334
  episode_reward_max: -18.0
  episode_reward_mean: -20.0
  episode_reward_min: -21.0
  episodes_total: 6
  experiment_id: 42648b4933ab416488accf930c055759
  hostname: cpu-32
  info:
    apply_time_ms: 13.457
    dispatch_time_ms: 19.801
    num_steps_sampled: 10000
    num_steps_trained: 10000
    wait_time_ms: 52.587
  node_ip: 10.128.0.5
  pid: 21411
  policy_reward_mean:
    default: -20.0
  time_this_iter_s: 8.141056299209595
  time_total_s: 102.94127798080444
  timestamp: 1531797445
  timesteps_this_iter: 1000
  timesteps_total: 10000
  training_iteration: 10

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 8/32 CPUs, 0/0 GPUs
Result logdir: /home/wangluochao93/ray_results/default
RUNNING trials:
 - A3C_PongDeterministic-v4_0:  RUNNING [pid=21411], 102 s, 10000 ts, -20 rew

Here is the CPU usage from the top command:

top - 03:16:31 up 44 min,  2 users,  load average: 12.98, 15.85, 14.67
Tasks: 377 total,   1 running, 376 sleeping,   0 stopped,   0 zombie
%Cpu0  : 83.7 us,  9.6 sy,  0.0 ni,  6.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 79.8 us,  6.0 sy,  0.0 ni, 12.9 id,  0.0 wa,  0.0 hi,  1.3 si,  0.0 st
%Cpu2  : 81.2 us,  8.9 sy,  0.0 ni,  9.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 83.8 us,  5.9 sy,  0.0 ni,  9.9 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu4  : 80.9 us,  6.7 sy,  0.0 ni, 11.7 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu5  : 68.4 us, 10.3 sy,  0.0 ni, 18.1 id,  0.0 wa,  0.0 hi,  3.2 si,  0.0 st
%Cpu6  : 80.7 us, 11.1 sy,  0.0 ni,  8.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  : 81.9 us, 10.0 sy,  0.0 ni,  8.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  :  0.0 us,  0.0 sy,  0.0 ni, 98.0 id,  0.0 wa,  0.0 hi,  2.0 si,  0.0 st
%Cpu9  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu12 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu13 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu14 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu15 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu16 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu17 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu18 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu19 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu20 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu21 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu22 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu23 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu24 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu25 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu26 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu27 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu28 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu29 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu30 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu31 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 12376860+total, 11223540+free,  5318656 used,  6214532 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 11176814+

With 16 CPUs available and num_workers = 15, I saw time_this_iter_s = 5.766817331314087, which is less than 8.141056299209595, so increasing num_workers does increase the speed here. Here is the command I used:

taskset -c 0-15 python3 train.py --run A3C --env=PongDeterministic-v4 --config='{"num_workers": 15, "sample_batch_size": 20}'

Here is the printed result:

TrainingResult for A3C_PongDeterministic-v4_0:
  date: 2018-07-17_03-12-40
  episode_len_mean: 926.6363636363636
  episode_reward_max: -20.0
  episode_reward_mean: -20.545454545454547
  episode_reward_min: -21.0
  episodes_total: 11
  experiment_id: 2597b35ee4e744c2a8e9b4fd2549b42c
  hostname: cpu-32
  info:
    apply_time_ms: 14.291
    dispatch_time_ms: 23.486
    num_steps_sampled: 4000
    num_steps_trained: 4000
    wait_time_ms: 29.482
  node_ip: 10.128.0.5
  pid: 18777
  policy_reward_mean:
    default: -20.545454545454547
  time_this_iter_s: 5.766817331314087
  time_total_s: 33.41109848022461
  timestamp: 1531797160
  timesteps_this_iter: 1000
  timesteps_total: 4000
  training_iteration: 4

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 16/32 CPUs, 0/0 GPUs
Result logdir: /home/wangluochao93/ray_results/default
RUNNING trials:
 - A3C_PongDeterministic-v4_0:  RUNNING [pid=18777], 33 s, 4000 ts, -20.5 rew

Here is the CPU usage:

top - 03:14:07 up 41 min,  2 users,  load average: 17.77, 18.39, 15.17
Tasks: 377 total,   3 running, 374 sleeping,   0 stopped,   0 zombie
%Cpu0  : 90.4 us,  5.9 sy,  0.0 ni,  3.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 91.3 us,  7.3 sy,  0.0 ni,  1.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 91.7 us,  7.0 sy,  0.0 ni,  1.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 91.9 us,  5.4 sy,  0.0 ni,  2.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  : 92.4 us,  6.3 sy,  0.0 ni,  1.0 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu5  : 92.4 us,  6.3 sy,  0.0 ni,  1.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  : 88.7 us,  8.0 sy,  0.0 ni,  3.0 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu7  : 87.1 us, 10.2 sy,  0.0 ni,  2.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  : 90.0 us,  8.0 sy,  0.0 ni,  2.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu9  : 90.7 us,  7.0 sy,  0.0 ni,  2.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 : 91.3 us,  6.0 sy,  0.0 ni,  2.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 : 92.0 us,  6.3 sy,  0.0 ni,  1.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu12 : 90.6 us,  7.4 sy,  0.0 ni,  2.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu13 : 90.1 us,  6.6 sy,  0.0 ni,  3.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu14 : 86.2 us, 10.4 sy,  0.0 ni,  3.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu15 : 79.0 us, 16.0 sy,  0.0 ni,  4.7 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu16 :  0.0 us,  0.0 sy,  0.0 ni, 98.0 id,  0.0 wa,  0.0 hi,  2.0 si,  0.0 st
%Cpu17 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu18 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu19 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu20 :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu21 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu22 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu23 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu24 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu25 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu26 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu27 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu28 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu29 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu30 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu31 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 12376860+total, 87142112 free,  5920932 used, 30705556 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 86701776 

The problem is that with 32 CPUs and 31 workers, I saw time_this_iter_s = 9.538779020309448. This is weird; I expected to see time_this_iter_s < 5.7. Here is the command I used:

taskset -c 0-31 python3 train.py --run A3C --env=PongDeterministic-v4 --config='{"num_workers": 31, "sample_batch_size": 20}'

TrainingResult for A3C_PongDeterministic-v4_0:
  date: 2018-07-17_03-08-34
  episode_len_mean: 830.6470588235294
  episode_reward_max: -20.0
  episode_reward_mean: -20.176470588235293
  episode_reward_min: -21.0
  episodes_total: 17
  experiment_id: 65e2aea265b1457cb94e75028fa5a503
  hostname: cpu-32
  info:
    apply_time_ms: 15.905
    dispatch_time_ms: 49.127
    num_steps_sampled: 13000
    num_steps_trained: 13000
    wait_time_ms: 22.446
  node_ip: 10.128.0.5
  pid: 14874
  policy_reward_mean:
    default: -20.176470588235293
  time_this_iter_s: 9.538779020309448
  time_total_s: 139.73909282684326
  timestamp: 1531796914
  timesteps_this_iter: 1000
  timesteps_total: 13000
  training_iteration: 13

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 32/32 CPUs, 0/0 GPUs
Result logdir: /home/wangluochao93/ray_results/default
RUNNING trials:
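
For comparison, the sampling throughput implied by these three runs follows directly from the reported timesteps_this_iter and time_this_iter_s; a small illustrative snippet using only the numbers printed above:

# Samples processed per second for each reported run:
# timesteps_this_iter / time_this_iter_s.
runs = {
    "8 CPUs, 7 workers":   (1000, 8.141056299209595),
    "16 CPUs, 15 workers": (1000, 5.766817331314087),
    "32 CPUs, 31 workers": (1000, 9.538779020309448),
}
for name, (timesteps, seconds) in runs.items():
    print(f"{name}: {timesteps / seconds:.0f} samples/s")
# Roughly 123, 173, and 105 samples/s, so throughput actually drops at 31 workers.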

ericl commented 6 years ago

It could be that at 32 processes you're running into contention between sibling hyperthreads on the same core, since that machine probably only has 16 physical cores, or CPU-time contention between the driver and worker processes. So I wouldn't necessarily expect a gain at 32 workers.

It's also possible the driver is starting to become a bottleneck, though your workers still seem busy, so maybe it's not. Cc @richardliaw, who has worked with A3C a lot more.

ericl commented 6 years ago

A way to confirm this is to try using a 64-core machine.
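
Another quick local check is to compare logical and physical core counts on the machine; a minimal sketch, assuming the psutil package is installed:

# Compare logical CPUs (what taskset and top index) with physical cores.
import psutil

print("logical CPUs:  ", psutil.cpu_count(logical=True))   # e.g. 32 on this VM
print("physical cores:", psutil.cpu_count(logical=False))  # e.g. 16 if hyperthreaded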

luochao1024 commented 6 years ago

@ericl You are right. I found out this is a VM with 32 vCPUs that only has 16 physical cores. When I use a machine with 32 physical cores and 64 vCPUs and set num_workers = 32, time_this_iter_s is around 5.6, so A3C in this setting with 32 workers is only a little faster than with 16 workers.

After I increased sample_batch_size from 20 to 200, I saw an almost linear speedup from 8 to 32 cores, so sample_batch_size also affects the efficiency.

ericl commented 6 years ago

Cool. By the way, another config to play with is num_envs_per_worker.
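
A hedged sketch of how that might look combined with the settings discussed in this thread (the values are illustrative, not tuned recommendations), serialized the way train.py's --config flag expects:

import json

# Illustrative RLlib config; num_envs_per_worker runs several envs inside
# each rollout worker (vectorized sampling).
config = {
    "num_workers": 15,          # one rollout worker per reserved CPU, minus the driver
    "sample_batch_size": 200,   # larger batches scaled almost linearly above
    "num_envs_per_worker": 4,   # illustrative value
}
print(json.dumps(config))       # paste the output into --config='...'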

luochao1024 commented 6 years ago

@ericl I increased num_workers to 64 and even 80. It seems the A3C method can't converge with this many workers. Have you ever seen the same result? I think it is because of the staleness of the gradients; I just want to confirm that.

ericl commented 6 years ago

Yeah, I don't think A3C scales to that many workers. Cc @richardliaw

richardliaw commented 6 years ago

You may need to reduce the gradient clipping size and the step size, and consider using V-trace or some other off-policy correction.
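
A hedged sketch of what those adjustments could look like in the --config dict; grad_clip and lr are the usual RLlib keys for gradient clipping and step size, and the values below are illustrative starting points for a sweep, not tuned recommendations:

import json

# Illustrative many-worker A3C config; values are starting points, not tuned.
config = {
    "num_workers": 64,
    "sample_batch_size": 200,
    "grad_clip": 10.0,    # clip gradients more aggressively than the default
    "lr": 0.00005,        # smaller step size to tolerate staler gradients
}
print(json.dumps(config))

# For an off-policy correction such as V-trace, RLlib's IMPALA trainer
# (--run IMPALA) is the usual starting point.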

ericl commented 4 years ago

Auto-closing stale issue.