It could be that at 32 processes you're running into contention between sibling hyperthreads on the same core, since that machine probably only has 16 physical cores, or into CPU time contention between the driver and worker processes. So I wouldn't necessarily expect a gain at 32.
It's also possible the driver is starting to become a bottleneck, though your workers still seem busy, so maybe it's not. Cc @richardliaw, who has worked with A3C a lot more.
One way to confirm this is to try a 64-core machine.
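A quick way to check whether the VM's vCPUs are hyperthread siblings rather than physical cores (a minimal sketch; it assumes the `psutil` package is installed):

```python
# Sketch: compare logical vs. physical core counts to spot hyperthreading.
import os
import psutil  # pip install psutil

logical = os.cpu_count()                    # vCPUs visible to the OS
physical = psutil.cpu_count(logical=False)  # physical cores

print(f"logical CPUs: {logical}, physical cores: {physical}")
if physical is not None and logical > physical:
    print(f"vCPUs share cores; don't expect linear scaling past ~{physical} workers")
```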
@ericl You are right. I found out this is a VM with 32 vCPUs that only has 16 physical cores. When I use a machine with 32 cores and 64 vCPUs and set `num_workers = 32`, `time_this_iter_s` is around 5.6, so A3C with 32 workers in this setting is only slightly faster than with 16 workers.
After I increased `sample_batch_size` from 20 to 200, I saw an almost linear speedup from 8 to 32 cores, so `sample_batch_size` also affects efficiency.
Cool. Another config to play with is num_envs_per_worker btw.
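For reference, a minimal sketch of where those knobs go in a Tune experiment spec (the config keys `num_workers`, `sample_batch_size`, and `num_envs_per_worker` are the RLlib options discussed above; the experiment name and values are placeholders, not recommendations):

```python
import ray
from ray.tune import run_experiments

ray.init(num_cpus=32)

run_experiments({
    "a3c_pong": {                          # placeholder experiment name
        "run": "A3C",
        "env": "PongDeterministic-v4",
        "config": {
            "num_workers": 31,
            "sample_batch_size": 200,      # larger batches amortize per-sample overhead
            "num_envs_per_worker": 2,      # batch inference across envs in each worker
        },
    },
})
```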
@ericl I increased `num_workers` to 64 and even 80, and it seems the A3C method can't converge with that many workers. Have you ever seen the same result? I think it is because of the staleness of the gradients; I just want to confirm that.
Yeah, I don't think A3C scales to that many workers. Cc @richardliaw
You may need to reduce the gradient clipping threshold and the step size, and consider using V-trace or some other off-policy correction.
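As a rough sketch of that advice (the numbers below are placeholders, not tuned values; `grad_clip` and `lr` are the RLlib A3C config keys, and V-trace is the correction used by the IMPALA trainer):

```python
# Sketch only: tame stale asynchronous gradients at high worker counts.
config = {
    "num_workers": 64,
    "grad_clip": 20.0,        # tighter gradient clipping; placeholder value
    "lr": 0.00005,            # smaller step size; placeholder value
    "sample_batch_size": 200,
}

# For an off-policy correction, V-trace is available via the IMPALA trainer
# ("run": "IMPALA" instead of "A3C" in the experiment spec), which tolerates
# much higher worker counts than A3C.
```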
Auto-closing stale issue.
System information
Describe the problem
I'm trying to increase `num_workers` to make the A3C algorithm process samples faster.

When I have 8 CPUs available and set `num_workers = 7`, I see `time_this_iter_s = 8.141056299209595`. Here is the command I use, the result printed, and the CPU usage from the `top` command:

When I have 16 CPUs available and set `num_workers = 15`, I see `time_this_iter_s = 5.766817331314087`, which is less than 8.141056299209595. This means increasing `num_workers` increases the speed. Here is the command I use, the result printed, and the CPU usage:
The problem is that when I have 32 CPUs and 31 workers, I see `time_this_iter_s = 9.538779020309448`. This is weird; I expected `time_this_iter_s < 5.7`. Here is the command I use and the result printed:

```
TrainingResult for A3C_PongDeterministic-v4_0:
  date: 2018-07-17_03-08-34
  episode_len_mean: 830.6470588235294
  episode_reward_max: -20.0
  episode_reward_mean: -20.176470588235293
  episode_reward_min: -21.0
  episodes_total: 17
  experiment_id: 65e2aea265b1457cb94e75028fa5a503
  hostname: cpu-32
  info:
    apply_time_ms: 15.905
    dispatch_time_ms: 49.127
    num_steps_sampled: 13000
    num_steps_trained: 13000
    wait_time_ms: 22.446
  node_ip: 10.128.0.5
  pid: 14874
  policy_reward_mean:
    default: -20.176470588235293
  time_this_iter_s: 9.538779020309448
  time_total_s: 139.73909282684326
  timestamp: 1531796914
  timesteps_this_iter: 1000
  timesteps_total: 13000
  training_iteration: 13
```
```
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 32/32 CPUs, 0/0 GPUs
Result logdir: /home/wangluochao93/ray_results/default
RUNNING trials:
```
Here is the CPU usage:

```
top - 03:10:11 up 37 min, 2 users, load average: 22.74, 17.59, 13.88
Tasks: 377 total, 3 running, 374 sleeping, 0 stopped, 0 zombie
%Cpu0  : 89.3 us,  4.0 sy, 0.0 ni,  6.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1  : 87.8 us,  4.6 sy, 0.0 ni,  7.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2  : 77.9 us,  5.5 sy, 0.0 ni, 15.0 id, 0.0 wa, 0.0 hi, 1.6 si, 0.0 st
%Cpu3  : 61.9 us, 10.6 sy, 0.0 ni, 27.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4  : 81.1 us,  4.7 sy, 0.0 ni, 14.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5  : 78.1 us,  3.0 sy, 0.0 ni, 18.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6  : 74.8 us,  2.3 sy, 0.0 ni, 22.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7  : 54.8 us,  6.9 sy, 0.0 ni, 38.0 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu8  : 88.0 us,  3.3 sy, 0.0 ni,  8.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu9  : 67.9 us,  3.6 sy, 0.0 ni, 28.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu10 : 70.8 us,  5.3 sy, 0.0 ni, 23.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu11 : 80.1 us,  3.0 sy, 0.0 ni, 16.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu12 : 75.6 us,  3.3 sy, 0.0 ni, 21.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu13 : 77.9 us,  4.3 sy, 0.0 ni, 17.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu14 : 74.3 us,  7.1 sy, 0.0 ni, 18.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu15 : 89.0 us,  3.3 sy, 0.0 ni,  7.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu16 : 58.3 us,  6.3 sy, 0.0 ni, 35.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu17 : 61.8 us,  6.3 sy, 0.0 ni, 31.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu18 : 64.1 us,  4.7 sy, 0.0 ni, 30.9 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu19 : 91.7 us,  3.6 sy, 0.0 ni,  4.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu20 : 84.8 us,  3.0 sy, 0.0 ni, 12.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu21 : 81.1 us,  2.0 sy, 0.0 ni, 16.6 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu22 : 77.2 us,  5.0 sy, 0.0 ni, 17.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu23 : 80.5 us,  5.3 sy, 0.0 ni, 13.9 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu24 : 61.9 us,  5.2 sy, 0.0 ni, 32.6 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu25 : 96.3 us,  1.7 sy, 0.0 ni,  2.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu26 : 76.8 us,  4.0 sy, 0.0 ni, 19.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu27 : 79.0 us,  6.0 sy, 0.0 ni, 15.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu28 : 81.7 us,  7.0 sy, 0.0 ni, 11.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu29 : 85.1 us,  5.3 sy, 0.0 ni,  9.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu30 : 82.7 us,  3.3 sy, 0.0 ni, 14.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu31 : 81.1 us,  3.0 sy, 0.0 ni, 15.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 12376860+total, 82327936 free, 7073908 used, 34366760 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 81890888
```
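For what it's worth, the reported numbers can be turned into throughput figures directly from the `TrainingResult` fields above (a quick sketch; it assumes the earlier runs also used 1000-timestep iterations, as the 31-worker run did):

```python
# Env steps per second implied by time_this_iter_s, using the values reported above.
timesteps_this_iter = 1000

for workers, time_this_iter_s in [(7, 8.141056299209595),
                                  (15, 5.766817331314087),
                                  (31, 9.538779020309448)]:
    print(f"{workers} workers: ~{timesteps_this_iter / time_this_iter_s:.0f} steps/s")

# 7 workers  -> ~123 steps/s
# 15 workers -> ~173 steps/s
# 31 workers -> ~105 steps/s  (slower than 15 workers, which is the reported problem)
```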