ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

CPU Cores are not fully utilized. #3433

mynkpl1998 closed 4 years ago

mynkpl1998 commented 5 years ago

System information

Describe the problem

I am using the RLlib A3C algorithm to train an agent with 12 workers. My machine is a 32-core machine (12 physical cores). However, when I checked CPU usage with htop, none of the cores was fully utilized. Most of the processes are ray_PolicyEvaluator processes, which are in the sleep state most of the time.

Source code / logs

Here is the algorithm configuration file I used.

local-view:
    env: tsim-v0
    run: A3C
    checkpoint_freq: 500
    config:
        num_workers: 12
        sample_batch_size: 1
        use_pytorch: false
        vf_loss_coeff: 0.5
        entropy_coeff: -0.01
        gamma: 0.99
        grad_clip: 40.0
        lambda: 1.0
        horizon: 2000
        lr: 0.0001
        observation_filter: NoFilter
        callbacks:
            on_episode_end: None
        model:
            use_lstm: true

Here is the screenshot showing CPU usage.

screenshot

ericl commented 5 years ago

I think this just means that the bottleneck is applying gradients. If you want to increase the compute on the workers, you can increase the sample batch size, which will also improve learning up to a point.
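To make the reasoning above concrete, here is a rough toy model rather than a measurement of RLlib. It assumes each worker computes for a time proportional to the batch size and then waits for a single driver that applies gradients one at a time. The function name and the `step_time` and `apply_time` values are hypothetical and chosen only for illustration.

```python
def worker_utilization(batch_size, num_workers, step_time=0.001, apply_time=0.005):
    """Toy estimate of the fraction of time a worker spends computing.

    step_time and apply_time are made-up values in seconds, not measurements.
    """
    # Time one worker spends sampling and computing a gradient for one batch.
    compute = batch_size * step_time
    # A worker's round can be no shorter than its own compute plus one apply,
    # and no shorter than the time the driver needs to serve every worker once.
    cycle = max(compute + apply_time, num_workers * apply_time)
    return compute / cycle


if __name__ == "__main__":
    for b in (1, 16, 256):
        print(f"sample_batch_size={b}: utilization ~ {worker_utilization(b, 12):.3f}")
```

Under these assumptions, a batch size of 1 leaves the workers idle almost all the time because the driver is saturated, while a larger batch shifts the time toward worker compute.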

mynkpl1998 commented 5 years ago

Thanks, @ericl. I increased the batch size to 256 and CPU utilization of each core went up.
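For reference, the change amounts to overriding one key in the experiment config. The sketch below uses a plain Python dict that mirrors a subset of the config posted earlier; the nesting is assumed from the usual RLlib experiment layout, and `with_batch_size` is a hypothetical helper, not part of Ray.

```python
import copy

# Subset of the experiment posted earlier in the thread (nesting is assumed).
experiment = {
    "local-view": {
        "env": "tsim-v0",
        "run": "A3C",
        "config": {
            "num_workers": 12,
            "sample_batch_size": 1,
        },
    }
}


def with_batch_size(spec, name, batch_size):
    """Return a copy of the experiment spec with a new sample_batch_size."""
    updated = copy.deepcopy(spec)
    updated[name]["config"]["sample_batch_size"] = batch_size
    return updated


tuned = with_batch_size(experiment, "local-view", 256)
print(tuned["local-view"]["config"])
```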